
Commit 9c15abf

Refactor fastapi-serving and add one card serving (#11581)

* init fastapi-serving one card
* mv api code to source
* update worker
* update for style-check
* add worker
* update bash
* update
* update worker name and add readme
* rename update
* rename to fastapi
1 parent 373ccbb commit 9c15abf

File tree

19 files changed, +583 −367 lines


docker/llm/inference/xpu/docker/Dockerfile (+1 −1)
```diff
@@ -61,7 +61,7 @@ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRO
     cp -r ./ipex-llm/python/llm/example/GPU/vLLM-Serving/ ./vLLM-Serving && \
     # Download pp_serving
     mkdir -p /llm/pp_serving && \
-    cp ./ipex-llm/python/llm/example/GPU/Pipeline-Parallel-FastAPI/*.py /llm/pp_serving/ && \
+    cp ./ipex-llm/python/llm/example/GPU/Pipeline-Parallel-Serving/*.py /llm/pp_serving/ && \
     # Install related library of benchmarking
     pip install pandas omegaconf && \
     chmod +x /llm/benchmark.sh && \
```
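The hunk above only swaps the example directory name in the copy step. A minimal local sketch of that step, run against a mock checkout in a scratch directory (`serving.py` is a hypothetical placeholder file, not a name from the repository):

```shell
# Sketch of the updated Dockerfile copy step against a mock checkout.
# `serving.py` is a hypothetical placeholder; the real contents come
# from the ipex-llm repository.
cd "$(mktemp -d)"
mkdir -p ipex-llm/python/llm/example/GPU/Pipeline-Parallel-Serving
touch ipex-llm/python/llm/example/GPU/Pipeline-Parallel-Serving/serving.py
mkdir -p llm/pp_serving
cp ./ipex-llm/python/llm/example/GPU/Pipeline-Parallel-Serving/*.py llm/pp_serving/
ls llm/pp_serving
```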

python/llm/example/GPU/Pipeline-Parallel-FastAPI/pipeline_serving.py (−346)

This file was deleted.

python/llm/example/GPU/Pipeline-Parallel-FastAPI/README.md renamed to python/llm/example/GPU/Pipeline-Parallel-Serving/README.md (+10 −3)
````diff
@@ -50,7 +50,14 @@ pip install transformers==4.40.0
 pip install trl==0.8.1
 ```
 
-### 2. Run pipeline parallel serving on multiple GPUs
+### 2-1. Run ipex-llm serving on one GPU card
+
+```bash
+# Need to set NUM_GPUS=1 and MODEL_PATH in run.sh first
+bash run.sh
+```
+
+### 2-2. Run pipeline parallel serving on multiple GPUs
 
 ```bash
 # Need to set MODEL_PATH in run.sh first
````
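The one-card path reuses the same `run.sh` entry point with `NUM_GPUS=1`. A hedged sketch of editing those variables before launching — the variable names come from the README comments, but the mock `run.sh` contents, the model path, and the `sed` approach are illustrative assumptions:

```shell
# Mock run.sh holding the two variables the README says must be set first.
# File contents and the model path below are assumptions for illustration.
cd "$(mktemp -d)"
printf 'NUM_GPUS=2\nMODEL_PATH=/path/to/model\n' > run.sh
sed -i 's|^NUM_GPUS=.*|NUM_GPUS=1|' run.sh                    # one GPU card
sed -i 's|^MODEL_PATH=.*|MODEL_PATH=/llm/models/my-model|' run.sh
cat run.sh
```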
```diff
@@ -76,7 +83,7 @@ export http_proxy=
 export https_proxy=
 
 curl -X 'POST' \
-  'http://127.0.0.1:8000/generate/' \
+  'http://127.0.0.1:8000/generate' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
```
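The hunk cuts off inside the `-d '{ … }'` payload. A hedged sketch of a complete request to the corrected endpoint (no trailing slash): the `"prompt"` and `"n_predict"` field names are assumptions not confirmed by this diff, and the sketch prints the `curl` invocation instead of sending it, so it runs without a live server:

```shell
# Build the request body, validate it as JSON, and print the curl command.
# Field names in BODY are assumptions, not taken from this commit.
URL='http://127.0.0.1:8000/generate'   # note: no trailing slash after this commit
BODY='{"prompt": "What is AI?", "n_predict": 32}'
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"
printf "curl -X POST '%s' -H 'Content-Type: application/json' -d '%s'\n" "$URL" "$BODY"
```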
````diff
@@ -99,7 +106,7 @@ Please change the test url accordingly.
 
 ```bash
 # set t/c to the number of concurrencies to test full throughput.
-wrk -t1 -c1 -d5m -s ./wrk_script_1024.lua http://127.0.0.1:8000/generate/ --timeout 1m
+wrk -t1 -c1 -d5m -s ./wrk_script_1024.lua http://127.0.0.1:8000/generate --timeout 1m
 ```
 
 ## 5. Using the `benchmark.py` Script
````
