# Apps examples with GenerationExecutor / LLM API

## OpenAI API

The `trtllm-serve` command launches an OpenAI-compatible server that supports the `v1/version`, `v1/completions`, and `v1/chat/completions` endpoints. [openai_client.py](./openai_client.py) is a simple example that uses the OpenAI client to query your model. To start the server, run

```
trtllm-serve <model>
```

Then you can query the APIs by running our example client or with `curl`.
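
Since the server also exposes `v1/version`, a quick way to confirm it is up is a minimal Python probe (a sketch assuming the default `localhost:8000` address used throughout these examples):

```python
import requests

# Simple liveness check; we just print whatever version payload the
# server returns rather than assuming its exact shape.
resp = requests.get("http://localhost:8000/v1/version")
resp.raise_for_status()
print(resp.json())
```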

### v1/completions

Query with `curl`:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": <model_name>,
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
```

Query with our example client:

```
python3 ./openai_client.py --prompt "Where is New York?" --api completions
```
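
If you would rather issue the request from your own Python code than use the example script, it maps onto the official `openai` package roughly as follows (a sketch; the base URL, dummy API key, and placeholder model name are assumptions for illustration):

```python
from openai import OpenAI

# Point the official OpenAI client at the local trtllm-serve instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.completions.create(
    model="<model_name>",  # placeholder: substitute your served model
    prompt="Where is New York?",
    max_tokens=16,
    temperature=0,
)
print(resp.choices[0].text)
```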

### v1/chat/completions

Query with `curl`:

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": <model_name>,
        "messages": [{"role": "system", "content": "You are a helpful assistant."},
                     {"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'
```

Query with our example client:

```
python3 ./openai_client.py --prompt "Where is New York?" --api chat
```
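
The chat endpoint can be driven the same way with the `openai` package (again a sketch, with the same assumed base URL and a placeholder model name):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="<model_name>",  # placeholder: substitute your served model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York?"},
    ],
    max_tokens=16,
    temperature=0,
)
print(resp.choices[0].message.content)
```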

## Python chat

[chat.py](./chat.py) provides a small example to play around with your model. Before running it, install the additional requirements with `pip install -r ./requirements.txt`. Then you can run it.
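
For a sense of what such a script looks like, here is a minimal interactive loop built on the LLM API (a sketch in the spirit of chat.py, not its actual implementation; `<model>` is a placeholder):

```python
from tensorrt_llm import LLM, SamplingParams

# Load the model once, then reuse it for every turn of the loop.
llm = LLM(model="<model>")  # placeholder: local path or HF model name
params = SamplingParams(max_tokens=64, temperature=0.8)

while True:
    prompt = input("> ")
    if prompt.strip() in ("quit", "exit"):
        break
    for output in llm.generate([prompt], params):
        print(output.outputs[0].text)
```
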
Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.

We also provide the CLI command `trtllm-serve` to launch a FastAPI server compatible with the OpenAI APIs. The client examples above show how to query such a server; you can check the example source code, or refer to the [command documentation](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html) and [examples](https://nvidia.github.io/TensorRT-LLM/examples/trtllm_serve_examples.html) for detailed information and usage guidelines.
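
Since the server speaks plain HTTP, any client library works. For instance, the chat request can be reproduced with `requests` (a sketch mirroring the `curl` payload above; the address and model name are placeholders):

```python
import requests

# Payload mirrors the v1/chat/completions curl example above.
payload = {
    "model": "<model_name>",  # placeholder: substitute your served model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York?"},
    ],
    "max_tokens": 16,
    "temperature": 0,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```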