
Commit f25c7ce

LinPoly and kaiyux authored
doc: refactor trtllm-serve examples and doc (NVIDIA#3187)
Signed-off-by: Pengyun Lin <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>
Co-authored-by: Kaiyu Xie <[email protected]>
1 parent bb6c338 commit f25c7ce

18 files changed: +228 −213 lines changed

docs/source/commands/trtllm-serve.rst (+27 −20)

@@ -34,30 +34,37 @@ For the full syntax and argument descriptions, refer to :ref:`syntax`.
 Inference Endpoints
 -------------------
 
-After you start the server, you can send inference requests as shown in the following examples:
+After you start the server, you can send inference requests through the Completions API and the Chat API, which are compatible with the corresponding OpenAI APIs.
 
-.. code-block:: bash
+Chat API
+~~~~~~~~
 
-    curl http://localhost:8000/v1/completions \
-        -H "Content-Type: application/json" \
-        -d '{
-            "model": <model>,
-            "prompt": "Where is New York?",
-            "max_tokens": 16,
-            "temperature": 0
-        }'
+You can query the Chat API with any HTTP client; a typical example uses the OpenAI Python client:
 
-.. code-block:: bash
+.. literalinclude:: ../../../examples/serve/openai_chat_client.py
+    :language: python
+    :linenos:
+
+Another example uses ``curl``:
+
+.. literalinclude:: ../../../examples/serve/curl_chat_client.sh
+    :language: bash
+    :linenos:
+
+Completions API
+~~~~~~~~~~~~~~~
+
+You can query the Completions API with any HTTP client; a typical example uses the OpenAI Python client:
+
+.. literalinclude:: ../../../examples/serve/openai_completion_client.py
+    :language: python
+    :linenos:
+
+Another example uses ``curl``:
 
-    curl http://localhost:8000/v1/chat/completions \
-        -H "Content-Type: application/json" \
-        -d '{
-            "model": <model>,
-            "messages":[{"role": "system", "content": "You are a helpful assistant."},
-                        {"role": "user", "content": "Where is New York?"}],
-            "max_tokens": 16,
-            "temperature": 0
-        }'
+.. literalinclude:: ../../../examples/serve/curl_completion_client.sh
+    :language: bash
+    :linenos:
 
 Metrics Endpoint
 ----------------
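Since the inline snippets are now pulled in from standalone example files, here is a minimal sketch of what such a Chat API query looks like with the OpenAI Python client. It assumes a server started with `trtllm-serve TinyLlama-1.1B-Chat-v1.0` on the default port 8000; the actual openai_chat_client.py referenced above may differ in detail.

```python
# Sketch of an OpenAI-compatible chat query against trtllm-serve.
# Assumptions: server at localhost:8000, model TinyLlama-1.1B-Chat-v1.0,
# and that the server ignores the API key (placeholder below).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # placeholder; assumed to be ignored by the server
)

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York?"},
    ],
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].message.content)
```

The curl scripts referenced above send the same request bodies directly over HTTP.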

docs/source/llm-api-examples/customization.md renamed to docs/source/examples/customization.md (+1 −1)

@@ -1,4 +1,4 @@
-# Common Customizations
+# LLM Common Customizations
 
 ## Quantization

docs/source/llm-api-examples/llm_examples_index.template.rst_ renamed to docs/source/examples/llm_examples_index.template.rst_ (+1 −1)

@@ -1,4 +1,4 @@
-Examples
+%EXAMPLE_NAME%
 =================================
 
 .. toctree::

docs/source/helper.py (+75 −49)

@@ -1,5 +1,6 @@
 import logging
 import math
+from itertools import chain
 from pathlib import Path
 
 
@@ -20,69 +21,94 @@ def generate_title(filename: str) -> str:
 
 def generate_examples():
     root_dir = Path(__file__).parent.parent.parent.resolve()
+    ignore_list = {'__init__.py', 'quickstart_example.py'}
+    doc_dir = root_dir / "docs/source/examples"
 
-    # Source paths
-    script_dir = root_dir / "examples/llm-api"
-    # Look for both Python files and shell scripts
-    py_script_paths = sorted(
-        script_dir.glob("*.py"),
+    # Source paths for LLMAPI examples
+    llmapi_script_dir = root_dir / "examples/llm-api"
+    llmapi_script_paths = sorted(
+        llmapi_script_dir.glob("*.py"),
         # The autoPP example should be at the end since it is a preview example
         key=lambda x: math.inf if 'llm_auto_parallel' in x.stem else 0)
-
-    sh_script_paths = sorted(script_dir.glob("*.sh"))
-
-    # Combine both file types
-    script_paths = py_script_paths + sh_script_paths
-
-    ignore_list = {'__init__.py', 'quickstart_example.py'}
-    script_paths = [i for i in script_paths if i.name not in ignore_list]
-    # Destination paths
-    doc_dir = root_dir / "docs/source/llm-api-examples"
-    doc_paths = [doc_dir / f"{path.stem}.rst" for path in script_paths]
+    llmapi_script_paths += sorted(llmapi_script_dir.glob("*.sh"))
+
+    llmapi_script_paths = [
+        i for i in llmapi_script_paths if i.name not in ignore_list
+    ]
+    # Destination paths for LLMAPI examples
+    llmapi_doc_paths = [
+        doc_dir / f"{path.stem}.rst" for path in llmapi_script_paths
+    ]
+    llmapi_script_base_url = "https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llm-api"
+
+    # Path for trtllm-serve examples
+    serve_script_dir = root_dir / "examples/serve"
+    serve_script_paths = sorted(
+        chain(serve_script_dir.glob("*.py"), serve_script_dir.glob("*.sh")))
+    serve_script_paths = [
+        i for i in serve_script_paths if i.name not in ignore_list
+    ]
+    serve_doc_paths = [
+        doc_dir / f"{path.stem}.rst" for path in serve_script_paths
+    ]
+    serve_script_base_url = "https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/serve"
 
     # Generate the example docs for each example script
-    for script_path, doc_path in zip(script_paths, doc_paths):
-        if script_path.name in ignore_list:
-            logging.warning(f"Ignoring file: {script_path.name}")
-            continue
-        script_url = f"https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llm-api/{script_path.name}"
-
-        # Determine language based on file extension
-        language = "python" if script_path.suffix == ".py" else "bash"
-
-        # Make script_path relative to doc_path and call it include_path
-        include_path = '../../..' / script_path.relative_to(root_dir)
-
-        # For Python files, use generate_title to extract title from comments
-        # For shell scripts, use filename as title
-        if script_path.suffix == ".py":
-            title = generate_title(script_path)
-        else:
-            # Create a title from the filename (remove extension and replace underscores with spaces)
-            title_text = script_path.stem.replace('_', ' ').title()
-            title = underline(title_text)
-
-        content = (f"{title}\n\n"
-                   f"Source {script_url}.\n\n"
-                   f".. literalinclude:: {include_path}\n"
-                   f"    :language: {language}\n"
-                   "    :linenos:\n")
-        with open(doc_path, "w+") as f:
-            f.write(content)
-
-    # Generate the toctree for the example scripts
+    def write_script(base_url, script_paths, doc_paths, extra_content=""):
+        for script_path, doc_path in zip(script_paths, doc_paths):
+            if script_path.name in ignore_list:
+                logging.warning(f"Ignoring file: {script_path.name}")
+                continue
+            script_url = f"{base_url}/{script_path.name}"
+
+            # Determine language based on file extension
+            language = "python" if script_path.suffix == ".py" else "bash"
+
+            # Make script_path relative to doc_path and call it include_path
+            include_path = '../../..' / script_path.relative_to(root_dir)
+
+            # For Python files, use generate_title to extract title from comments
+            # For shell scripts, use filename as title
+            if script_path.suffix == ".py":
+                title = generate_title(script_path)
+            else:
+                # Create a title from the filename (remove extension and replace underscores with spaces)
+                title_text = script_path.stem.replace('_', ' ').title()
+                title = underline(title_text)
+
+            content = (f"{title}\n\n"
+                       f"{extra_content}"
+                       f"Source {script_url}.\n\n"
+                       f".. literalinclude:: {include_path}\n"
+                       f"    :language: {language}\n"
+                       "    :linenos:\n")
+            with open(doc_path, "w+") as f:
+                f.write(content)
+
+    # Generate the toctree for LLMAPI example scripts
+    write_script(llmapi_script_base_url, llmapi_script_paths, llmapi_doc_paths)
     with open(doc_dir / "llm_examples_index.template.rst_") as f:
         examples_index = f.read()
 
     with open(doc_dir / "llm_api_examples.rst", "w+") as f:
-        example_docs = "\n    ".join(path.stem for path in script_paths)
-        f.write(examples_index.replace(r"%EXAMPLE_DOCS%", example_docs))
+        example_docs = "\n    ".join(path.stem for path in llmapi_script_paths)
+        f.write(examples_index.replace(r"%EXAMPLE_DOCS%", example_docs)\
+            .replace(r"%EXAMPLE_NAME%", "LLM Examples"))
+
+    # Generate the toctree for trtllm-serve example scripts
+    trtllm_serve_content = "Refer to the `trtllm-serve documentation <https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html>`_ for starting a server.\n\n"
+    write_script(serve_script_base_url, serve_script_paths, serve_doc_paths,
+                 trtllm_serve_content)
+    with open(doc_dir / "trtllm_serve_examples.rst", "w+") as f:
+        example_docs = "\n    ".join(path.stem for path in serve_script_paths)
+        f.write(examples_index.replace(r"%EXAMPLE_DOCS%", example_docs)\
+            .replace(r"%EXAMPLE_NAME%", "Online Serving Examples"))
 
     with open(doc_dir / "index.rst") as f:
         examples_index = f.read()
 
     with open(doc_dir / "index.rst", "w+") as f:
-        example_docs = "\n    ".join(path.stem for path in script_paths)
+        example_docs = "\n    ".join(path.stem for path in llmapi_script_paths)
         f.write(examples_index.replace(r"%EXAMPLE_DOCS%", example_docs))
 
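To make the new `write_script` flow concrete, here is a hedged sketch of the RST it would emit for the new `curl_chat_client.sh` example, given the `trtllm_serve_content` preamble. The exact output depends on the repository's `underline` helper; this is an illustration, not verified output.

```python
# Illustration (not verified output): the RST write_script would produce for
# examples/serve/curl_chat_client.sh when passed the trtllm-serve preamble.
expected_rst = (
    "Curl Chat Client\n"
    "================\n\n"  # assumes underline() pads '=' to the title length
    "Refer to the `trtllm-serve documentation "
    "<https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html>`_ "
    "for starting a server.\n\n"
    "Source https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/serve/curl_chat_client.sh.\n\n"
    ".. literalinclude:: ../../../examples/serve/curl_chat_client.sh\n"
    "    :language: bash\n"
    "    :linenos:\n"
)
print(expected_rst)
```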

docs/source/index.rst (+5 −4)

@@ -41,12 +41,13 @@ Welcome to TensorRT-LLM's Documentation!
 
 .. toctree::
     :maxdepth: 2
-    :caption: LLM API Examples
+    :caption: Examples
     :hidden:
 
-    llm-api-examples/index.rst
-    llm-api-examples/customization.md
-    llm-api-examples/llm_api_examples
+    examples/index.rst
+    examples/customization.md
+    examples/llm_api_examples
+    examples/trtllm_serve_examples
 
 
 .. toctree::

examples/apps/README.md (−39)

@@ -1,43 +1,4 @@
 # Apps examples with GenerationExecutor / LLM API
-## OpenAI API
-The `trtllm-serve` command launches an OpenAI compatible server which supports `v1/version`, `v1/completions` and `v1/chat/completions`. [openai_client.py](./openai_client.py) is a simple example using OpenAI client to query your model. To start the server, you can run
-```
-trtllm-serve <model>
-```
-Then you can query the APIs by running our example client or by `curl`.
-### v1/completions
-Query by `curl`:
-```
-curl http://localhost:8000/v1/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-        "model": <model_name>,
-        "prompt": "Where is New York?",
-        "max_tokens": 16,
-        "temperature": 0
-    }'
-```
-Query by our example:
-```
-python3 ./openai_client.py --prompt "Where is New York?" --api completions
-```
-### v1/chat/completions
-Query by `curl`:
-```
-curl http://localhost:8000/v1/chat/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-        "model": <model_name>,
-        "messages":[{"role": "system", "content": "You are a helpful assistant."},
-                    {"role": "user", "content": "Where is New York?"}],
-        "max_tokens": 16,
-        "temperature": 0
-    }'
-```
-Query by our example:
-```
-python3 ./openai_client.py --prompt "Where is New York?" --api chat
-```
 ## Python chat
 
 [chat.py](./chat.py) provides a small examples to play around with your model. Before running, install additional requirements with ` pip install -r ./requirements.txt`. Then you can run it with

examples/apps/openai_client.py (−88)

This file was deleted.

examples/llm-api/README.md (+1 −1)

@@ -1,3 +1,3 @@
 # LLM API Examples
 
-Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/) and [examples](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/) for detailed information and usage guidelines regarding the LLM API.
+Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.

examples/serve/README.md (+3)

@@ -0,0 +1,3 @@
+# Online Serving Examples with `trtllm-serve`
+
+We provide a CLI command, `trtllm-serve`, to launch a FastAPI server compatible with OpenAI APIs. Here are some client examples for querying the server; you can check the source code here, or refer to the [command documentation](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve.html) and [examples](https://nvidia.github.io/TensorRT-LLM/examples/trtllm_serve_examples.html) for detailed information and usage guidelines.
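As a counterpart to the chat sketch earlier, a Completions API query through the OpenAI Python client might look like the following. Again, this is a sketch assuming a local server and the TinyLlama model from the curl example below; the shipped openai_completion_client.py may differ.

```python
# Sketch of an OpenAI-compatible completions query against trtllm-serve.
# Assumptions: server at localhost:8000, model TinyLlama-1.1B-Chat-v1.0.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # placeholder; assumed to be ignored by the server
)

response = client.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    prompt="Where is New York?",
    max_tokens=16,
    temperature=0,
)
print(response.choices[0].text)
```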

examples/serve/curl_chat_client.sh (+11)

@@ -0,0 +1,11 @@
+#! /usr/bin/env bash
+
+curl http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "TinyLlama-1.1B-Chat-v1.0",
+        "messages":[{"role": "system", "content": "You are a helpful assistant."},
+                    {"role": "user", "content": "Where is New York?"}],
+        "max_tokens": 16,
+        "temperature": 0
+    }'
