`examples/server/README.md` (+161 −91)
> [!IMPORTANT]
>
> This endpoint is **not** OAI-compatible. For OAI-compatible clients, use `/v1/completions` instead.

*Options:*
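The note above points OAI-compatible clients at `/v1/completions`. As a sketch of the request-body difference, assuming typical field names (`n_predict` for the native endpoint, `max_tokens` for the OAI-style one; these are illustrative assumptions, not taken from this excerpt):

```python
import json

# Hypothetical request bodies; field names are assumptions for illustration.
native_body = {"prompt": "Hello", "n_predict": 16}                     # POST /completion
oai_body = {"model": "anything", "prompt": "Hello", "max_tokens": 16}  # POST /v1/completions

# Both serialize to the JSON payload of the POST request.
assert json.loads(json.dumps(native_body))["n_predict"] == 16
assert json.loads(json.dumps(oai_body))["max_tokens"] == 16
```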
- `tokens_evaluated`: Number of tokens evaluated in total from the prompt
- `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens_predicted`) exceeded the context size (`n_ctx`)
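The `truncated` condition above is a simple inequality; as a sketch:

```python
def is_truncated(tokens_evaluated: int, tokens_predicted: int, n_ctx: int) -> bool:
    # Mirrors the documented condition: prompt tokens plus generated
    # tokens exceeding the context size.
    return tokens_evaluated + tokens_predicted > n_ctx

assert is_truncated(400, 200, 512) is True
assert is_truncated(100, 100, 512) is False
```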
### POST `/tokenize`: Tokenize a given text

*Options:*
### POST `/embedding`: Generate embedding of a given text

> [!IMPORTANT]
>
> This endpoint is **not** OAI-compatible. For OAI-compatible clients, use `/v1/embeddings` instead.

The same as [the embedding example](../embedding) does.

*Options:*
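The IMPORTANT note above mirrors the completions distinction: `/embedding` is the native endpoint, `/v1/embeddings` the OAI-compatible one. As a payload sketch (the `input` field appears in the `/v1/embeddings` curl examples; the native field name `content` is an assumption for illustration):

```python
import json

# OAI-style body, as used by /v1/embeddings.
oai_body = {"input": ["hello", "world"], "model": "GPT-4", "encoding_format": "float"}
# Assumed native body for /embedding.
native_body = {"content": "hello"}

assert json.loads(json.dumps(oai_body))["input"] == ["hello", "world"]
assert json.loads(json.dumps(native_body))["content"] == "hello"
```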
### POST `/embeddings`: non-OpenAI-compatible embeddings API

This endpoint supports all poolings, including `--pooling none`. When the pooling is `none`, the responses will contain the *unnormalized* embeddings for *all* input tokens. For all other pooling types, only the pooled embeddings are returned, normalized using the Euclidean norm.
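The normalization mentioned above is the ordinary L2 (Euclidean) norm; a minimal sketch:

```python
import math

def l2_normalize(vec):
    # Divide each component by the Euclidean norm of the vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])
assert abs(v[0] - 0.6) < 1e-12 and abs(v[1] - 0.8) < 1e-12
```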
## OpenAI-compatible API Endpoints

### GET `/v1/models`: OpenAI-compatible Model Info API

Returns information about the loaded model. See [OpenAI Models API documentation](https://platform.openai.com/docs/api-reference/models).

The returned list always has exactly one element.

By default, the model `id` field is the path to the model file, as specified via `-m`. You can set a custom value for the `id` field via the `--alias` argument, for example `--alias gpt-4o-mini`.
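A sketch of consuming the response, assuming the standard OpenAI list shape (the exact fields returned by the server are not shown in this excerpt):

```python
import json

# Hypothetical response body in the OpenAI "list" format.
sample = json.loads('{"object": "list", "data": [{"id": "gpt-4o-mini", "object": "model"}]}')

# As stated above, the returned list always has a single element.
assert len(sample["data"]) == 1
assert sample["data"][0]["id"] == "gpt-4o-mini"
```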
### POST `/v1/completions`: OpenAI-compatible Completions API

Given an input `prompt`, it returns the predicted completion. Streaming mode is also supported. While no strong claims of compatibility with the OpenAI API spec are made, in our experience it suffices to support many apps.

*Options:*

See [OpenAI Completions API documentation](https://platform.openai.com/docs/api-reference/completions).

llama.cpp `/completion`-specific features such as `mirostat` are supported.
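A minimal request-body sketch in the OpenAI Completions shape (the `model` value is a placeholder here, as in the chat examples):

```python
import json

# Hypothetical request body; field names follow the OpenAI Completions API.
body = {
    "model": "gpt-3.5-turbo",  # placeholder value
    "prompt": "Building a website can be done in 10 simple steps:",
    "max_tokens": 64,
}
assert json.loads(json.dumps(body))["max_tokens"] == 64
```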
### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API

Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template is used.

*Options:*

See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported.

The `response_format` parameter supports both plain JSON output (e.g. `{"type": "json_object"}`) and schema-constrained JSON (e.g. `{"type": "json_object", "schema": {"type": "string", "minLength": 10, "maxLength": 100}}` or `{"type": "json_schema", "schema": {"properties": { "name": { "title": "Name", "type": "string" }, "date": { "title": "Date", "type": "string" }, "participants": { "items": {"type": "string" }, "title": "Participants", "type": "array" } } } }`), similar to other OpenAI-inspired API providers.

*Examples:*

You can use either the Python `openai` library with appropriate checkpoints:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

print(completion.choices[0].message)
```

... or raw HTTP requests:

```shell
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
"role": "user",
"content": "Write a limerick about python exceptions"
}
]
}'
```
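The schema-constrained `response_format` described above can be assembled as an ordinary JSON-serializable dict; a sketch:

```python
import json

# The "json_schema" variant of response_format from the text, as a dict.
response_format = {
    "type": "json_schema",
    "schema": {
        "properties": {
            "name": {"title": "Name", "type": "string"},
            "date": {"title": "Date", "type": "string"},
            "participants": {"items": {"type": "string"}, "title": "Participants", "type": "array"},
        }
    },
}

# Round-trips cleanly as JSON.
parsed = json.loads(json.dumps(response_format))
assert parsed["schema"]["properties"]["participants"]["items"]["type"] == "string"
```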
### POST `/v1/embeddings`: OpenAI-compatible embeddings API

This endpoint requires that the model uses a pooling type other than `none`. The embeddings are normalized using the Euclidean norm.

*Options:*

See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).

*Examples:*

- `input` as string

```shell
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"input": "hello",
"model": "GPT-4",
"encoding_format": "float"
}'
```

- `input` as string array

```shell
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"input": ["hello", "world"],
"model": "GPT-4",
"encoding_format": "float"
}'
```