* server : add "tokens" output
ggml-ci
* server : update readme
ggml-ci
* server : return token ids only if requested
ggml-ci
* tests : improve "tokens" type check
Co-authored-by: Xuan Son Nguyen <[email protected]>
* server : remove "tokens" from the OAI endpoint
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <[email protected]>
examples/server/README.md (+6, -2)
@@ -438,19 +438,22 @@ These words will not be included in the completion, so make sure to add them to

 `cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `true`

+`return_tokens`: Return the raw generated token ids in the `tokens` field. Otherwise `tokens` remains empty. Default: `false`
+
 `samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["dry", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature"]` - these are all the available values.

 `timings_per_token`: Include prompt processing and text generation speed information in each response. Default: `false`

 **Response format**

-- Note: In streaming mode (`stream`), only `content` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.
+- Note: In streaming mode (`stream`), only `content`, `tokens` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.

 - `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:

 ```json
 {
-  "content": "<the token selected by the model>",
+  "content": "<the token generated by the model>",
+  "tokens": [ generated token ids if requested ],
   "probs": [
     {
       "prob": float,
@@ -468,6 +471,7 @@ These words will not be included in the completion, so make sure to add them to
 Notice that each `probs` is an array of length `n_probs`.

 - `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
+- `tokens`: Same as `content` but represented as raw token ids. Only populated if `"return_tokens": true` or `"stream": true` in the request.
 - `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
 - `generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`. These options may differ from the original ones in some way (e.g. bad values filtered out, strings converted to tokens, etc.).
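For a quick sanity check of the new field, here is a minimal client sketch (not part of this commit). It assumes a `llama-server` instance listening on the default `http://localhost:8080` and uses the third-party Python `requests` package; the `/completion` endpoint, the `return_tokens` parameter, and the `content`/`tokens` response fields are the ones documented in the README excerpt above.

```python
# Minimal sketch: request a completion with "return_tokens" enabled and read
# back both the text ("content") and the raw token ids ("tokens").
# Assumes llama-server is running on http://localhost:8080 (the default port).
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Building a website can be done in 10 simple steps:",
        "n_predict": 16,
        "return_tokens": True,  # without this, "tokens" stays empty (default: false)
    },
)
resp.raise_for_status()
data = resp.json()

print(data["content"])  # completion result as a string
print(data["tokens"])   # the same completion as raw token ids

# Mirrors the type check from the tests: "tokens" should be a list of ints.
assert isinstance(data["tokens"], list)
assert all(isinstance(t, int) for t in data["tokens"])
```

Per the streaming note above, a streaming request does not need the flag: each SSE chunk already carries `tokens` alongside `content` and `stop`.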