Fix openai-server example #1228

Merged
merged 1 commit into from
Sep 6, 2023
2 changes: 1 addition & 1 deletion examples/openai-server/README.md
@@ -26,7 +26,7 @@ Set up the server:
```
python examples/openai-server/server.py --model zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base-none
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
-2023-08-07 17:18:32 __main__ INFO args: Namespace(model='zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base-none', max_model_len=512, prompt_sequence_length=1, internal_kv_cache=False, host='localhost', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None)
+2023-08-07 17:18:32 __main__ INFO args: Namespace(model='zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base-none', max_model_len=512, prompt_sequence_length=16, host='localhost', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None)
2023-08-07 17:18:32 deepsparse.transformers WARNING The neuralmagic fork of transformers may not be installed. It can be installed via `pip install nm_transformers`
Using pad_token, but it is not set yet.
2023-08-07 17:18:34 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/mgoin/.cache/sparsezoo/neuralmagic/codegen_mono-350m-bigpython_bigquery_thepile-base/model.onnx
13 changes: 1 addition & 12 deletions examples/openai-server/server.py
@@ -79,14 +79,12 @@ def __init__(
model: str,
sequence_length: int = 512,
prompt_sequence_length: int = 64,
-internal_kv_cache: bool = False,
):
self.engine = deepsparse.Pipeline.create(
task="text-generation",
model_path=model,
sequence_length=sequence_length,
prompt_sequence_length=prompt_sequence_length,
-internal_kv_cache=internal_kv_cache,
)

def tokenize(self, text: str) -> List[int]:
@@ -689,22 +687,14 @@ async def fake_stream_generator() -> AsyncGenerator[str, None]:
help="maximum number of input+output tokens the model will use",
)
parser.add_argument(
-"--prompt-processing-sequence-length",
+"--prompt-sequence-length",
type=int,
default=16,
help=(
"For large prompts, the prompt is processed in chunks of this length. "
"This is to maximize the inference speed. By default, this is set to 16."
),
)
-parser.add_argument(
-"--use-deepsparse-cache",
-action="store_true",
-help=(
-"If True, the pipeline will use the deepsparse kv cache for caching the "
-"model outputs."
-),
-)

parser.add_argument("--host", type=str, default="localhost", help="host name")
parser.add_argument("--port", type=int, default=8000, help="port number")
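
The help text for `--prompt-sequence-length` above describes processing a long prompt in fixed-size chunks. A minimal Python sketch of that idea follows. `split_prompt` is a hypothetical helper written for illustration; it is not DeepSparse's implementation, which this diff does not show.

```python
from typing import List


def split_prompt(token_ids: List[int], prompt_sequence_length: int = 16) -> List[List[int]]:
    """Split a tokenized prompt into consecutive chunks (hypothetical helper).

    Every chunk holds `prompt_sequence_length` tokens except possibly the
    last one, which holds whatever remains.
    """
    if prompt_sequence_length < 1:
        raise ValueError("prompt_sequence_length must be at least 1")
    return [
        token_ids[start:start + prompt_sequence_length]
        for start in range(0, len(token_ids), prompt_sequence_length)
    ]


# A 40-token prompt with the default chunk length of 16.
chunks = split_prompt(list(range(40)))
print([len(chunk) for chunk in chunks])  # [16, 16, 8]
```

With the default of 16, a 40-token prompt yields two full chunks and an 8-token remainder. How the real pipeline handles that remainder is not stated here.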
@@ -752,7 +742,6 @@ async def fake_stream_generator() -> AsyncGenerator[str, None]:
model=args.model,
sequence_length=max_model_len,
prompt_sequence_length=args.prompt_sequence_length,
-internal_kv_cache=args.internal_kv_cache,
)
tokenizer = engine.engine.tokenizer

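
As far as this diff shows, neither removed flag set an explicit `dest`, so `argparse` would have derived the attribute names from the option strings: `--prompt-processing-sequence-length` becomes `args.prompt_processing_sequence_length`, and `--use-deepsparse-cache` becomes `args.use_deepsparse_cache`. The calling code reads `args.prompt_sequence_length` and, before this change, `args.internal_kv_cache`, which would explain why the example needed fixing. The standard-library sketch below illustrates the mismatch; the two parser-builder functions are hypothetical and reproduce only the arguments relevant here.

```python
import argparse


def build_old_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the flags removed in this diff.
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt-processing-sequence-length", type=int, default=16)
    parser.add_argument("--use-deepsparse-cache", action="store_true")
    return parser


def build_new_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the flag as renamed in this diff.
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt-sequence-length", type=int, default=16)
    return parser


old_args = build_old_parser().parse_args([])
new_args = build_new_parser().parse_args(["--prompt-sequence-length", "32"])

# The old flag names never produce the attributes the caller reads.
print(hasattr(old_args, "prompt_sequence_length"))  # False
print(hasattr(old_args, "internal_kv_cache"))       # False

# The renamed flag produces exactly the attribute the caller reads.
print(new_args.prompt_sequence_length)              # 32
```

Renaming the flag keeps the call site unchanged, whereas passing `dest="prompt_sequence_length"` to the old flag would have kept the old command-line spelling instead.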