
[Bug][V0][Triton MLA][GGUF]: DeepSeek R1 GGUF starts producing gibberish towards the end of a longer generation #15340

Open
davidsyoung opened this issue Mar 22, 2025 · 11 comments
Labels
bug Something isn't working

Comments


davidsyoung commented Mar 22, 2025

Your current environment

The output of `python collect_env.py`
vLLM OpenAI server Docker image, v0.8.1

🐛 Describe the bug

When running inference with the DeepSeek R1 Q3_K_M GGUF quant, the model starts to produce gibberish towards the end of longer generations.

I have followed the directions in #13167 (comment) for the --tokenizer and --hf-config-path configuration.
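
For reference, /models/dp-v2/ (used for both --tokenizer and --hf-config-path) is assumed to hold the original DeepSeek-R1 config and tokenizer files from the Hugging Face repo. A minimal sketch, assuming that setup, of how such a directory could be populated with huggingface_hub (repo id and file patterns are my assumption):

# Sketch: pull only the config/tokenizer JSON files from the original repo
# into the directory that --tokenizer / --hf-config-path point at.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1",
    allow_patterns=["*.json"],  # config.json, generation_config.json, tokenizer.json, tokenizer_config.json
    local_dir="/models/dp-v2",
)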

I have tested various images, including nightly builds and the most recent 0.8.1 release; the issue persists.

I would appreciate some direction on this, as vLLM is by far the fastest inference engine for GGUF on my 16x3090 config, but this bug is blocking me. @SzymonOzog mentioned he ran into a similar issue with the model overflowing and producing NaNs, which has since been fixed (ref: #13167 (comment)).

Unfortunately, I'm at a bit of a loss to fix this myself.

Run command:

networks:
  vllm-dev:
    external: true
    name: br1

services:
  vllm-dev:
    image: vllm/vllm-openai:v0.8.1
    runtime: nvidia
    restart: unless-stopped
    networks:
      vllm-dev:
        ipv4_address: 192.168.x.x
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      - VLLM_RPC_TIMEOUT=180000
      - VLLM_PP_LAYER_PARTITION=31,30
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    ports:
      - "8000:8000"
    volumes:
      - /mnt/user/appdata/models:/models
    ipc: "host"
    command: --swap-space 2 \
        --model /models/dp-config/DeepSeek-R1-Q3_K_M.gguf \
        --enable-reasoning --reasoning-parser deepseek_r1 \
        --seed 3407 \
        --served-model-name deepseek-ai/DeepSeek-R1 \
        --hf-config-path /models/dp-v2/ --tokenizer /models/dp-v2/ \
        --gpu-memory-utilization 0.945 \
        --max-model-len 8192 \
        --max-num-seqs 3 \
        --trust-remote-code \
        --tensor-parallel-size 8 \
        --pipeline-parallel-size 2 \
        --host 192.168.10.225 \
        --port 8000 \
        --enable-chunked-prefill=True
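
For completeness, a minimal sketch of the kind of request that triggers the issue, mirroring the /v1/chat/completions call visible in the log below (host, port, served model name, and sampling defaults are taken from the run command and log; the API key is a placeholder, and the `openai` Python client is assumed to be installed):

# Sketch: long chat completion against the server above; the gibberish
# appears late in the generation. Values mirror the run command and log.
from openai import OpenAI

client = OpenAI(base_url="http://192.168.10.225:8000/v1", api_key="PLACEHOLDER")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Show me a code snippet of a website's sticky header in CSS and JavaScript."}],
    max_tokens=8000,
    temperature=0.6,
    top_p=0.95,
)
print(resp.choices[0].message.content)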

Run log
INFO 03-22 18:40:10 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:14 [api_server.py:977] vLLM API server version 0.8.1
INFO 03-22 18:40:14 [api_server.py:978] args: Namespace(host='192.168.10.225', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='b18766c98a9b8092dcb66033afabff4f', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/models/dp-config/DeepSeek-R1-Q3_K_M.gguf', task='auto', tokenizer='/models/dp-v2/', hf_config_path='/models/dp-v2/', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=8, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=3407, swap_space=2.0, cpu_offload_gb=0, gpu_memory_utilization=0.945, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=3, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-ai/DeepSeek-R1'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', 
worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=True, reasoning_parser='deepseek_r1', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
WARNING 03-22 18:40:14 [utils.py:2079] Found ulimit of 40960 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like `OSError: [Errno 24] Too many open files`. Consider increasing with ulimit -n
INFO 03-22 18:40:14 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-22 18:40:23 [config.py:583] This model supports multiple tasks: {'score', 'classify', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 03-22 18:40:23 [config.py:662] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 03-22 18:40:23 [arg_utils.py:1765] --quantization gguf is not supported by the V1 Engine. Falling back to V0. 
INFO 03-22 18:40:23 [config.py:1515] Defaulting to use mp for distributed inference
INFO 03-22 18:40:23 [config.py:1693] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-22 18:40:24 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.1) with config: model='/models/dp-config/DeepSeek-R1-Q3_K_M.gguf', speculative_config=None, tokenizer='/models/dp-v2/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=3407, served_model_name=deepseek-ai/DeepSeek-R1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[4,2,1],"max_capture_size":4}, use_cached_outputs=False, 
WARNING 03-22 18:40:25 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-22 18:40:25 [cuda.py:190] Using Triton MLA backend.
WARNING 03-22 18:40:28 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-22 18:40:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 18:40:30 [__init__.py:256] Automatically detected platform cuda.
(VllmWorkerProcess pid=279) INFO 03-22 18:40:33 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=281) INFO 03-22 18:40:33 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=283) INFO 03-22 18:40:33 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=279) INFO 03-22 18:40:33 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=275) INFO 03-22 18:40:33 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=281) INFO 03-22 18:40:33 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=283) INFO 03-22 18:40:33 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=276) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=275) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=277) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=278) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=276) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=274) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=277) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=278) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=272) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=271) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=274) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=285) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=280) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=273) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=282) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=284) INFO 03-22 18:40:34 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=272) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=271) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=285) INFO 03-22 18:40:34 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=282) INFO 03-22 18:40:35 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=284) INFO 03-22 18:40:35 [cuda.py:190] Using Triton MLA backend.
(VllmWorkerProcess pid=279) WARNING 03-22 18:40:36 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=283) WARNING 03-22 18:40:38 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=281) WARNING 03-22 18:40:39 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=275) WARNING 03-22 18:40:39 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=276) WARNING 03-22 18:40:40 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=277) WARNING 03-22 18:40:40 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=273) WARNING 03-22 18:40:40 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=278) WARNING 03-22 18:40:40 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=285) WARNING 03-22 18:40:40 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=284) WARNING 03-22 18:40:40 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=272) WARNING 03-22 18:40:40 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=271) WARNING 03-22 18:40:41 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=282) WARNING 03-22 18:40:41 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=274) WARNING 03-22 18:40:41 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=280) WARNING 03-22 18:40:41 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=272) INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=272) INFO 03-22 18:40:45 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=284) INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=275) INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=284) INFO 03-22 18:40:45 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 03-22 18:40:45 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=275) INFO 03-22 18:40:45 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=283) INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=277) INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=283) INFO 03-22 18:40:45 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=282) INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=285) INFO 03-22 18:40:45 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=279) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=281) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=282) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=285) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=278) INFO 03-22 18:40:47 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_c8657217'), local_subscribe_addr='ipc:///tmp/59169403-03a0-4321-9215-d9317b8825e8', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=271) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=273) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=274) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=272) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=275) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=276) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=277) WARNING 03-22 18:40:47 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 03-22 18:40:47 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_4f24e55b'), local_subscribe_addr='ipc:///tmp/e6521145-fa0f-478e-896e-d5466e72273b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
INFO 03-22 18:40:47 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=278) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=278) INFO 03-22 18:40:47 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=280) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=273) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=285) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=275) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=279) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=284) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=272) INFO 03-22 18:40:47 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=284) INFO 03-22 18:40:47 [parallel_state.py:967] rank 14 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 6
(VllmWorkerProcess pid=276) INFO 03-22 18:40:47 [parallel_state.py:967] rank 6 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 6
(VllmWorkerProcess pid=271) INFO 03-22 18:40:47 [parallel_state.py:967] rank 1 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorkerProcess pid=273) INFO 03-22 18:40:47 [parallel_state.py:967] rank 3 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorkerProcess pid=279) INFO 03-22 18:40:47 [parallel_state.py:967] rank 9 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 1
(VllmWorkerProcess pid=282) INFO 03-22 18:40:47 [parallel_state.py:967] rank 12 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 4
(VllmWorkerProcess pid=274) INFO 03-22 18:40:47 [parallel_state.py:967] rank 4 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 4
(VllmWorkerProcess pid=285) INFO 03-22 18:40:47 [parallel_state.py:967] rank 15 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 7
(VllmWorkerProcess pid=277) INFO 03-22 18:40:47 [parallel_state.py:967] rank 7 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 7
(VllmWorkerProcess pid=281) INFO 03-22 18:40:47 [parallel_state.py:967] rank 11 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 3
(VllmWorkerProcess pid=280) INFO 03-22 18:40:47 [parallel_state.py:967] rank 10 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 2
(VllmWorkerProcess pid=278) INFO 03-22 18:40:47 [model_runner.py:1110] Starting to load model /models/dp-config/DeepSeek-R1-Q3_K_M.gguf...
(VllmWorkerProcess pid=282) INFO 03-22 18:40:47 [model_runner.py:1110] Starting to load model /models/dp-config/DeepSeek-R1-Q3_K_M.gguf...
INFO 03-22 18:40:47 [model_runner.py:1110] Starting to load model /models/dp-config/DeepSeek-R1-Q3_K_M.gguf...
(VllmWorkerProcess pid=276) INFO 03-22 18:40:47 [model_runner.py:1110] Starting to load model /models/dp-config/DeepSeek-R1-Q3_K_M.gguf...
(VllmWorkerProcess pid=285) INFO 03-22 18:40:47 [model_runner.py:1110] Starting to load model /models/dp-config/DeepSeek-R1-Q3_K_M.gguf...
(VllmWorkerProcess pid=283) WARNING 03-22 18:41:18 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=279) WARNING 03-22 18:41:18 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=278) WARNING 03-22 18:41:18 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=285) WARNING 03-22 18:41:18 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=276) WARNING 03-22 18:41:18 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=273) WARNING 03-22 18:41:18 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
WARNING 03-22 18:41:19 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=282) WARNING 03-22 18:41:19 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=280) WARNING 03-22 18:41:19 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=275) WARNING 03-22 18:41:19 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=272) WARNING 03-22 18:41:20 [utils.py:169] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=275) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=275)   return _nested.nested_tensor(
(VllmWorkerProcess pid=277) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=277)   return _nested.nested_tensor(
(VllmWorkerProcess pid=271) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=276) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=273) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=273)   return _nested.nested_tensor(
/opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
(VllmWorkerProcess pid=279) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=284) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=282) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=285) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=283) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=280) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=279)   return _nested.nested_tensor(
(VllmWorkerProcess pid=284)   return _nested.nested_tensor(
(VllmWorkerProcess pid=278) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=282)   return _nested.nested_tensor(
(VllmWorkerProcess pid=281) /opt/venv/lib/python3.12/site-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
(VllmWorkerProcess pid=285)   return _nested.nested_tensor(
(VllmWorkerProcess pid=283)   return _nested.nested_tensor(
(VllmWorkerProcess pid=280)   return _nested.nested_tensor(
(VllmWorkerProcess pid=278)   return _nested.nested_tensor(
(VllmWorkerProcess pid=281)   return _nested.nested_tensor(
(VllmWorkerProcess pid=276) INFO 03-22 18:52:51 [model_runner.py:1146] Model loading took 18.3843 GB and 724.014383 seconds
(VllmWorkerProcess pid=285) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 19.5904 GB and 724.529272 seconds
(VllmWorkerProcess pid=273) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 18.3843 GB and 724.529966 seconds
(VllmWorkerProcess pid=277) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 18.3843 GB and 724.544103 seconds
(VllmWorkerProcess pid=274) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 18.3843 GB and 724.542021 seconds
(VllmWorkerProcess pid=275) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 18.3843 GB and 724.558444 seconds
(VllmWorkerProcess pid=271) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 18.3843 GB and 724.543127 seconds
(VllmWorkerProcess pid=283) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 19.5904 GB and 724.573560 seconds
INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 18.3843 GB and 724.537170 seconds
(VllmWorkerProcess pid=282) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 19.5904 GB and 724.575245 seconds
(VllmWorkerProcess pid=272) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 18.3843 GB and 724.618766 seconds
(VllmWorkerProcess pid=281) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 19.5904 GB and 724.653382 seconds
(VllmWorkerProcess pid=280) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 19.5904 GB and 724.653049 seconds
(VllmWorkerProcess pid=279) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 19.5904 GB and 724.657291 seconds
(VllmWorkerProcess pid=278) INFO 03-22 18:52:52 [model_runner.py:1146] Model loading took 19.5904 GB and 724.662290 seconds
(VllmWorkerProcess pid=284) INFO 03-22 18:52:55 [model_runner.py:1146] Model loading took 19.5904 GB and 727.899927 seconds
(VllmWorkerProcess pid=280) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.52 seconds
(VllmWorkerProcess pid=280) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=280) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=279) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.54 seconds
(VllmWorkerProcess pid=279) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=279) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=284) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.58 seconds
(VllmWorkerProcess pid=284) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=284) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=283) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.59 seconds
(VllmWorkerProcess pid=283) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=283) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=282) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.58 seconds
(VllmWorkerProcess pid=282) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=282) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=281) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.58 seconds
(VllmWorkerProcess pid=281) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=281) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=285) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.59 seconds
(VllmWorkerProcess pid=285) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=285) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=278) INFO 03-22 18:53:21 [worker.py:267] Memory profiling takes 25.78 seconds
(VllmWorkerProcess pid=278) INFO 03-22 18:53:21 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=278) INFO 03-22 18:53:21 [worker.py:267] model weights take 19.59GiB; non_torch_memory takes 0.20GiB; PyTorch activation peak memory takes 0.82GiB; the rest of the memory reserved for KV Cache is 1.68GiB.
(VllmWorkerProcess pid=271) INFO 03-22 18:53:23 [worker.py:267] Memory profiling takes 27.79 seconds
(VllmWorkerProcess pid=271) INFO 03-22 18:53:23 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=271) INFO 03-22 18:53:23 [worker.py:267] model weights take 18.38GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 0.79GiB; the rest of the memory reserved for KV Cache is 2.96GiB.
(VllmWorkerProcess pid=273) INFO 03-22 18:53:23 [worker.py:267] Memory profiling takes 27.80 seconds
(VllmWorkerProcess pid=273) INFO 03-22 18:53:23 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=273) INFO 03-22 18:53:23 [worker.py:267] model weights take 18.38GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 0.79GiB; the rest of the memory reserved for KV Cache is 2.96GiB.
(VllmWorkerProcess pid=277) INFO 03-22 18:53:23 [worker.py:267] Memory profiling takes 27.81 seconds
(VllmWorkerProcess pid=277) INFO 03-22 18:53:23 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=277) INFO 03-22 18:53:23 [worker.py:267] model weights take 18.38GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 0.79GiB; the rest of the memory reserved for KV Cache is 2.96GiB.
INFO 03-22 18:53:23 [worker.py:267] Memory profiling takes 27.78 seconds
INFO 03-22 18:53:23 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
INFO 03-22 18:53:23 [worker.py:267] model weights take 18.38GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 0.79GiB; the rest of the memory reserved for KV Cache is 2.96GiB.
(VllmWorkerProcess pid=275) INFO 03-22 18:53:23 [worker.py:267] Memory profiling takes 27.82 seconds
(VllmWorkerProcess pid=275) INFO 03-22 18:53:23 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=274) INFO 03-22 18:53:23 [worker.py:267] model weights take 18.38GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 0.79GiB; the rest of the memory reserved for KV Cache is 2.96GiB.
(VllmWorkerProcess pid=276) INFO 03-22 18:53:23 [worker.py:267] Memory profiling takes 27.82 seconds
(VllmWorkerProcess pid=276) INFO 03-22 18:53:23 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.58GiB) x gpu_memory_utilization (0.94) = 22.29GiB
(VllmWorkerProcess pid=276) INFO 03-22 18:53:23 [worker.py:267] model weights take 18.38GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 0.79GiB; the rest of the memory reserved for KV Cache is 2.96GiB.
INFO 03-22 18:53:23 [executor_base.py:111] # cuda blocks: 3258, # CPU blocks: 3758
INFO 03-22 18:53:23 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 6.36x
(VllmWorkerProcess pid=285) INFO 03-22 18:53:34 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=282) INFO 03-22 18:53:35 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=283) INFO 03-22 18:53:35 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-22 18:53:38 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=272) INFO 03-22 18:53:38 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 3/3 [00:09<00:00,  3.32s/it]
Capturing CUDA graph shapes: 100%|██████████| 3/3 [00:08<00:00,  2.67s/it]
(VllmWorkerProcess pid=280) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.17 GiB
Capturing CUDA graph shapes: 100%|██████████| 3/3 [00:02<00:00,  1.05it/s]
(VllmWorkerProcess pid=278) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 13 secs, took 0.17 GiB
(VllmWorkerProcess pid=283) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 13 secs, took 0.17 GiB
(VllmWorkerProcess pid=284) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 10 secs, took 0.17 GiB
(VllmWorkerProcess pid=279) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.17 GiB
(VllmWorkerProcess pid=282) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 13 secs, took 0.17 GiB
(VllmWorkerProcess pid=281) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 12 secs, took 0.17 GiB
(VllmWorkerProcess pid=285) INFO 03-22 18:53:48 [model_runner.py:1570] Graph capturing finished in 14 secs, took 0.17 GiB
(VllmWorkerProcess pid=276) INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.19 GiB
(VllmWorkerProcess pid=274) INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 12 secs, took 0.19 GiB
(VllmWorkerProcess pid=272) INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.19 GiB
(VllmWorkerProcess pid=275) INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.19 GiB
(VllmWorkerProcess pid=273) INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.19 GiB
(VllmWorkerProcess pid=277) INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.19 GiB
(VllmWorkerProcess pid=271) INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 12 secs, took 0.19 GiB
Capturing CUDA graph shapes: 100%|██████████| 3/3 [00:02<00:00,  1.01it/s]
INFO 03-22 18:53:49 [model_runner.py:1570] Graph capturing finished in 11 secs, took 0.19 GiB
INFO 03-22 18:53:49 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 54.39 seconds
INFO 03-22 18:53:50 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 03-22 18:53:50 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 03-22 18:53:50 [api_server.py:1024] Starting vLLM API server on http://192.168.10.225:8000
INFO 03-22 18:53:50 [launcher.py:26] Available routes are:
INFO 03-22 18:53:50 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 03-22 18:53:50 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 03-22 18:53:50 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-22 18:53:50 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 03-22 18:53:50 [launcher.py:34] Route: /health, Methods: GET
INFO 03-22 18:53:50 [launcher.py:34] Route: /load, Methods: GET
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 03-22 18:54:25 [chat_utils.py:346] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 03-22 18:54:25 [logger.py:39] Received request chatcmpl-39c6bdfde1e143c19c76dfe72ce3cc3e: prompt: "<|begin▁of▁sentence|><|User|>Show me a code snippet of a website's sticky header in CSS and JavaScript.<|Assistant|><think>\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8171, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-22 18:54:25 [async_llm_engine.py:211] Added request chatcmpl-39c6bdfde1e143c19c76dfe72ce3cc3e.
INFO:     192.168.1.64:44616 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(VllmWorkerProcess pid=277) /opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
(VllmWorkerProcess pid=277)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank7]:[W322 18:54:27.377827544 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank15]:[W322 18:54:27.377957486 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
(VllmWorkerProcess pid=271) /opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
(VllmWorkerProcess pid=271)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(VllmWorkerProcess pid=275) /opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
(VllmWorkerProcess pid=273) /opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
(VllmWorkerProcess pid=272) /opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
(VllmWorkerProcess pid=275)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(VllmWorkerProcess pid=273)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(VllmWorkerProcess pid=272)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(VllmWorkerProcess pid=274) /opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
(VllmWorkerProcess pid=274)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(VllmWorkerProcess pid=276) /opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
(VllmWorkerProcess pid=276)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank1]:[W322 18:54:27.380129008 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank4]:[W322 18:54:27.380217129 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank2]:[W322 18:54:27.380218309 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank3]:[W322 18:54:27.380219399 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank5]:[W322 18:54:27.380248339 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank6]:[W322 18:54:27.380314470 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank9]:[W322 18:54:27.380311840 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank11]:[W322 18:54:27.380335620 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank12]:[W322 18:54:27.380335960 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank13]:[W322 18:54:27.380351460 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank10]:[W322 18:54:27.380370780 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank14]:[W322 18:54:27.380446631 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
/opt/venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:408: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1561.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank8]:[W322 18:54:27.389643976 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank0]:[W322 18:54:27.389699367 ProcessGroupNCCL.cpp:3436] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
INFO 03-22 18:54:28 [async_llm_engine.py:223] Aborted request chatcmpl-39c6bdfde1e143c19c76dfe72ce3cc3e.
INFO 03-22 18:54:40 [metrics.py:481] Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-22 18:54:50 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-22 18:54:56 [logger.py:39] Received request chatcmpl-3b4a21965d0a4add83f94e8cd2d84d7d: prompt: "<|begin▁of▁sentence|><|User|>Show me a code snippet of a website's sticky header in CSS and JavaScript.<|Assistant|><think>\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8171, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-22 18:54:56 [async_llm_engine.py:211] Added request chatcmpl-3b4a21965d0a4add83f94e8cd2d84d7d.
INFO:     192.168.1.64:34600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-22 18:55:01 [metrics.py:481] Avg prompt throughput: 4.2 tokens/s, Avg generation throughput: 29.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:06 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:11 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:16 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.2%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:21 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:26 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.8%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:31 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.1%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:36 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.3%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:41 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.6%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:46 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.9%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:51 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.2%, CPU KV cache usage: 0.0%.
INFO 03-22 18:55:56 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.5%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:01 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.7%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:06 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.0%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:11 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.3%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:16 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.5%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:21 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:26 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.1%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:31 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.3%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:36 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.6%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:41 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.9%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:46 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.1%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:51 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.4%, CPU KV cache usage: 0.0%.
INFO 03-22 18:56:56 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.6%, CPU KV cache usage: 0.0%.
INFO 03-22 18:57:01 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.9%, CPU KV cache usage: 0.0%.
INFO 03-22 18:57:06 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.1%, CPU KV cache usage: 0.0%.
INFO 03-22 18:57:11 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.3%, CPU KV cache usage: 0.0%.
INFO 03-22 18:57:16 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 24.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.6%, CPU KV cache usage: 0.0%.
INFO 03-22 18:57:46 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 9.0%, CPU KV cache usage: 0.0%.
INFO 03-22 18:57:51 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 24.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 9.2%, CPU KV cache usage: 0.0%.
INFO 03-22 18:57:56 [metrics.py:481] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 24.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 9.5%, CPU KV cache usage: 0.0%.
... at this point I have to kill the generation ...

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@davidsyoung davidsyoung added the bug Something isn't working label Mar 22, 2025
@davidsyoung
Author

davidsyoung commented Mar 22, 2025

Example output:

Output
### USER
Show me a code snippet of a website's sticky header in CSS and JavaScript.

### ASSISTANT
<details type="reasoning" done="true" duration="73">
<summary>Thought for 73 seconds</summary>
> Okay, the user wants a code snippet for a sticky header using CSS and JavaScript. Let me think about how to approach this.
> 
> First, I know a sticky header stays at the top of the page when scrolling. So, in CSS, positioning is key. Using position: sticky; or fixed? Hmm, sticky is good because it can toggle between relative and fixed based on scroll position. But maybe some browsers support fixed better. Wait, position: sticky is supported in modern browsers, but maybe I should also include fixed as a fallback or alternative.
> 
> Wait, the user might want a header that becomes sticky after scrolling past a certain point. So initially, the header is normal, then when you scroll down, it sticks. For that, JavaScript is needed to detect the scroll position and add a class to the header.
> 
> So the CSS part would have the header with position: fixed or sticky. Fixed might be better for cross-browser compatibility. Let me set the header to position: fixed, top: 0, width 100%, maybe with a z-index to stay above other content. But initially, maybe it's not fixed, and then when scrolling, JavaScript adds a class to make it fixed.
> 
> Wait, no. If using position: fixed, you need to have it fixed from the start, but maybe with a transition when adding a class that changes the background or shadow. Alternatively, using position: sticky might require the parent to have some properties.
> 
> Wait, maybe the initial setup is position: static, and when scrolling down, JavaScript adds a class that changes it to fixed. But that could cause a jump. Alternatively, using position: sticky might be simpler, but sometimes people use fixed with a scroll event listener to add a class that makes it stick.
> 
> Let me recall. For a sticky header that appears after scrolling past a certain point, the common approach is to have the header initially at the top, then when the user scrolls beyond a certain threshold (like the header's height), the header becomes fixed.
> 
> So, in CSS, the header might start with position: absolute or relative, and then when the class is added, it becomes fixed. Alternatively, using position: sticky with top: 0, but sometimes that doesn't work if the parent doesn't have a defined height.
> 
> Hmm, perhaps using position: fixed from the start, but hiding it or having it become visible after scrolling. Or maybe the header is part of the normal flow and then becomes fixed when scrolled.
> 
> Wait, maybe the structure is: the header is in the normal flow, and when the user scrolls down, it becomes fixed. To prevent the content from jumping up when the header is fixed, the header's original space is maintained with a placeholder element.
> 
> So, in the HTML, there's the header element, and a placeholder div that's hidden initially. When the header becomes fixed, the placeholder is shown to take up the space, so the content below doesn't jump.
> 
> Okay, that makes sense. So the CSS for the header would initially have position: static, then when the 'sticky' class is added via JavaScript, it becomes position: fixed, top: 0, etc. The placeholder is created to hold the space.
> 
> Now, for the JavaScript part: listen to the scroll event, check the scroll position, and when it's past a certain point (like the header's offsetTop), add the 'sticky' class to the header and show the placeholder.
> 
> Also, to optimize performance, maybe throttle the scroll event listener. But for simplicity, the code might just add the event listener without throttling, but it's something to note.
> 
> Another consideration: the header might have a different style when sticky, like a shadow or smaller height. So the CSS for the 'sticky' class would include transitions for smooth visual changes.
> 
> So putting it all together:
> 
> HTML has a header and a placeholder. CSS for header includes transitions for properties like background and box-shadow. The JavaScript checks the scroll position and toggles the 'sticky' class.
> 
> Wait, in the example, maybe the header is initially at the top, and when scrolled down, it sticks with a shadow. Let's see.
> 
> In the CSS:
> 
> header {
>   padding: 20px;
>   background: white;
>   transition: all 0.3s ease;
>   /* other styles */
> }
> 
> header.sticky {
>   position: fixed;
>   top: 0;
>   width: 100%;
>   box-shadow: 0 2px 5px rgba(0,0,0,0.1);
> }
> 
> Then, in JavaScript:
> 
> window.addEventListener('scroll', function() {
>   const header = document.querySelector('header');
>   const placeholder = document.getElementById('header-placeholder');
>   const scrollPosition = window.scrollY || document.documentElement.scrollTop;
> 
>   if (scrollPosition > header.offsetTop) {
>     header.classList.add('sticky');
>     placeholder.style.display = 'block';
>   } else {
>     header.classList.remove('sticky');
>     placeholder.style.display = 'none';
>   }
> });
> 
> Wait, but header.offsetTop might be zero if it's at the top. So maybe compare scrollPosition against a threshold, like 100 pixels, or the header's original offset.
> 
> Alternatively, if the header is the first element, its offsetTop is 0, so when you scroll past that, add the class. But perhaps better to use a threshold like 100, or the header's own height.
> 
> Wait, maybe the header's height is the threshold. Let's say the header is 80px tall. When the user scrolls past 80px, the header becomes sticky.
> 
> But in the code, how to get the header's height? Maybe header.offsetHeight.
> 
> So:
> 
> const headerHeight = header.offsetHeight;
> 
> if (scrollPosition > headerHeight) {
>   // add sticky class
> }
> 
> But maybe the initial position of the header is not fixed, so when you scroll past its position, it becomes fixed. So perhaps the threshold is the header's original position.
> 
> Alternatively, maybe the header is part of the document flow, so when you scroll past it, it sticks. But that might require using position: sticky with top: 0, but sometimes that's not reliable.
> 
> Hmm, perhaps using position: sticky is simpler. Let me check.
> 
> If the header has position: sticky; top: 0;, then it would stick when scrolling. But sometimes, the parent container's overflow or other properties can affect this. But for a basic example, that might work without JavaScript.
> 
> But the user asked for CSS and JavaScript, so maybe the example uses JavaScript to add a class that changes the style, making it fixed after a certain scroll point.
> 
> So, combining both approaches: CSS for styling and JavaScript for handling the scroll event.
> 
> Another consideration: When the header becomes fixed, it's removed from the document flow, so the content below jumps up. Hence, the placeholder is needed to take up the space.
> 
> So in the HTML:
> 
> <header>...</header>
> <div id="header-placeholder"></div>
> 
> The placeholder is initially hidden (display: none), and when the header becomes fixed, the placeholder is displayed (display: block), with the same height as the header.
> 
> In JavaScript, when adding the sticky class, also set the placeholder's height to the header's height.
> 
> So in code:
> 
> const header = document.querySelector('header');
> const placeholder = document.getElementById('header-placeholder');
> let headerHeight = header.offsetHeight;
> 
> window.addEventListener('scroll', function() {
>   let scrollPosition = window.pageYOffset || document.documentElement.scrollTop;
> 
>   if (scrollPosition > headerHeight) {
>     header.classList.add('sticky');
>     placeholder.style.height = headerHeight + 'px';
>     placeholder.style.display = 'block';
>   } else {
>     header.classList.remove('sticky');
>     placeholder.style.display = 'none';
>     placeholder.style.height = '0';
>   }
> });
> 
> But wait, when the page loads, the header's height is measured. If the header's height changes (e.g., on resize), we need to recalculate. So maybe add a resize event listener to update headerHeight.
> 
> But for simplicity, maybe the example doesn't handle that, but it's something to note.
> 
> Also, using window.pageYOffset is better for cross-browser compatibility.
> 
> Putting it all together:
> 
> CSS:
> 
> header {
>   padding: 20px;
>   background: white;
>   transition: all 0.3s ease;
>   z-index: 1000;
> }
> 
> header.sticky {
>   position: fixed;
>   top: 0;
>   left: 0;
>   right: 0;
>   box-shadow: 0 2px 5px rgba(0,0,0,0.1);
> }
> 
> #header-placeholder {
>   display: none;
> }
> 
> HTML:
> 
> <header>
>   <!-- header content -->
> </header>
> <div id="header-placeholder"></div>
> 
> JavaScript:
> 
> document.addEventListener('DOMContentLoaded', function() {
>   const header = document.querySelector('header');
>   const placeholder = document.getElementById('header-placeholder');
>   let headerHeight = header.offsetHeight;
> 
>   window.addEventListener('scroll', function() {
>     const scrollPosition = window.pageYOffset || document.documentElement.scrollTop;
> 
>     if (scrollPosition > headerHeight) {
>       header.classList.add('sticky');
>       placeholder.style.display = 'block';
>       placeholder.style.height = headerHeight + 'px';
>     } else {
>       header.classList.remove('sticky');
>       placeholder.style.display = 'none';
>       placeholder.style.height = '0';
>     }
>   });
> 
>   // Optional: Update header height on window resize
>   window.addEventListener('resize', function() {
>     headerHeight = header.offsetHeight;
>   });
> });
> 
> Wait, but in the initial state, the header is not fixed. So when the page is at the top, the header is in the normal flow. When you scroll past its height, it becomes fixed, and the placeholder takes its space to prevent content jump.
> 
> This should work. Also, adding a transition to the header for smooth visual changes when adding the sticky class.
> 
> Testing for any possible issues: If the header's height changes after the initial load (e.g., dynamic content or window resize), the resize event listener updates headerHeight. That's good.
> 
> Also, using DOMContentLoaded to ensure the elements are loaded before accessing them.
> 
> This should cover the basics. Now, putting this into a code snippet.
</details>
Here's a code snippet for a sticky header that becomes fixed at the top of the page when scrolling down, with smooth transitions and a placeholder to prevent content jump:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sticky Header Example</title>
    <style>
        * {
            margin:---
title: 前端小知识
date: 2023-04# 用户认证

## 用户认证

用户认证是绝大多数应用的重要组成部分,Lumen 当然也会帮助您实现这一功能。不过,和您所熟悉的在 Laravel 中的实现方式不同,我们需要单独引入 `Laravel\Passport` 来对认证功能提供支持,因为 Lumen 默认是不提供用户认证功能的。

### 安装 Passport

在开始之前,我们通过 Composer 来安装 Passport:

```bash
composer require laravel/passport

Passport 的服务提供器会自行完成服务注册,不过您需要在 bootstrap/app.php 中取消对 $app->withFacades()$app->withEloquent() 的注释,因为它们都是 Passport 需要用到的。

接下来,您需要运行 migrate 命令来创建用于存储客户端和令牌的数据表:

php artisan migrate

然后,运行 passport:install 命令来创建生成安全访问令牌时所需要的加密密钥:

php artisan passport:install

该命令执行后,请将 Laravel\Passport\HasApiTokens Trait 添加到您的 App\User 模型中:

<?php

namespace App;

use Laravel\Passport\HasApiTokens;
use Illuminate\Notifications\Notifiable;
use Illuminate\Contracts\Auth\MustVerifyEmail;
use Illuminate\Foundation\Auth\User as Authenticatable;

class User extends Authenticatable
{
    use HasApiTokens, Notifiable;
}

接下来,您应该在 AuthServiceProviderboot 方法中调用 Passport::routes 方法。该方法会注册发出访问令牌并撤销访问令牌、客户端以及个人访问令牌所必需的路由:

<?php

namespace App\Providers;

use Laravel\Passport\Passport;
use Illuminate\Support\Facades\Gate;
use Illuminate\Foundation\Support\Providers\AuthServiceProvider as ServiceProvider;

class AuthServiceProvider extends ServiceProvider
{
    /**
     * The policy mappings for the application.
     *
     * @var array
     */
    protected $policies = [
        'App\Model' => 'App\Policies\ModelPolicy',
    ];

    /**
     * Register any authentication / authorization services.
     *
     * @return void
     */
    public function boot()
    {
        $this->registerPolicies();

        Passport::routes();
    }
}

最后,在 bootstrap/app.php 配置文件中,您应该通过调用 AuthServiceProvider 中的 boot 方法启用 Passport 的路由:

$app->register(App\Providers\AuthServiceProvider::class);

配置

AuthServiceProviderboot 方法中,您可以调用 Passport::tokensExpireInPassport::refreshTokensExpireIn 方法,分别用于设定访问令牌和刷新令牌的过期时间。这两个方法都接受一个 Carbon 实例用于设定令牌的过期时间:

/**
 * Register any authentication / authorization services.
 *
 * @return void
 */
public function boot()
{
    $this->registerPolicies();

    Passport::routes();

    Passport::tokensExpireIn(now()->addDays(15));

    Passport::refreshTokensExpireIn(now()->addDays(30));
}

注意:默认情况下,Passport 的令牌是长期有效的,上述方法的作用是用于指定多长时间之后令牌会过期,如果您希望您的令牌永远有效,则无需调用上述方法。

发放访问令牌

使用 OAuth2 授权码方式发放令牌时,您的应用需要与 Passport 的令牌发放接口进行交互,以发放访问令牌给其它客户端。这种方式下,编写整个认证流程会非常麻烦,不过别担心,接下来的内容会帮助您一步步的完成整个过程。

首先,让客户端应用通过用户的浏览器向您的应用发起请求,请求参数包括 client_idredirect_uriresponse_typescope 以及 state

其中,client_id 参数的值是由您通过 passport:client 命令创建的客户端 ID。redirect_uri 是客户端应用接收授权码的地址,response_type 的值应该设置为 codescope 参数用于指定用户授权的权限范围,state 参数则用于防范 CSRF 攻击。

客户端 ID 与密钥

如果您还没有创建任何客户端,可以先通过 passport:client 命令创建一个。例如,在 Homestead 中,执行下面的命令:

php artisan passport:client --redirect_uri=http://localhost

--redirect_uri 参数的值是客户端应用接收授权码的地址,您可以传递多个 --redirect_uri 参数,也可以使用通配符。当命令执行后,Passport 会为您创建客户端,并返回客户端的 ID 和密钥:

Client ID: 3
Client secret: KSPXwy5n1MZmxvIln6k6ubunh3X0aw5asdfkDSF

重定向请求

接下来,客户端应用需要将用户的浏览器重定向到您的应用上的 /oauth/authorize 路由:

Route::get('/redirect', function (Request $request) {
    $request->session()->put('state', $state = Str::random(40));

    $query = http_build_query([
        'client_id' => 'client-id',
        'redirect_uri' => 'http://example.com/callback',
        'response_type' => 'code',
        'scope' => '',
        'state' => $state,
    ]);

    return redirect('http://your-app.com/oauth/authorize?'.$query);
});

需要注意的是,/oauth/authorize 路由已经被 Passport 注册,您不需要自己定义该路由。该路由用于验证用户身份,并让用户授权给客户端应用。如果用户同意授权,则会被重定向到客户端应用指定的 redirect_uri,并附带授权码和 state 参数。

如果用户拒绝授权,则会重定向到 redirect_uri,并附带 error 参数,该参数的值为 access_denied

转换授权码为访问令牌

如果用户授权成功,客户端应用会收到一个授权码,接下来,客户端应用需要将授权码转换为访问令牌。此时,客户端应用需要向您的应用发送 POST 请求到 /oauth/token 路由,参数包括 grant_typecodeclient_idclient_secret,以及 redirect_uri

$response = Http::post('http://your-app.com/oauth/token', [
    'grant_type' => 'authorization_code',
    'client_id' => 'client-id',
    'client_secret' => 'client-secret',
    'code' => $request->code,
    'redirect_uri' => $request->redirect_uri,
]);

该请求的响应会包含 access_tokenrefresh_token,以及 expires_in 等字段。expires_in 的值表示访问令牌的有效期,单位为秒。

刷新令牌

访问令牌的有效期通常比较短,当令牌过期后,您需要使用刷新令牌来获取新的访问令牌。客户端应用需要向您的应用发送 POST 请求到 /oauth/token 路由,参数包括 grant_typerefresh_tokenclient_id,以及 client_secret

$response = Http::post('http://your-app.com/oauth/token', [
    'grant_type' => 'refresh_token',
    'refresh_token' => 'the-refresh-token',
    'client_id' => 'client-id',
    'client_secret' => 'client-secret',
    'scope' => '',
]);

响应会包含新的 access_tokenrefresh_token

密码授权令牌

OAuth2 密码授权方式允许您通过用户名和密码直接获取访问令牌。这种方式适用于您信任的客户端,例如您的移动应用。

要使用密码授权方式,首先需要创建一个密码授权的客户端:

php artisan passport:client --password

该命令会提示您输入客户端的名称,然后返回客户端的 ID 和密钥。

接下来,客户端应用需要向您的应用发送 POST 请求到 /oauth/token 路由,参数包括 grant_typeclient_idclient_secretusernamepassword,以及 scope

$response = Http::post('http://your-app.com/oauth/token', [
    'grant_type' => 'password',
    'client_id' => 'client-id',
    'client_secret' => 'client-secret',
    'username' => '[email protected]',
    'password' => 'my-password',
    'scope' => '',
]);

该请求的响应会包含 access_tokenrefresh_token,以及 expires_in 等字段。

客户端凭证授权令牌

客户端凭证授权适用于机器到机器的认证。例如,您有一个 API 需要从另一个服务访问,而不是由用户访问。

要使用客户端凭证授权方式,首先需要创建一个客户端:

php artisan passport:client --client

然后,客户端应用需要向您的应用发送 POST 请求到 /oauth/token 路由,参数包括 grant_typeclient_idclient_secret,以及 scope

$response = Http::post('http://your-app.com/oauth/token', [
    'grant_type' => 'client_credentials',
    'client_id' => 'client-id',
    'client_secret' => 'client-secret',
    'scope' => 'your-scope',
]);

响应会包含 access_tokenexpires_in 字段。

隐式授权令牌

隐式授权类似于授权码授权,但是它直接在浏览器中发放访问令牌,而不是通过中间授权码。这种方式通常用于单页应用或者纯前端应用。

要使用隐式授权,首先需要创建一个客户端:

php artisan passport:client --public

然后,客户端应用需要将用户的浏览器重定向到您的应用上的 /oauth/authorize 路由,参数包括 client_idredirect_uriresponse_typescope,以及 state

Route::get('/redirect', function () {
    $query = http_build_query([
        'client_id' => 'client-id',
        'redirect_uri' => 'http://example.com/callback',
        'response_type' => 'token',
        'scope' => '',
        'state' => Str::random(40),
    ]);

    return redirect('http://your-app.com/oauth/authorize?'.$query);
});

如果用户授权成功,则会重定向到 redirect_uri,并在 URL 的片段中包含 access_tokentoken_typeexpires_in,以及 state 参数。

个人访问令牌

有时候,用户可能希望为自己颁发一个访问令牌,而不需要经过完整的 OAuth2 流程。Passport 提供了此功能,允许用户通过您的 web 界面颁发个人访问令牌。

首先,您需要创建一个客户端用于颁发个人访问令牌:

php artisan passport:client --personal

该命令会提示您输入客户端的名称,然后返回客户端的 ID 和密钥。

接下来,您需要创建一个路由,允许用户查看他们的令牌,并创建新令牌:

Route::get('/settings', function () {
    return view('settings', [
        'tokens' => Auth::user()->tokens,
        'clients' => Auth::user()->clients,
    ]);
})->middleware('auth');

Route::post('/settings/token', function (Request $request) {
    $request->validate([
        'name' => 'required|max:255',
    ]);

    $token = Auth::user()->createToken($request->name);

    return view('token', ['token' => $token->accessToken]);
})->middleware('auth');

在视图中,您可以显示用户现有的令牌,并提供一个表单让用户创建新令牌:

<form method="POST" action="/settings/token">
    @csrf
    <input type="text" name="name" placeholder="Token name">
    <button type="submit">Create Token</button>
</form>

用户提交表单后,会创建一个新的个人访问令牌,并显示给用户。

保护路由

现在,用户已经可以获取访问令牌了,接下来需要保护您的 API 路由,只允许持有有效访问令牌的用户访问。

首先,在 bootstrap/app.php 中,取消对 $app->routeMiddleware 的注释,并添加 auth 中间件:

$app->routeMiddleware([
    'auth' => App\Http\Middleware\Authenticate::class,
]);

然后,在您的 API 路由中,使用 auth:api 中间件来保护:

$router->group(['middleware' => 'auth:api'], function () use ($router) {
    $router->get('/user', function () {
        return auth()->user();
    });
});

现在,当用户访问受保护的路由时,他们需要在请求头中携带访问令牌:

curl -H "Accept: application/json" -H "Authorization: Bearer {access-token}" http://your-app.com/user

撤销令牌

用户可能需要撤销他们颁发的令牌。要撤销令牌,可以在 App\User 模型中添加 tokens 关系:

public function tokens()
{
    return $this->hasMany(Token::class);
}

然后,您可以允许用户通过您的 web 界面撤销令牌:

Route::delete('/settings/token/{token}', function ($tokenId) {
    Auth::user()->tokens()->where('id', $tokenId)->delete();

    return redirect('/settings');
})->middleware('auth');

用户访问该路由时,指定的令牌将被撤销。

令牌作用域

Passport 允许您为令牌定义作用域,以限制令牌的访问权限。首先,在 AuthServiceProviderboot 方法中定义您的作用域:

use Laravel\Passport\Passport;

Passport::tokensCan([
    'place-orders' => 'Place orders',
    'check-status' => 'Check order status',
]);

当用户授权客户端时,可以请求这些作用域:

Route::get('/redirect', function () {
    $query = http_build_query([
        'client_id' => 'client-id',
        'redirect_uri' => 'http://example.com/callback',
        'response_type' => 'code',
        '

@SzymonOzog
Contributor

Could you share your config.json?

@davidsyoung
Author

davidsyoung commented Mar 23, 2025

Of course. I have tried a few different variations without much luck (different dtype settings, etc.):

cat config.json
{
  "architectures": [
    "DeepseekV3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV3Config",
    "AutoModel": "modeling_deepseek.DeepseekV3Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV3ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "ep_size": 1,
  "first_k_dense_replace": 3,
  "hidden_act": "silu",
  "hidden_size": 7168,
  "initializer_range": 0.02,
  "intermediate_size": 18432,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v3",
  "moe_intermediate_size": 2048,
  "moe_layer_freq": 1,
  "n_group": 8,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "norm_topk_prob": true,
  "num_attention_heads": 128,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 61,
  "num_key_value_heads": 128,
  "num_nextn_predict_layers": 1,
  "pad_token_id": 128815,
  "pretraining_tp": 1,
  "q_lora_rank": 1536,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 1.0,
    "mscale_all_dim": 1.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "rope_theta": 10000,
  "routed_scaling_factor": 2.5,
  "scoring_func": "sigmoid",
  "seq_aux": true,
  "tie_word_embeddings": false,
  "topk_group": 4,
  "topk_method": "noaux_tc",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.1",
  "unsloth_fixed": true,
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 129280
}

@SzymonOzog
Contributor

Hmm, looks fine to me. Do you know which layer is the first one to produce NaNs? I'll try to reproduce this, but I haven't seen it since switching to bfloat16 (I'm using Q4).

@davidsyoung
Author

How would I go about checking which layer produces NaNs? If you let me know, I'll do that, no problem! Thank you!

Unfortunately I can’t fit Q4 in!

@davidsyoung
Author

@SzymonOzog have you been able to reproduce this on your end?

@SzymonOzog
Contributor

@davidsyoung Sorry, I've been very GPU-poor this week and unable to run it locally. Hopefully I'll find some time to test it next week.

@davidsyoung
Author

No panic whatsoever. I can test on my side if there are any patches you'd like me to apply.

@SzymonOzog
Contributor

@davidsyoung How I tested it last time was just putting an assertion after every call to a GGUF-quantized layer in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/gguf.py and seeing where the NaNs are returned. It would also be nice to check whether the layers are running in bf16.
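
Roughly something like this, as a minimal sketch (the helper below and the commented call site are placeholders, not actual gguf.py code):

```python
import torch


def assert_no_nans(out: torch.Tensor, where: str) -> torch.Tensor:
    """Fail fast if a quantized layer's output contains NaNs, and say which one."""
    if torch.isnan(out).any():
        raise RuntimeError(
            f"NaN detected in {where} (dtype={out.dtype}, shape={tuple(out.shape)})"
        )
    return out


# Hypothetical call site inside gguf.py -- `quantized_matmul` stands in for
# whatever the real GGUF matmul/apply call is at each site:
#   out = quantized_matmul(x, layer)
#   out = assert_no_nans(out, where=type(layer).__name__)
# The dtype in the error message also shows whether the activations are bf16.
```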

@davidsyoung
Author

> How I tested it last time was just putting an assertion after every call to a GGUF-quantized layer in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/gguf.py and seeing where the NaNs are returned. It would also be nice to check whether the layers are running in bf16.

I can't say I'm good enough with coding to do this; however, I believe I have more useful information for this bug.

I believe the issue may actually be in the V0 Triton MLA backend! It also happens with this GPTQ quant:

https://huggingface.co/OPEA/DeepSeek-R1-int4-gptq-sym-inc

With tp=16, chunked prefill enabled.
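
For reference, a minimal way to trigger it is just a long chat completion against the server, roughly like this (host, port, model name, prompt, and sampling settings are taken from the log above; the openai client is simply one way to send the request, and the API key is a placeholder):

```python
from openai import OpenAI

# Standard OpenAI-compatible client pointed at the vLLM server above;
# the API key is a placeholder.
client = OpenAI(base_url="http://192.168.10.225:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {
            "role": "user",
            "content": "Show me a code snippet of a website's sticky header "
                       "in CSS and JavaScript.",
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=8000,  # long generation; the request in the log used 8171
    stream=True,
)

# The first few thousand tokens look fine, then the output drifts into
# unrelated text, as in the example output above.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```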

@davidsyoung
Author

@DefTruth I'm going to tag you, as you seem to be working on MLA quite a bit, along with a recent PR for the V1 engine.

Could this possibly be related?
