
Dockerfile.ppc64le changes to move to UBI #15402

Status: Merged (3 commits merged into vllm-project:main on Mar 25, 2025)

Conversation

@Shafi-Hussain (Contributor) commented Mar 24, 2025

What was changed?

  1. The torch, torchvision, and torchaudio dependencies for ppc64le have been updated to stay in sync with x86.
  2. Dockerfile.ppc64le has been updated to use UBI9 as the base image, with dependencies built from source.

Build & Test Instructions

Build

# podman build -t vllmups -f Dockerfile.ppc64le . --jobs=0

# podman images
REPOSITORY                                   TAG             IMAGE ID      CREATED       SIZE
localhost/vllmups                            latest          941a39050cd7  43 hours ago  1.82 GB
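
The target architecture of the resulting image can be double-checked with podman inspect, which should report ppc64le for this build (an illustrative extra step, not part of the PR; image name as used above):

# podman image inspect --format '{{.Architecture}}' localhost/vllmups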

Test

# podman run -idt --name=vllm --entrypoint=/bin/bash localhost/vllmups
# podman exec -it vllm bash
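
Before starting the server, the from-source torch build inside the container can be sanity-checked (illustrative command, assuming the container name above; exact versions will vary):

# podman exec -it vllm python -c 'import torch, platform; print(torch.__version__, platform.machine())'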

Verify the OpenAI-compatible endpoint inside the running container

# python -m vllm.entrypoints.openai.api_server
INFO 03-24 14:28:18 [__init__.py:256] Automatically detected platform cpu.
INFO 03-24 14:28:20 [api_server.py:981] vLLM API server version 0.8.2.dev49+gda6ea29f.d20250322
INFO 03-24 14:28:20 [api_server.py:982] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='facebook/opt-125m', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, 
enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 6.69MB/s]
INFO 03-24 14:28:20 [config.py:2549] For POWERPC, we cast models to bfloat16 instead of using float16 by default. Float16 is not currently supported for POWERPC.
WARNING 03-24 14:28:20 [config.py:2593] Casting torch.float16 to torch.bfloat16.
INFO 03-24 14:28:27 [config.py:585] This model supports multiple tasks: {'score', 'embed', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
WARNING 03-24 14:28:27 [arg_utils.py:1783] device type=cpu is not supported by the V1 Engine. Falling back to V0.
WARNING 03-24 14:28:27 [cpu.py:94] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
WARNING 03-24 14:28:27 [cpu.py:107] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 03-24 14:28:27 [api_server.py:241] Started engine process with PID 33
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 6.68MB/s]
vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 11.7MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 37.0MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 6.38MB/s]
INFO 03-24 14:28:30 [__init__.py:256] Automatically detected platform cpu.
INFO 03-24 14:28:31 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev49+gda6ea29f.d20250322) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.39MB/s]
INFO 03-24 14:28:32 [cpu.py:40] Using Torch SDPA backend.
INFO 03-24 14:28:32 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-24 14:28:32 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-24 14:28:33 [weight_utils.py:257] Using model weights format ['*.bin']
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████| 251M/251M [00:01<00:00, 230MB/s]
INFO 03-24 14:28:34 [weight_utils.py:273] Time spent downloading weights for facebook/opt-125m: 1.617761 seconds
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.61it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.60it/s]

INFO 03-24 14:28:34 [loader.py:429] Loading weights took 0.18 seconds
INFO 03-24 14:28:34 [executor_base.py:111] # cpu blocks: 7281, # CPU blocks: 0
INFO 03-24 14:28:34 [executor_base.py:116] Maximum concurrency for 2048 tokens per request: 56.88x
INFO 03-24 14:28:34 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 0.14 seconds
INFO 03-24 14:28:35 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-24 14:28:35 [launcher.py:26] Available routes are:
INFO 03-24 14:28:35 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /health, Methods: GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /load, Methods: GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /version, Methods: GET
INFO 03-24 14:28:35 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /pooling, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /score, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /rerank, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 03-24 14:28:35 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
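
Once startup is complete, the OpenAI-compatible endpoint can be exercised with a simple completions request against the default facebook/opt-125m model (illustrative check, assuming curl is available in the container or port 8000 is published to the host; output omitted):

# curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'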


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Mar 24, 2025
@Shafi-Hussain Shafi-Hussain force-pushed the vllm-dockerfile-ppc64le branch from 334b352 to 328128e Compare March 24, 2025 15:07
@DarkLight1337 (Member) commented:

Please fix the commit errors

@Shafi-Hussain Shafi-Hussain marked this pull request as draft March 25, 2025 05:15
@Shafi-Hussain Shafi-Hussain force-pushed the vllm-dockerfile-ppc64le branch from 67c1530 to 328128e Compare March 25, 2025 06:08
@Shafi-Hussain Shafi-Hussain marked this pull request as ready for review March 25, 2025 06:43
@Shafi-Hussain (Contributor, Author) commented:

> Please fix the commit errors

@DarkLight1337 The failed Buildkite jobs are failing on a different Dockerfile that uses the NVIDIA base image, not on the changes made in this PR.

@DarkLight1337 (Member) commented:

Can you push a commit to trigger a new build?

@mkumatag commented:

> Can you push a commit to trigger a new build?

Maybe a dummy commit.

Signed-off-by: Md. Shafi Hussain <[email protected]>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 25, 2025 07:06
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 25, 2025
@DarkLight1337 DarkLight1337 merged commit 3e2f37a into vllm-project:main Mar 25, 2025
67 checks passed
erictang000 pushed a commit to erictang000/vllm that referenced this pull request Mar 25, 2025
wrmedford pushed a commit to wrmedford/vllm that referenced this pull request Mar 26, 2025
lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 2, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025