
Improve configs - SchedulerConfig #16533

Merged · 13 commits · Apr 14, 2025
Changes from 6 commits
144 changes: 89 additions & 55 deletions vllm/config.py
@@ -1522,6 +1522,9 @@ def __post_init__(self):
self.ignore_patterns = ["original/**/*"]


DistributedExecutorBackend = Literal["ray", "mp", "uni", "external_launcher"]


@config
@dataclass
class ParallelConfig:
@@ -1563,7 +1566,7 @@ class ParallelConfig:
placement_group: Optional["PlacementGroup"] = None
"""ray distributed model workers placement group."""

distributed_executor_backend: Optional[Union[str,
distributed_executor_backend: Optional[Union[DistributedExecutorBackend,
type["ExecutorBase"]]] = None
"""Backend to use for distributed model
workers, either "ray" or "mp" (multiprocessing). If the product
@@ -1687,7 +1690,7 @@ def __post_init__(self) -> None:
# current node and we aren't in a ray placement group.

from vllm.executor import ray_utils
backend = "mp"
backend: DistributedExecutorBackend = "mp"
ray_found = ray_utils.ray_is_available()
if current_platform.is_neuron():
# neuron uses single process to control multiple devices
@@ -1755,92 +1758,123 @@ def _verify_args(self) -> None:
"worker_extension_cls must be a string (qualified class name).")


SchedulerPolicy = Literal["fcfs", "priority"]


@config
@dataclass
class SchedulerConfig:
"""Scheduler configuration."""

runner_type: str = "generate" # The runner type to launch for the model.
runner_type: RunnerType = "generate"
"""The runner type to launch for the model."""

# Maximum number of tokens to be processed in a single iteration.
max_num_batched_tokens: int = field(default=None) # type: ignore
max_num_batched_tokens: int = None # type: ignore
"""Maximum number of tokens to be processed in a single iteration.

This config has no static default. If left unspecified by the user, it will
be set in `EngineArgs.create_engine_config` based on the usage context."""

# Maximum number of sequences to be processed in a single iteration.
max_num_seqs: int = 128
max_num_seqs: int = None # type: ignore
DarkLight1337 (Member) · Apr 14, 2025:

Since None is resolved inside EngineArgs or ModelConfig, I feel that we should not allow it to be None when initializing SchedulerConfig.

hmellor (Member, Author):

The default is None so that the default in EngineArgs is None.

We have the same problem with max_num_batched_tokens and max_model_len. They can't be optional in the config because maths is done on them in __post_init__, but they must have None defaults so that EngineArgs can have them as optional.

Where would you change behaviour so that the defaults here are not None?

DarkLight1337 (Member) · Apr 14, 2025:

Maybe we can have a from_optional method, just like the one for SamplingParams, to resolve the None values.

Member:

Using __post_init__ to overwrite None values leads to a bunch of issues with type checking, since downstream access of these attributes will unnecessarily need to consider the None case in order to avoid type-checking errors.

hmellor (Member, Author):

> Maybe we can have a from_optional method, just like the one for SamplingParams, to resolve the None values.

Would we then need to change everywhere SchedulerConfig is instantiated? And then remember to instantiate it that way?

> Using __post_init__ to overwrite None values leads to a bunch of issues with type checking since downstream access of these attributes will unnecessarily need to consider the None case.

I agree, but that's why in SchedulerConfig the type is int while in EngineArgs the type is Optional[int], so that the type checking is happy.

hmellor (Member, Author) · Apr 14, 2025:

I think it's worth noting that, prior to this PR, this is how we handled max_num_batched_tokens. I changed the other two here so that they would no longer have arbitrary defaults that could be inherited by EngineArgs.

Member:

As per offline discussion, this is the cleanest way given how the configs are set up right now. We can address this later.
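
For illustration, a minimal sketch of the from_optional idea discussed above. This is not what the PR implements (the PR keeps the None defaults and resolves them in EngineArgs.create_engine_config); the class name and fallback values below are placeholders, not vLLM's real usage-context defaults.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfigSketch:
    # Fields stay plain `int`, so downstream code never has to narrow away None.
    max_num_batched_tokens: int
    max_num_seqs: int
    max_model_len: int

    @classmethod
    def from_optional(
        cls,
        max_num_batched_tokens: Optional[int] = None,
        max_num_seqs: Optional[int] = None,
        max_model_len: Optional[int] = None,
    ) -> "SchedulerConfigSketch":
        # Resolve None values at construction time instead of in __post_init__.
        return cls(
            max_num_batched_tokens=(max_num_batched_tokens
                                    if max_num_batched_tokens is not None
                                    else 2048),
            max_num_seqs=max_num_seqs if max_num_seqs is not None else 128,
            max_model_len=max_model_len if max_model_len is not None else 8192,
        )

The drawback raised above still applies: every caller would have to remember to construct the config through from_optional rather than the dataclass constructor.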

"""Maximum number of sequences to be processed in a single iteration.

This config has no static default. If left unspecified by the user, it will
be set in `EngineArgs.create_engine_config` based on the usage context."""

# Maximum length of a sequence (including prompt and generated text).
max_model_len: int = 8192
max_model_len: int = None # type: ignore
"""Maximum length of a sequence (including prompt and generated text). This
is usually set in ModelConfig and should not be set here."""

# Maximum number of sequences that can be partially prefilled concurrently
max_num_partial_prefills: int = 1
"""For chunked prefill, the maximum number of sequences that can be
partially prefilled concurrently."""

# Maximum number of "very long prompt" sequences that can be prefilled
# concurrently (long is defined by long_prefill_threshold)
max_long_partial_prefills: int = 1
"""For chunked prefill, the maximum number of prompts longer than
long_prefill_token_threshold that will be prefilled concurrently. Setting
this less than max_num_partial_prefills will allow shorter prompts to jump
the queue in front of longer prompts in some cases, improving latency."""

# calculate context length that determines which sequences are
# considered "long"
long_prefill_token_threshold: int = 0
"""For chunked prefill, a request is considered long if the prompt is
longer than this number of tokens."""

# The number of slots to allocate per sequence per
# step, beyond the known token ids. This is used in speculative
# decoding to store KV activations of tokens which may or may not be
# accepted.
num_lookahead_slots: int = 0
"""The number of slots to allocate per sequence per
step, beyond the known token ids. This is used in speculative
decoding to store KV activations of tokens which may or may not be
accepted.

NOTE: This will be replaced by speculative config in the future; it is
present to enable correctness tests until then."""

# Apply a delay (of delay factor multiplied by previous
# prompt latency) before scheduling next prompt.
delay_factor: float = 0.0
"""Apply a delay (of delay factor multiplied by previous
prompt latency) before scheduling next prompt."""

# If True, prefill requests can be chunked based
# on the remaining max_num_batched_tokens.
enable_chunked_prefill: bool = False
enable_chunked_prefill: bool = None # type: ignore
"""If True, prefill requests can be chunked based
on the remaining max_num_batched_tokens."""

is_multimodal_model: bool = False
"""True if the model is multimodal."""

# TODO (ywang96): Make this configurable.
max_num_encoder_input_tokens: int = field(init=False)
"""Multimodal encoder compute budget, only used in V1.

NOTE: This is not currently configurable. It will be overridden by
max_num_batched_tokens in case max multimodal embedding size is larger."""

# TODO (ywang96): Make this configurable.
encoder_cache_size: int = field(init=False)
"""Multimodal encoder cache size, only used in V1.

NOTE: This is not currently configurable. It will be overridden by
max_num_batched_tokens in case max multimodal embedding size is larger."""

# NOTE: The following multimodal encoder budget will be initialized to
# max_num_batched_tokens and overridden in case max multimodal embedding
# size is larger.
# TODO (ywang96): Make these configurable.
# Multimodal encoder compute budget, only used in V1
max_num_encoder_input_tokens: int = field(default=None) # type: ignore

# Multimodal encoder cache size, only used in V1
encoder_cache_size: int = field(default=None) # type: ignore

# Whether to perform preemption by swapping or
# recomputation. If not specified, we determine the mode as follows:
# We use recomputation by default since it incurs lower overhead than
# swapping. However, when the sequence group has multiple sequences
# (e.g., beam search), recomputation is not currently supported. In
# such a case, we use swapping instead.
preemption_mode: Optional[str] = None
"""Whether to perform preemption by swapping or
recomputation. If not specified, we determine the mode as follows:
We use recomputation by default since it incurs lower overhead than
swapping. However, when the sequence group has multiple sequences
(e.g., beam search), recomputation is not currently supported. In
such a case, we use swapping instead."""

num_scheduler_steps: int = 1
"""Maximum number of forward steps per scheduler call."""

multi_step_stream_outputs: bool = False
multi_step_stream_outputs: bool = True
"""If False, then multi-step will stream outputs at the end of all steps"""

# Private API. If used, scheduler sends delta data to
# workers instead of an entire data. It should be enabled only
# when SPMD worker architecture is enabled. I.e.,
# VLLM_USE_RAY_SPMD_WORKER=1
send_delta_data: bool = False

# The scheduling policy to use. "fcfs" (default) or "priority".
policy: str = "fcfs"
"""Private API. If used, scheduler sends delta data to
workers instead of the entire data. It should be enabled only
when SPMD worker architecture is enabled. I.e.,
VLLM_USE_RAY_SPMD_WORKER=1"""

policy: SchedulerPolicy = "fcfs"
"""The scheduling policy to use:\n
- "fcfs" means first come first served, i.e. requests are handled in order
of arrival.\n
- "priority" means requests are handled based on given priority (lower
value means earlier handling) and time of arrival deciding any ties)."""

chunked_prefill_enabled: bool = field(init=False)
"""True if chunked prefill is enabled."""

# If set to true and chunked prefill is enabled, we do not want to
# partially schedule a multimodal item. Only used in V1
# This ensures that if a request has a mixed prompt
# (like text tokens TTTT followed by image tokens IIIIIIIIII) where only
# some image tokens can be scheduled (like TTTTIIIII, leaving IIIII),
# it will be scheduled as TTTT in one step and IIIIIIIIII in the next.
disable_chunked_mm_input: bool = False
"""If set to true and chunked prefill is enabled, we do not want to
partially schedule a multimodal item. Only used in V1
This ensures that if a request has a mixed prompt
(like text tokens TTTT followed by image tokens IIIIIIIIII) where only
some image tokens can be scheduled (like TTTTIIIII, leaving IIIII),
it will be scheduled as TTTT in one step and IIIIIIIIII in the next."""

# scheduler class or path. "vllm.core.scheduler.Scheduler" (default)
# or "mod.custom_class".
scheduler_cls: Union[str, type[object]] = "vllm.core.scheduler.Scheduler"
"""The scheduler class to use. "vllm.core.scheduler.Scheduler" is the
default scheduler. Can be a class directly or the path to a class of form
"mod.custom_class"."""

def compute_hash(self) -> str:
"""