Prerequisites
Feature Description
Models like this exist, which change nothing but the number of active experts in a MoE model: https://huggingface.co/DavidAU/Qwen3-30B-A6B-16-Extreme/. Redditors seem to believe this is a smarter model: https://old.reddit.com/r/LocalLLaMA/comments/1kmlu2y/qwen330ba6b16extreme_is_fantastic/
Moreover, Grok seems adamant that it's completely valid to set the number of active experts PER REQUEST, without retraining/fine-tuning.
If this is all correct, then please add the ability to tweak the number of experts on the fly, per request, as a hyperparameter -- it would be hugely valuable for trading intelligence for speed at runtime.
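To make the requested knob concrete, here is a hedged sketch of what a per-request override could look like against llama.cpp's OpenAI-compatible server. The num_experts field is hypothetical -- it is exactly the parameter this issue proposes -- while the endpoint and default port are just llama-server's defaults:

```python
import requests

payload = {
    "messages": [{"role": "user", "content": "Refactor this module."}],
    "temperature": 0.6,
    # Hypothetical per-request hyperparameter proposed by this issue; it does not exist today.
    "num_experts": 16,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```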
Motivation
Consider, for example, aider's architect vs. coder modes. We could choose at runtime how much intelligence we want from the model by changing the hyperparameters sent with the aider.model.metadata.yml file (and since llama.cpp ignores the model name, we can simply create two model aliases with different numbers of active experts).
Possible Implementation
From Grok:
Technical Feasibility
In a typical MoE implementation, the gating network computes a probability distribution or score for every expert for a given input. During inference, the model selects the top-k experts based on these scores, computes their outputs, and combines them according to their weights. Here’s how this works in practice:
Gating Scores: The gating network outputs scores for all experts (e.g., 128 scores in Qwen3).
Top-k Selection: The inference process selects the k highest-scoring experts (e.g., k=8 during training).
Output Computation: Only the selected experts process the input, and their results are combined.
Since the gating network provides a full ranking of experts, it’s technically possible to adjust k at runtime. For example:
If you set k=4, you’d select the top-4 experts instead of the top-8.
If you set k=16, you’d select the top-16 experts.
This adjustment doesn’t require retraining because the gating scores are already computed; you’re simply changing how many experts you pick from that ranked list. In code terms, this might look like:
```python
import torch

# x: hidden state for one token; gate_network and experts come from the model
gate_scores = gate_network(x)                           # scores for all 128 experts
top_k = torch.topk(gate_scores, k=num_active_experts)   # select top-k; k is adjustable
top_k_weights = torch.softmax(top_k.values, dim=-1)     # renormalize the selected scores
selected_experts = [experts[i] for i in top_k.indices]  # dynamic k
output = sum(w * expert(x) for w, expert in zip(top_k_weights, selected_experts))
```
Here, num_active_experts can be a hyperparameter passed to the inference server, making it configurable per request.
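To make the fragment above self-contained, here is a minimal toy sketch of an MoE feed-forward layer whose top-k is a forward() argument. It is illustrative only: the class name, dimensions, and the softmax renormalization of the selected gate scores are assumptions, not Qwen3's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy MoE FFN with a runtime-adjustable number of active experts (illustration only)."""

    def __init__(self, d_model: int = 64, d_ff: int = 256, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor, num_active_experts: int = 2) -> torch.Tensor:
        scores = self.gate(x)                                    # [tokens, num_experts]
        top = torch.topk(scores, k=num_active_experts, dim=-1)   # per-token top-k
        weights = F.softmax(top.values, dim=-1)                  # renormalize selected scores
        out = torch.zeros_like(x)
        for slot in range(num_active_experts):
            idx = top.indices[:, slot]                           # chosen expert id per token
            w = weights[:, slot].unsqueeze(-1)
            for e in idx.unique():                               # route tokens to their expert
                mask = idx == e
                out[mask] += w[mask] * self.experts[int(e)](x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(4, 64)
fast = layer(tokens, num_active_experts=2)    # cheaper: fewer experts per token
smart = layer(tokens, num_active_experts=6)   # more compute per token, same weights
```

An inference server would only need to thread the per-request value through to that forward() argument.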
Are Normal MoE Models Capable of This?
Yes, normal MoE models are inherently capable of this flexibility, provided the implementation allows it. Most standard MoE frameworks—such as those in PyTorch, Hugging Face Transformers, or inference servers like vLLM—compute gating scores for all experts and use a configurable top-k selection. For Qwen3 30B A3B or similar models, the ability to change k depends on the inference code rather than the model architecture itself.
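As a concrete example of the point that this is an inference-code question: in Hugging Face Transformers, Mixtral-style MoE configs (including, as far as I can tell, Qwen3's MoE config) expose the trained top-k as num_experts_per_tok, so it can be overridden at load time -- globally, not per request. The field name and model id below are taken from the public configs but should be double-checked:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumption: the MoE config exposes the trained top-k as `num_experts_per_tok`
# (the default for Qwen3-30B-A3B is 8). This changes k globally at load time.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-30B-A3B")
cfg.num_experts_per_tok = 16
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B", config=cfg)
```

Whether the output quality holds up when k is changed is a separate question -- see the reply below.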
> Moreover, Grok seems adamant that it's completely valid to set the number of active experts PER REQUEST, without retraining/fine-tuning.
The problem with this idea is that the expected magnitude of the sum of m i.i.d. random vectors, relative to the sum of the n such vectors the model was trained to produce, is approximately sqrt(m/n). The expert outputs are very high dimensional, and experimentation shows the i.i.d. assumption isn't far off.
This appears to be the main reason why decreasing or increasing the number of active experts fails: it is akin to downscaling or upscaling the down_proj matrix in a non-MoE LLM (i.e., the expected hidden-state magnitude isn't preserved).
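The sqrt(m/n) claim is easy to check numerically. Below is a small sketch (using i.i.d. Gaussian stand-ins for the expert outputs, per the assumption stated above) comparing the norm of a sum of m vectors to a sum of n vectors:

```python
import torch

torch.manual_seed(0)
d, trials = 2048, 500           # high-dimensional vectors, averaged over many trials

def mean_sum_norm(k: int) -> float:
    # Mean L2 norm of the sum of k i.i.d. standard normal vectors of dimension d.
    v = torch.randn(trials, k, d)
    return v.sum(dim=1).norm(dim=-1).mean().item()

n, m = 8, 16                    # trained top-k vs. an overridden top-k
print(mean_sum_norm(m) / mean_sum_norm(n))   # ~1.414
print((m / n) ** 0.5)                        # sqrt(m/n) ≈ 1.414
```

Under these assumptions, doubling the number of active experts inflates the hidden-state magnitude by roughly 41% unless the combination weights are rescaled.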