Feature Request: dynamic number of experts (hyperparam per request) #13572

Open · 4 tasks done
lee-b opened this issue May 15, 2025 · 1 comment

Labels
enhancement New feature or request

Comments

@lee-b

lee-b commented May 15, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Models like this exist that change nothing but the number of active experts in a MoE model: https://huggingface.co/DavidAU/Qwen3-30B-A6B-16-Extreme/ . Redditors seem to believe this is a smarter model: https://old.reddit.com/r/LocalLLaMA/comments/1kmlu2y/qwen330ba6b16extreme_is_fantastic/

Moreover, Grok seems adamant that it's completely valid to set the number of active experts PER REQUEST, without retraining/fine-tuning.

If this is all correct, then please add the ability to tweak the number of experts on the fly, per request, as a hyperparameter -- it would be hugely valuable for trading intelligence for speed at runtime.
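
To make the ask concrete, here is a rough sketch of how such a per-request knob might be used from a client. This is purely hypothetical: the `num_experts` field does not exist in llama.cpp's server today (it is exactly what this issue is requesting), and the endpoint/port shown are just the usual OpenAI-compatible defaults.

```python
import requests

# Hypothetical request: "num_experts" is NOT an existing llama.cpp parameter;
# it illustrates the per-request hyperparameter this issue is asking for.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": "Summarise this diff..."}],
        "num_experts": 4,  # fewer active experts -> faster, less "smart" reply
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```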

Motivation

Consider, for example, aider's architect vs. coder modes. We could choose at runtime how much intelligence we want from the model by changing the hyperparameters sent via the aider.model.metadata.yml file (and since llama.cpp ignores the model name, we could simply create two model aliases, each activating a different number of experts).

Possible Implementation

From Grok:

Technical Feasibility
In a typical MoE implementation, the gating network computes a probability distribution or score for every expert for a given input. During inference, the model selects the top-k experts based on these scores, computes their outputs, and combines them according to their weights. Here’s how this works in practice:

  • Gating Scores: The gating network outputs scores for all experts (e.g., 128 scores in Qwen3).
  • Top-k Selection: The inference process selects the k highest-scoring experts (e.g., k=8 during training).
  • Output Computation: Only the selected experts process the input, and their results are combined.

Since the gating network provides a full ranking of experts, it’s technically possible to adjust k at runtime. For example:

  • If you set k=4, you’d select the top-4 experts instead of the top-8.
  • If you set k=16, you’d select the top-16 experts.

This adjustment doesn’t require retraining because the gating scores are already computed; you’re simply changing how many experts you pick from that ranked list. In code terms, this might look like:

```python
import torch

gate_scores = gate_network(input)  # Scores for all 128 experts
top_k_weights, top_k_indices = torch.topk(gate_scores, k=num_active_experts)  # Dynamic k
selected_experts = [experts[i] for i in top_k_indices]  # Gather the chosen experts
output = sum(w * expert(input) for w, expert in zip(top_k_weights, selected_experts))
```
Here, num_active_experts can be a hyperparameter passed to the inference server, making it configurable per request.

Are Normal MoE Models Capable of This?
Yes, normal MoE models are inherently capable of this flexibility, provided the implementation allows it. Most standard MoE frameworks—such as those in PyTorch, Hugging Face Transformers, or inference servers like vLLM—compute gating scores for all experts and use a configurable top-k selection. For Qwen3 30B A3B or similar models, the ability to change k depends on the inference code rather than the model architecture itself.
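
For what it's worth, here is a rough sketch of the same idea outside llama.cpp, assuming (as with Mixtral/Qwen-MoE configs in Hugging Face Transformers) that the number of active experts is exposed as the num_experts_per_tok config field and can be overridden at load time. This says nothing about whether the output quality holds up.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-30B-A3B"  # assumed HF repo id for the model discussed above
config = AutoConfig.from_pretrained(model_id)
config.num_experts_per_tok = 16  # originally 8; weights are untouched, only the top-k selection changes
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```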

@lee-b lee-b added the enhancement New feature or request label May 15, 2025
@jukofyork
Collaborator

jukofyork commented May 19, 2025

> Moreover, Grok seems adamant that it's completely valid to set the number of active experts PER REQUEST, without retraining/fine-tuning.

The problem with this idea is that if the model was trained to sum n (approximately) i.i.d. random expert output vectors, then summing m of them instead changes the expected magnitude of the result by a factor of approximately sqrt(m/n). The vectors are very high dimensional, and experimentation shows the i.i.d. random assumption isn't far off.

I explained this more in this post:

#11446 (comment)

This appears to be the main reason why decreasing or increasing the number of experts fails, and is akin to downscaling/upscaling the down_proj matrix in a non-MoE LLM (i.e., the expected hidden-state magnitude isn't preserved).
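
A quick standalone numerical sketch of that sqrt(m/n) factor (not llama.cpp code; it just sums zero-mean i.i.d. Gaussian vectors of a hidden-state-like dimension and compares the average norms):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, trials = 4096, 8, 16, 200  # hidden dim, trained top-k, modified top-k

def mean_sum_norm(k):
    # Average L2 norm of the sum of k i.i.d. zero-mean Gaussian vectors of dimension d
    return np.mean([np.linalg.norm(rng.standard_normal((k, d)).sum(axis=0)) for _ in range(trials)])

print(mean_sum_norm(m) / mean_sum_norm(n), np.sqrt(m / n))  # both come out close to 1.414
```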
