Prerequisites
Feature Description
Models like this exist, which change nothing but the number of active experts in a MoE model: https://huggingface.co/DavidAU/Qwen3-30B-A6B-16-Extreme/. Redditors seem to believe this is a smarter model: https://old.reddit.com/r/LocalLLaMA/comments/1kmlu2y/qwen330ba6b16extreme_is_fantastic/
Moreover, Grok seems adamant that it's completely valid to set the number of active experts PER REQUEST, without retraining/fine-tuning.
If this is all correct, then please add the ability to tweak the number of experts on the fly, per request, as a hyperparameter -- it would be hugely valuable for trading intelligence for speed at runtime.
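To make the requested knob concrete, here is a hedged sketch of what a per-request override could look like against llama.cpp's OpenAI-compatible server. The num_experts field is hypothetical -- it is exactly the parameter this issue proposes -- while the endpoint and default port are just llama-server's defaults:

```python
import requests

payload = {
    "messages": [{"role": "user", "content": "Refactor this module."}],
    "temperature": 0.6,
    # Hypothetical per-request hyperparameter proposed by this issue; it does not exist today.
    "num_experts": 16,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```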
Motivation
Consider, for example, aider's architect vs. coder modes. We could choose at runtime how much intelligence we want from the model by changing the hyperparameters sent with the aider.model.metadata.yml file (and since llama.cpp ignores the model name, we can simply create two model aliases with different numbers of active experts).
Possible Implementation
From Grok:
Technical Feasibility
In a typical MoE implementation, the gating network computes a probability distribution or score for every expert for a given input. During inference, the model selects the top-k experts based on these scores, computes their outputs, and combines them according to their weights. Here’s how this works in practice:
Gating Scores: The gating network outputs scores for all experts (e.g., 128 scores in Qwen3).
Top-k Selection: The inference process selects the k highest-scoring experts (e.g., k=8 during training).
Output Computation: Only the selected experts process the input, and their results are combined.
Since the gating network provides a full ranking of experts, it’s technically possible to adjust k at runtime. For example:
If you set k=4, you’d select the top-4 experts instead of the top-8.
If you set k=16, you’d select the top-16 experts.
This adjustment doesn’t require retraining because the gating scores are already computed; you’re simply changing how many experts you pick from that ranked list. In code terms, this might look like:
```python
import torch

# x: hidden state for one token; gate_network and experts come from the model
gate_scores = gate_network(x)                           # scores for all 128 experts
top_k = torch.topk(gate_scores, k=num_active_experts)   # select top-k; k is adjustable
top_k_weights = torch.softmax(top_k.values, dim=-1)     # renormalize the selected scores
selected_experts = [experts[i] for i in top_k.indices]  # dynamic k
output = sum(w * expert(x) for w, expert in zip(top_k_weights, selected_experts))
```
Here, num_active_experts can be a hyperparameter passed to the inference server, making it configurable per request.
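To make the fragment above self-contained, here is a minimal toy sketch of an MoE feed-forward layer whose top-k is a forward() argument. It is illustrative only: the class name, dimensions, and the softmax renormalization of the selected gate scores are assumptions, not Qwen3's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy MoE FFN with a runtime-adjustable number of active experts (illustration only)."""

    def __init__(self, d_model: int = 64, d_ff: int = 256, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor, num_active_experts: int = 2) -> torch.Tensor:
        scores = self.gate(x)                                    # [tokens, num_experts]
        top = torch.topk(scores, k=num_active_experts, dim=-1)   # per-token top-k
        weights = F.softmax(top.values, dim=-1)                  # renormalize selected scores
        out = torch.zeros_like(x)
        for slot in range(num_active_experts):
            idx = top.indices[:, slot]                           # chosen expert id per token
            w = weights[:, slot].unsqueeze(-1)
            for e in idx.unique():                               # route tokens to their expert
                mask = idx == e
                out[mask] += w[mask] * self.experts[int(e)](x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(4, 64)
fast = layer(tokens, num_active_experts=2)    # cheaper: fewer experts per token
smart = layer(tokens, num_active_experts=6)   # more compute per token, same weights
```

An inference server would only need to thread the per-request value through to that forward() argument.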
Are Normal MoE Models Capable of This?
Yes, normal MoE models are inherently capable of this flexibility, provided the implementation allows it. Most standard MoE frameworks—such as those in PyTorch, Hugging Face Transformers, or inference servers like vLLM—compute gating scores for all experts and use a configurable top-k selection. For Qwen3 30B A3B or similar models, the ability to change k depends on the inference code rather than the model architecture itself.
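As a concrete example of the point that this is an inference-code question: in Hugging Face Transformers, Mixtral-style MoE configs (including, as far as I can tell, Qwen3's MoE config) expose the trained top-k as num_experts_per_tok, so it can be overridden at load time -- globally, not per request. The field name and model id below are taken from the public configs but should be double-checked:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumption: the MoE config exposes the trained top-k as `num_experts_per_tok`
# (the default for Qwen3-30B-A3B is 8). This changes k globally at load time.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-30B-A3B")
cfg.num_experts_per_tok = 16
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B", config=cfg)
```

Whether the output quality holds up when k is changed is a separate question -- see the reply below.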
> Moreover, Grok seems adamant that it's completely valid to set the number of active experts PER REQUEST, without retraining/fine-tuning.
The problem with this idea is that the expected magnitude of the sum of m i.i.d. random vectors, relative to the sum of the n such vectors the model was trained to produce, is approximately sqrt(m/n). The expert outputs are very high dimensional, and experimentation shows the i.i.d. assumption isn't far off.
This appears to be the main reason why decreasing or increasing the number of active experts fails: it is akin to downscaling or upscaling the down_proj matrix in a non-MoE LLM (i.e., the expected hidden-state magnitude isn't preserved).
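The sqrt(m/n) claim is easy to check numerically. Below is a small sketch (using i.i.d. Gaussian stand-ins for the expert outputs, per the assumption stated above) comparing the norm of a sum of m vectors to a sum of n vectors:

```python
import torch

torch.manual_seed(0)
d, trials = 2048, 500           # high-dimensional vectors, averaged over many trials

def mean_sum_norm(k: int) -> float:
    # Mean L2 norm of the sum of k i.i.d. standard normal vectors of dimension d.
    v = torch.randn(trials, k, d)
    return v.sum(dim=1).norm(dim=-1).mean().item()

n, m = 8, 16                    # trained top-k vs. an overridden top-k
print(mean_sum_norm(m) / mean_sum_norm(n))   # ~1.414
print((m / n) ** 0.5)                        # sqrt(m/n) ≈ 1.414
```

Under these assumptions, doubling the number of active experts inflates the hidden-state magnitude by roughly 41% unless the combination weights are rescaled.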