An Empirical Study · 306 Runs · Qwen3.5-35B-A3B · Apple M2 Ultra
The key finding, explained simply.
Speculative decoding (SD) uses a tiny "draft" model to guess what a big model will say, then the big model checks all the guesses at once. Usually this works because the draft gets many guesses right. But when the draft is tiny (0.8B) and the target is huge (35B MoE), almost none of the guesses are correct (<4%). So why does it still speed things up by 18–30%?
The secret: checking 16 wrong answers at once is still faster than generating 1 answer at a time. The big model has to load all 35 billion parameters from memory for each answer. When it checks 16 draft tokens in one batch, it loads those weights once instead of 16 separate times. This "batch verification amortization" saves memory bandwidth — which is the real bottleneck for MoE models.
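The loop described above can be sketched in a few lines. This is a toy greedy-acceptance version in which `target` and `draft` are hypothetical single-token predictors standing in for real models; it shows the draft-propose / batch-verify structure, and preserves SD's key invariant that the output is identical to what the target alone would have produced.

```python
def speculative_decode(target, draft, prompt, gamma, rounds):
    """Toy speculative decoding with greedy acceptance.

    target, draft: callables mapping a token list to the next token
    (stand-ins for real models). gamma: draft tokens per round.
    """
    seq = list(prompt)
    for _ in range(rounds):
        # 1. Draft autoregressively proposes gamma cheap guesses.
        proposed = []
        for _ in range(gamma):
            proposed.append(draft(seq + proposed))
        # 2. Target checks all gamma positions in ONE batched pass
        #    (in a real engine this loads the target weights once,
        #    which is the amortization described above).
        accepted = []
        for tok in proposed:
            if target(seq + accepted) == tok:
                accepted.append(tok)                     # guess was right
            else:
                accepted.append(target(seq + accepted))  # target's correction
                break                                    # discard the rest
        seq += accepted
    return seq
```

Because every appended token equals what the target would emit given the same prefix, the output matches plain target-only decoding; only the wall-clock cost differs.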
Traditional SD relies on draft accuracy. On MoE, a different mechanism dominates.
Traditional SD theory says low acceptance means no speedup. Our MoE results break this rule. With <4% acceptance (essentially all drafts rejected), we still get a 1.30× speedup. The key insight: on memory-bandwidth-bound models (where weight loading, not computation, is the bottleneck), batch verification itself is the optimization, not the accepted tokens. This is a different speedup mechanism from the one textbooks describe.
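A back-of-envelope cost model makes the mechanism concrete. Every quantity below is an assumption for illustration, not a measurement: `t_target` is one single-token target step (roughly one full weight load when bandwidth-bound), `t_draft` one draft step, `verify_factor` the cost of one batched γ-token verification pass relative to a single-token step, and `p` the per-token acceptance probability.

```python
def sd_speedup(t_target, t_draft, gamma, p, verify_factor):
    """Expected SD speedup over plain autoregressive decoding.

    Expected tokens per round uses the standard geometric formula
    (accepted run plus one correction token). Round time is gamma
    draft steps plus one batched verification pass.
    """
    tokens_per_round = (1 - p ** (gamma + 1)) / (1 - p)
    round_time = gamma * t_draft + verify_factor * t_target
    return tokens_per_round * t_target / round_time

# Illustrative, made-up numbers: even with p near zero, SD can win
# whenever drafting is cheap and a batched verification pass costs
# less than gamma sequential single-token steps.
low_acceptance = sd_speedup(t_target=1.0, t_draft=0.005,
                            gamma=16, p=0.01, verify_factor=0.85)
```

The model also recovers the textbook regime: raising `p` toward 1 multiplies the tokens produced per round, which is where traditional SD gets its gains.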
1.18–1.30× speedup despite <4% acceptance rate. The speedup comes from batch verification amortizing memory bandwidth, not from accepted draft tokens.
SD speedup scales with total parameter footprint (memory bandwidth), not active compute. MoE with 35B total, 3B active gets more speedup than Dense 4B but less than Dense 9B. Dense 4B is the key control: its active params (4B) ≈ MoE's (3B), so the only differing variable is total params, isolating total footprint as the causal driver.
The 0.8B draft beats the 2B by 0.12–0.14× despite its lower acceptance rate. Since almost no drafts are accepted anyway, the cheapest drafter wins.
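Using the standalone throughputs from the baseline table (138.0 tok/s for the 0.8B, 110.4 tok/s for the 2B), the per-round drafting cost at γ=16 can be compared directly. This sketch deliberately ignores acceptance and verification, since almost nothing is accepted with either drafter.

```python
GAMMA = 16

# Per-round drafting time in seconds, from standalone decode speeds.
draft_08b = GAMMA / 138.0   # 0.8B drafter
draft_2b = GAMMA / 110.4    # 2B drafter

# Milliseconds saved per verification round by the smaller drafter.
saving_ms = (draft_2b - draft_08b) * 1000
```

Roughly 29 ms saved per round, which compounds over a long generation and outweighs the 2B's slightly better (but still near-zero) acceptance.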
γ=16 beats γ=4 even though acceptance drops from 2.6% to 0.2%. Batch verification cost grows sub-linearly: verifying 16 tokens costs only ~1.5× as much as verifying 4.
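The sub-linear claim can be turned into per-token numbers. Taking the ~1.5× cost ratio stated above at face value (verifying 16 tokens costs about 1.5× as much as verifying 4), the verification cost per drafted token falls sharply at larger γ:

```python
# Cost of one gamma=4 verification pass, in arbitrary units.
cost_4 = 1.0
# Per the ~1.5x ratio above, gamma=16 costs only 1.5x as much.
cost_16 = 1.5 * cost_4

per_token_4 = cost_4 / 4      # 0.25 units per drafted token
per_token_16 = cost_16 / 16   # ~0.094 units per drafted token

reduction = per_token_4 / per_token_16  # ~2.7x cheaper per token
```

So even though almost every extra token is rejected, each rejected token at γ=16 is roughly 2.7× cheaper to check than at γ=4.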
306 runs across 5 experiment suites on Apple M2 Ultra (192 GB, 800 GB/s).
| Model | Active Params | Total Params | tok/s | Note |
|---|---|---|---|---|
| Dense 0.8B | 0.8B | 0.8B | 138.0 | Draft model (fastest) |
| Dense 2B | 2.0B | 2.0B | 110.4 | Larger draft |
| ★ MoE Q4 (35B-A3B) | 3.0B | 35B | 55.3 | 3B active but 35B in memory |
| ★ MoE Q8 (35B-A3B) | 3.0B | 35B | 49.9 | Higher precision, larger file |
| Dense 4B | 4.0B | 4.0B | 67.2 | More active params than MoE, yet faster |
| Dense 9B | 9.0B | 9.0B | 33.4 | Most bandwidth-bound |
The MoE model has only 3B active parameters, fewer than Dense 4B. So it should be faster, right? Wrong. MoE runs at 55 tok/s while Dense 4B runs at 67 tok/s. The reason: MoE has 35B total parameters sitting in memory, and the chip has to load all of them for each token, even though only 3B are used. This is the "MoE tax": you pay for the whole building even if you only use one office.
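The tax is easy to quantify in bytes moved per token, under two stated assumptions: (a) roughly 0.57 bytes per parameter for Q4_K_M, a common rough figure rather than something measured here, and (b) the full expert set streaming from memory each step, as the paragraph above argues.

```python
BYTES_PER_PARAM_Q4 = 0.57  # rough Q4_K_M average; an assumption

moe_total = 35e9 * BYTES_PER_PARAM_Q4   # ~20 GB per token if all experts stream
moe_active = 3e9 * BYTES_PER_PARAM_Q4   # ~1.7 GB if only active experts counted
dense_4b = 4e9 * BYTES_PER_PARAM_Q4     # ~2.3 GB for the Dense 4B control
```

Under the full-footprint assumption, the MoE moves roughly 9× the bytes of Dense 4B per token despite having fewer active parameters, which is consistent with it decoding slower.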
| Target | γ | tok/s | Speedup | Acceptance |
|---|---|---|---|---|
| MoE Q4 | 4 | 63.4 | 1.15× | 2.6% |
| MoE Q4 | 8 | 65.2 | 1.18× | 1.1% |
| ★ MoE Q4 | 16 | 69.7 | 1.26× | 0.2% |
| MoE Q8 | 4 | 57.5 | 1.15× | 2.7% |
| MoE Q8 | 8 | 60.9 | 1.22× | 1.0% |
| ★ MoE Q8 | 16 | 64.8 | 1.30× | 0.2% |
| Dense 4B | 4 | 70.9 | 1.05× | 3.0% |
| Dense 4B | 8 | 65.9 | 0.98× | 1.1% |
| Dense 4B | 16 | 75.1 | 1.12× | 0.4% |
| Dense 9B | 4 | 32.8 | 0.98× | 2.3% |
| Dense 9B | 8 | 55.0 | 1.64× | 0.9% |
| ★ Dense 9B | 16 | 67.7 | 2.03× | 0.4% |
| Target | γ | 0.8B Speedup | 0.8B Acceptance | 2B Speedup | 2B Acceptance | Winner |
|---|---|---|---|---|---|---|
| MoE Q4 | 4 | 1.15× | 2.6% | 1.03× | 4.0% | 0.8B ✓ |
| MoE Q4 | 8 | 1.18× | 1.1% | 1.05× | 1.7% | 0.8B ✓ |
| MoE Q4 | 16 | 1.26× | 0.2% | 1.12× | 0.9% | 0.8B ✓ |
| MoE Q8 | 4 | 1.15× | 2.7% | 1.03× | 3.6% | 0.8B ✓ |
| MoE Q8 | 8 | 1.22× | 1.0% | 1.09× | 2.0% | 0.8B ✓ |
| MoE Q8 | 16 | 1.30× | 0.2% | 1.16× | 0.9% | 0.8B ✓ |
Why compare with Dense 4B instead of Dense 9B? Dense 4B has 4B active params, close to MoE's 3B. If speedup depended only on active compute, the two should behave similarly. But MoE gets the higher speedup; the only remaining explanation is its larger total parameter footprint (35B vs 4B). Dense 9B differs from MoE in both active and total params, so it cannot isolate which factor matters. We therefore use Dense 9B as a bandwidth upper-bound reference and Dense 4B as the controlled causal comparison.
| Component | Specification |
|---|---|
| Chip | Apple M2 Ultra |
| Unified Memory | 192 GB (LPDDR5) |
| Memory Bandwidth | 800 GB/s |
| GPU Cores | 76 (Metal) |
| CPU Cores | 24 (16P + 8E) |
| Framework | llama.cpp v8240 / v8280 |
| Target Model | Qwen3.5-35B-A3B (Q4_K_M / Q8_0) |
| Draft Models | Qwen3.5-0.8B (Q8), Qwen3.5-2B (Q8) |
| Total Runs | 306 (5 experiment suites) |
| Duration | ~41 minutes |
@misc{atomgradient2026sdmoe,
title = {Does Speculative Decoding Help Mixture-of-Experts?
An Empirical Study on Qwen3.5-35B-A3B},
author = {AtomGradient},
year = {2026},
url = {https://github.com/AtomGradient/speculative-moe-research}
}