An Empirical Study · 306 Runs · Qwen3.5-35B-A3B · Apple M2 Ultra
The key finding, explained simply.
Speculative decoding (SD) uses a tiny "draft" model to guess what a big model will say, then the big model checks all the guesses at once. Usually this works because the draft gets many guesses right. But when the draft is tiny (0.8B) and the target is huge (35B MoE), almost none of the guesses are correct (<4%). So why does it still speed things up by 18–30%?
The secret: checking 16 wrong answers at once is still faster than generating 1 answer at a time. The big model has to load all 35 billion parameters from memory for each answer. When it checks 16 draft tokens in one batch, it loads those weights once instead of 16 separate times. This "batch verification amortization" saves memory bandwidth — which is the real bottleneck for MoE models.
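The loop described above can be sketched in a few lines. This is a toy greedy-acceptance version in which `target` and `draft` are hypothetical single-token predictors standing in for real models; it shows the draft-propose / batch-verify structure, and preserves SD's key invariant that the output is identical to what the target alone would have produced.

```python
def speculative_decode(target, draft, prompt, gamma, rounds):
    """Toy speculative decoding with greedy acceptance.

    target, draft: callables mapping a token list to the next token
    (stand-ins for real models). gamma: draft tokens per round.
    """
    seq = list(prompt)
    for _ in range(rounds):
        # 1. Draft autoregressively proposes gamma cheap guesses.
        proposed = []
        for _ in range(gamma):
            proposed.append(draft(seq + proposed))
        # 2. Target checks all gamma positions in ONE batched pass
        #    (in a real engine this loads the target weights once,
        #    which is the amortization described above).
        accepted = []
        for tok in proposed:
            if target(seq + accepted) == tok:
                accepted.append(tok)                     # guess was right
            else:
                accepted.append(target(seq + accepted))  # target's correction
                break                                    # discard the rest
        seq += accepted
    return seq
```

Because every appended token equals what the target would emit given the same prefix, the output matches plain target-only decoding; only the wall-clock cost differs.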
Traditional SD relies on draft accuracy. On MoE, a different mechanism dominates.
Traditional SD theory says low acceptance means no speedup. Our MoE results break this rule. With <4% acceptance (essentially all drafts rejected), we still get a 1.30× speedup. The key insight: on memory-bandwidth-bound models (where weight loading, not computation, is the bottleneck), batch verification itself is the optimization, not the accepted tokens. This is a different speedup mechanism from the one textbooks describe.
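A back-of-envelope cost model makes the mechanism concrete. Every quantity below is an assumption for illustration, not a measurement: `t_target` is one single-token target step (roughly one full weight load when bandwidth-bound), `t_draft` one draft step, `verify_factor` the cost of one batched γ-token verification pass relative to a single-token step, and `p` the per-token acceptance probability.

```python
def sd_speedup(t_target, t_draft, gamma, p, verify_factor):
    """Expected SD speedup over plain autoregressive decoding.

    Expected tokens per round uses the standard geometric formula
    (accepted run plus one correction token). Round time is gamma
    draft steps plus one batched verification pass.
    """
    tokens_per_round = (1 - p ** (gamma + 1)) / (1 - p)
    round_time = gamma * t_draft + verify_factor * t_target
    return tokens_per_round * t_target / round_time

# Illustrative, made-up numbers: even with p near zero, SD can win
# whenever drafting is cheap and a batched verification pass costs
# less than gamma sequential single-token steps.
low_acceptance = sd_speedup(t_target=1.0, t_draft=0.005,
                            gamma=16, p=0.01, verify_factor=0.85)
```

The model also recovers the textbook regime: raising `p` toward 1 multiplies the tokens produced per round, which is where traditional SD gets its gains.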
1.18–1.30× speedup despite <4% acceptance rate. The speedup comes from batch verification amortizing memory bandwidth, not from accepted draft tokens.
SD speedup scales with total parameter footprint (memory bandwidth), not active compute. MoE with 35B total, 3B active gets more speedup than Dense 4B but less than Dense 9B. Dense 4B is the key control: its active params (4B) ≈ MoE's (3B), so the only differing variable is total params, isolating total footprint as the causal driver.
The 0.8B draft beats the 2B by 0.12–0.14× despite its lower acceptance rate. Since almost no drafts are accepted anyway, the cheapest drafter wins.
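Using the standalone throughputs from the baseline table (138.0 tok/s for the 0.8B, 110.4 tok/s for the 2B), the per-round drafting cost at γ=16 can be compared directly. This sketch deliberately ignores acceptance and verification, since almost nothing is accepted with either drafter.

```python
GAMMA = 16

# Per-round drafting time in seconds, from standalone decode speeds.
draft_08b = GAMMA / 138.0   # 0.8B drafter
draft_2b = GAMMA / 110.4    # 2B drafter

# Milliseconds saved per verification round by the smaller drafter.
saving_ms = (draft_2b - draft_08b) * 1000
```

Roughly 29 ms saved per round, which compounds over a long generation and outweighs the 2B's slightly better (but still near-zero) acceptance.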
γ=16 beats γ=4 even though acceptance drops from 2.6% to 0.2%. Batch verification cost grows sub-linearly: verifying 16 tokens costs only ~1.5× as much as verifying 4.
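The sub-linear claim can be turned into per-token numbers. Taking the ~1.5× cost ratio stated above at face value (verifying 16 tokens costs about 1.5× as much as verifying 4), the verification cost per drafted token falls sharply at larger γ:

```python
# Cost of one gamma=4 verification pass, in arbitrary units.
cost_4 = 1.0
# Per the ~1.5x ratio above, gamma=16 costs only 1.5x as much.
cost_16 = 1.5 * cost_4

per_token_4 = cost_4 / 4      # 0.25 units per drafted token
per_token_16 = cost_16 / 16   # ~0.094 units per drafted token

reduction = per_token_4 / per_token_16  # ~2.7x cheaper per token
```

So even though almost every extra token is rejected, each rejected token at γ=16 is roughly 2.7× cheaper to check than at γ=4.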
306 runs across 5 experiment suites on Apple M2 Ultra (192 GB, 800 GB/s).
| Model | Active Params | Total Params | tok/s | Note |
|---|---|---|---|---|
| Dense 0.8B | 0.8B | 0.8B | 138.0 | Draft model (fastest) |
| Dense 2B | 2.0B | 2.0B | 110.4 | Larger draft |
| ★ MoE Q4 (35B-A3B) | 3.0B | 35B | 55.3 | 3B active but 35B in memory |
| ★ MoE Q8 (35B-A3B) | 3.0B | 35B | 49.9 | Higher precision, larger file |
| Dense 4B | 4.0B | 4.0B | 67.2 | More active params than MoE, yet faster |
| Dense 9B | 9.0B | 9.0B | 33.4 | Most bandwidth-bound |
The MoE model has only 3B active parameters, fewer than Dense 4B. So it should be faster, right? Wrong. MoE runs at 55 tok/s while Dense 4B runs at 67 tok/s. The reason: MoE has 35B total parameters sitting in memory, and the chip has to load all of them for each token, even though only 3B are used. This is the "MoE tax": you pay for the whole building even if you only use one office.
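The tax is easy to quantify in bytes moved per token, under two stated assumptions: (a) roughly 0.57 bytes per parameter for Q4_K_M, a common rough figure rather than something measured here, and (b) the full expert set streaming from memory each step, as the paragraph above argues.

```python
BYTES_PER_PARAM_Q4 = 0.57  # rough Q4_K_M average; an assumption

moe_total = 35e9 * BYTES_PER_PARAM_Q4   # ~20 GB per token if all experts stream
moe_active = 3e9 * BYTES_PER_PARAM_Q4   # ~1.7 GB if only active experts counted
dense_4b = 4e9 * BYTES_PER_PARAM_Q4     # ~2.3 GB for the Dense 4B control
```

Under the full-footprint assumption, the MoE moves roughly 9× the bytes of Dense 4B per token despite having fewer active parameters, which is consistent with it decoding slower.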
| Target | γ | tok/s | Speedup | Acceptance |
|---|---|---|---|---|
| MoE Q4 | 4 | 63.4 | 1.15× | 2.6% |
| MoE Q4 | 8 | 65.2 | 1.18× | 1.1% |
| ★ MoE Q4 | 16 | 69.7 | 1.26× | 0.2% |
| MoE Q8 | 4 | 57.5 | 1.15× | 2.7% |
| MoE Q8 | 8 | 60.9 | 1.22× | 1.0% |
| ★ MoE Q8 | 16 | 64.8 | 1.30× | 0.2% |
| Dense 4B | 4 | 70.9 | 1.05× | 3.0% |
| Dense 4B | 8 | 65.9 | 0.98× | 1.1% |
| Dense 4B | 16 | 75.1 | 1.12× | 0.4% |
| Dense 9B | 4 | 32.8 | 0.98× | 2.3% |
| Dense 9B | 8 | 55.0 | 1.64× | 0.9% |
| ★ Dense 9B | 16 | 67.7 | 2.03× | 0.4% |
| Target | γ | 0.8B Speedup | 0.8B Acceptance | 2B Speedup | 2B Acceptance | Winner |
|---|---|---|---|---|---|---|
| MoE Q4 | 4 | 1.15× | 2.6% | 1.03× | 4.0% | 0.8B ✓ |
| MoE Q4 | 8 | 1.18× | 1.1% | 1.05× | 1.7% | 0.8B ✓ |
| MoE Q4 | 16 | 1.26× | 0.2% | 1.12× | 0.9% | 0.8B ✓ |
| MoE Q8 | 4 | 1.15× | 2.7% | 1.03× | 3.6% | 0.8B ✓ |
| MoE Q8 | 8 | 1.22× | 1.0% | 1.09× | 2.0% | 0.8B ✓ |
| MoE Q8 | 16 | 1.30× | 0.2% | 1.16× | 0.9% | 0.8B ✓ |
Why compare with Dense 4B instead of Dense 9B? Dense 4B has 4B active params, close to MoE's 3B. If speedup depended only on active compute, the two should behave similarly. But MoE gets the higher speedup; the only remaining explanation is its larger total parameter footprint (35B vs 4B). Dense 9B differs from MoE in both active and total params, so it cannot isolate which factor matters. We therefore use Dense 9B as a bandwidth upper-bound reference and Dense 4B as the controlled causal comparison.
| Component | Specification |
|---|---|
| Chip | Apple M2 Ultra |
| Unified Memory | 192 GB (LPDDR5) |
| Memory Bandwidth | 800 GB/s |
| GPU Cores | 76 (Metal) |
| CPU Cores | 24 (16P + 8E) |
| Framework | llama.cpp v8240 / v8280 |
| Target Model | Qwen3.5-35B-A3B (Q4_K_M / Q8_0) |
| Draft Models | Qwen3.5-0.8B (Q8), Qwen3.5-2B (Q8) |
| Total Runs | 306 (5 experiment suites) |
| Duration | ~41 minutes |
@misc{atomgradient2026sdmoe,
title = {Does Speculative Decoding Help Mixture-of-Experts?
An Empirical Study on Qwen3.5-35B-A3B},
author = {AtomGradient},
year = {2026},
url = {https://github.com/AtomGradient/speculative-moe-research}
}