An Empirical Study · 306 Runs · Qwen3.5-35B-A3B · Apple M2 Ultra
The key finding, explained simply.
Speculative decoding (SD) uses a tiny "draft" model to guess what a big model will say, then the big model checks all the guesses at once. Usually this works because the draft gets many guesses right. But when the draft is tiny (0.8B) and the target is huge (35B MoE), almost none of the guesses are correct (<4%). So why does it still speed things up by 18–30%?
The secret: checking 16 wrong answers at once is still faster than generating one token at a time. The big model has to load all 35 billion parameters from memory for every token it generates. When it verifies 16 draft tokens in one batch, it loads those weights once instead of 16 separate times. This "batch verification amortization" saves memory bandwidth, which is the real bottleneck for MoE models.
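The amortization argument can be sketched with a toy cost model for a bandwidth-bound decoder. All constants below are illustrative assumptions, not measurements:

```python
# Toy cost model for a memory-bandwidth-bound decoder.
# All numbers are hypothetical illustrations, not measurements.

WEIGHT_BYTES = 20e9  # ~35B params at ~4-bit quantization (assumed)
BANDWIDTH = 800e9    # bytes/s (M2 Ultra peak)

def forward_time(batch_tokens: int, compute_per_token: float = 1e-4) -> float:
    """One target forward pass: load the resident weights once, plus a
    small per-token compute term. Weight loading dominates."""
    return WEIGHT_BYTES / BANDWIDTH + batch_tokens * compute_per_token

# Generating 16 tokens one at a time: 16 full weight loads.
sequential = 16 * forward_time(1)
# Verifying 16 draft tokens in one batch: a single weight load.
batched = forward_time(16)

print(sequential, batched)  # the batched pass is far cheaper per position
```

Under this model the batched verification pass costs a small fraction of 16 sequential decode steps, which is the amortization the finding describes.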
Traditional SD relies on draft accuracy. On MoE, a different mechanism dominates.
Traditional SD theory says: low acceptance rate = no speedup. Our MoE results break this rule. With <4% acceptance (essentially all drafts rejected), we still get 1.30× speedup. The key insight is that on memory-bandwidth-bound models (where weight loading is the bottleneck, not computation), batch verification itself is the optimization, not the accepted tokens. This is a different speedup mechanism from what textbooks describe.
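Under the standard speculative-decoding analysis, a verification cycle with per-token acceptance probability α and draft length γ yields an expected 1 + α + α² + … + α^γ tokens. A quick check (a sketch; α = 0.04 approximates the acceptance rates observed here) shows accepted tokens add almost nothing:

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Classical speculative-decoding expectation (i.i.d. acceptance):
    1 + alpha + alpha^2 + ... + alpha^gamma."""
    return sum(alpha ** k for k in range(gamma + 1))

# At ~4% acceptance, each cycle yields only ~1.04 tokens, so the
# 1.18-1.30x speedup cannot come from accepted drafts.
print(expected_tokens_per_cycle(0.04, 16))
```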
1.18–1.30× speedup despite <4% acceptance rate. The speedup comes from batch verification amortizing memory bandwidth, not from accepted draft tokens.
SD speedup scales with total parameter footprint (memory bandwidth), not active compute. MoE with 35B total, 3B active gets more speedup than Dense 4B but less than Dense 9B.
The 0.8B draft beats 2B by 0.12–0.14× despite lower accuracy. Since almost no drafts are accepted anyway, the cheapest drafter wins.
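One way to see why the cheapest drafter wins: with near-zero acceptance, a cycle's cost is roughly the time to draft γ tokens plus one verification pass, and the verify term is identical for both drafters. A rough comparison, using the standalone throughputs from the baseline table as a proxy for in-loop drafting speed (an assumption):

```python
# With near-zero acceptance, cycle cost ~= gamma / draft_tps + verify_time.
# The verify term is the same for both drafters and cancels out, so only
# drafting speed matters; draft accuracy is irrelevant here.

GAMMA = 16
draft_08b = GAMMA / 138.0   # seconds drafting with the 0.8B model
draft_2b = GAMMA / 110.4    # seconds drafting with the 2B model

print(draft_08b, draft_2b)  # the 0.8B drafter spends less time per cycle
```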
γ=16 beats γ=4 even though acceptance drops from 2.6% to 0.2%. Batch verification cost grows sub-linearly: verifying 16 tokens costs only ~1.5× as much as verifying 4.
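The ~1.5× figure is consistent with a verification cost made of a fixed weight-load term plus a small per-token compute term. The split below is a hypothetical illustration, not a measured breakdown:

```python
# Verify cost = fixed weight-load term + linear per-token compute term.
# These constants are illustrative, chosen to land near the ~1.5x
# ratio reported above; they are not measured values.

T_WEIGHTS = 0.020   # s: load all resident weights once per pass
T_TOKEN = 0.0008    # s: marginal compute per verified token

def verify_cost(gamma: int) -> float:
    return T_WEIGHTS + gamma * T_TOKEN

ratio = verify_cost(16) / verify_cost(4)
print(ratio)  # ~1.41 with these constants: strongly sub-linear in gamma
```

Because the fixed weight-load term dominates, quadrupling γ raises verify cost far less than 4×, which is why larger γ keeps winning even as acceptance collapses.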
306 runs across 5 experiment suites on Apple M2 Ultra (192 GB, 800 GB/s).
| Model | Active Params | Total Params | tok/s | Notes |
|---|---|---|---|---|
| Dense 0.8B | 0.8B | 0.8B | 138.0 | Draft model (fastest) |
| Dense 2B | 2.0B | 2.0B | 110.4 | Larger draft |
| ★ MoE Q4 (35B-A3B) | 3.0B | 35B | 55.3 | 3B active but 35B in memory |
| ★ MoE Q8 (35B-A3B) | 3.0B | 35B | 49.9 | Higher precision, larger file |
| Dense 4B | 4.0B | 4.0B | 67.2 | More active params than MoE, but faster! |
| Dense 9B | 9.0B | 9.0B | 33.4 | Most bandwidth-bound |
The MoE model has only 3B active parameters, less than the Dense 4B. So it should be faster, right? Wrong. MoE runs at 55 tok/s while Dense 4B runs at 67 tok/s. That's because MoE has 35B total parameters sitting in memory, and the chip has to load all of them for each token, even though only 3B are used. This is the "MoE tax": you pay for the whole building even if you only use one office.
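The "MoE tax" can be framed as a roofline-style ceiling: decode throughput is bounded by memory bandwidth divided by the bytes read per token. The sketch below assumes every resident weight is read per token and uses a rough ~0.6 bytes/param for Q4-class quantization; it illustrates the trend, not the exact measured numbers:

```python
# Roofline-style upper bound for a bandwidth-bound decoder:
#   tok/s <= bandwidth / bytes_read_per_token
# The bytes-per-param figure is a rough quantization estimate, and the
# model assumes all resident weights are touched per token - this shows
# the trend, not a prediction of the measured throughputs.

BANDWIDTH = 800e9  # bytes/s (M2 Ultra peak)

def ceiling_tok_s(total_params: float, bytes_per_param: float) -> float:
    return BANDWIDTH / (total_params * bytes_per_param)

dense_4b = ceiling_tok_s(4e9, 0.6)    # only 4B resident
moe_35b = ceiling_tok_s(35e9, 0.6)    # 3B active, but all 35B resident

print(dense_4b, moe_35b)  # the 35B-resident MoE has the lower ceiling
```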
| Target | γ | tok/s | Speedup | Acceptance |
|---|---|---|---|---|
| MoE Q4 | 4 | 63.4 | 1.15× | 2.6% |
| MoE Q4 | 8 | 65.2 | 1.18× | 1.1% |
| ★ MoE Q4 | 16 | 69.7 | 1.26× | 0.2% |
| MoE Q8 | 4 | 57.5 | 1.15× | 2.7% |
| MoE Q8 | 8 | 60.9 | 1.22× | 1.0% |
| ★ MoE Q8 | 16 | 64.8 | 1.30× | 0.2% |
| Dense 4B | 4 | 70.9 | 1.05× | 3.0% |
| Dense 4B | 8 | 65.9 | 0.98× | 1.1% |
| Dense 4B | 16 | 75.1 | 1.12× | 0.4% |
| Dense 9B | 4 | 32.8 | 0.98× | 2.3% |
| Dense 9B | 8 | 55.0 | 1.64× | 0.9% |
| ★ Dense 9B | 16 | 67.7 | 2.03× | 0.4% |
| Target | γ | 0.8B Speedup | 0.8B Acceptance | 2B Speedup | 2B Acceptance | Winner |
|---|---|---|---|---|---|---|
| MoE Q4 | 4 | 1.15× | 2.6% | 1.03× | 4.0% | 0.8B ✓ |
| MoE Q4 | 8 | 1.18× | 1.1% | 1.05× | 1.7% | 0.8B ✓ |
| MoE Q4 | 16 | 1.26× | 0.2% | 1.12× | 0.9% | 0.8B ✓ |
| MoE Q8 | 4 | 1.15× | 2.7% | 1.03× | 3.6% | 0.8B ✓ |
| MoE Q8 | 8 | 1.22× | 1.0% | 1.09× | 2.0% | 0.8B ✓ |
| MoE Q8 | 16 | 1.30× | 0.2% | 1.16× | 0.9% | 0.8B ✓ |
| Component | Specification |
|---|---|
| Chip | Apple M2 Ultra |
| Unified Memory | 192 GB (LPDDR5) |
| Memory Bandwidth | 800 GB/s |
| GPU Cores | 76 (Metal) |
| CPU Cores | 24 (16P + 8E) |
| Framework | llama.cpp v8240 / v8280 |
| Target Model | Qwen3.5-35B-A3B (Q4_K_M / Q8_0) |
| Draft Models | Qwen3.5-0.8B (Q8), Qwen3.5-2B (Q8) |
| Total Runs | 306 (5 experiment suites) |
| Duration | ~41 minutes |
@misc{atomgradient2026sdmoe,
title = {Does Speculative Decoding Help Mixture-of-Experts?
An Empirical Study on Qwen3.5-35B-A3B},
author = {AtomGradient},
year = {2026},
url = {https://github.com/AtomGradient/speculative-moe-research}
}