From quantization ladders to speculative decoding: a systematic empirical study across three Apple Silicon machines.
Six results from systematic benchmarking across three Apple Silicon machines, with plain-language takeaways.
LLM token generation is memory-bound, not compute-bound: the M2 Ultra (800 GB/s) runs ~3.3× faster than the M2 Pro (200 GB/s) on the same model. Quantization shrinks the model, so fewer bytes are loaded per token and generation gets faster.
Q8_0 is near-lossless (+0.18% PPL). Q6_K ★ dominates Q5_K_M on both speed and quality; versus F16 it is 1.68× faster and 59% smaller, with only 0.54% quality loss. Q4_K_M suits memory-constrained setups. Sub-4-bit is unusable (Q2_K: +267% PPL).
Speculative decoding (SD) benefit on Apple Silicon is governed by the draft/target speed ratio. A 3.3× ratio yields +25.7% throughput even at a 2–4% acceptance rate, thanks to efficient batch verification on the Metal GPU. Rule of thumb: the draft must run ≥2.5× faster than the target.
Same-model, different-quant SD (a 4B Q2 draft for a 4B Q8 target) achieves only a 1.35× speed ratio, causing a ~25% throughput loss; a ≥2.5× ratio is needed just to break even.
Qwen3.5-35B-A3B (MoE) reaches 55.2 tok/s (only ~3B active parameters per token). Its sparse expert routing makes draft predictions unreliable, so SD offers no benefit.
Gemma-3-4B QAT spans 96→137 tok/s (8→3 bit) with σ < 0.3 tok/s, while PTQ Qwen3-8B shows σ ≈ 19 tok/s of variance at low bit-widths. This is a cross-model comparison only; the architectures differ.
GGML_RPC cross-device SD incurs a ~79–83% throughput reduction from per-op RPC protocol overhead (~51–57 ms/token), not from network bandwidth. Initial BLAS-build results (−2%) were invalid due to a silent local fallback.
Primary experiments ran on an M2 Ultra (192 GB); the hardware comparison covers the M2 Ultra, M1 Max, and M2 Pro.
| Quant | TG (tok/s) | PP (tok/s) | Size (GB) | TG Speedup | PPL | ΔPPL (%) |
|---|---|---|---|---|---|---|
| F16 | 36.8 | 1931.6 | 8.42 | 1.00× | 11.055 | — |
| Q8_0 | 54.0 | 1373.6 | 4.48 | 1.47× | 11.075 | 0.18% |
| Q6_K ★ | 62.0 | 1358.9 | 3.46 | 1.68× | 11.115 | 0.54% |
| Q5_K_M | 56.6 | 1276.1 | 3.11 | 1.54× | 11.248 | 1.74% |
| Q4_K_M | 58.5 | 1337.6 | 2.71 | 1.59× | 11.504 | 4.07% |
| Q3_K_M | 68.8 | 1611.9 | 2.26 | 1.87× | 12.668 | 14.6% |
| Q2_K | 72.8 | 1700.9 | 1.80 | 1.98× | 40.602 | 267% |
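As a sanity check on the memory-bound claim, here is a roofline-style sketch (ours, not part of the benchmark harness): if generating one token streams the full weight file once, peak bandwidth divided by model size gives an upper bound on TG.

```python
# Roofline sketch: if generating a token streams the full weight file once,
# TG is bounded above by memory bandwidth / model size.
def tg_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tokens/s for a purely memory-bound decoder."""
    return bandwidth_gbs / model_gb

# M2 Ultra (800 GB/s) against three rungs of the ladder above:
for quant, size_gb, measured in [("F16", 8.42, 36.8), ("Q6_K", 3.46, 62.0), ("Q2_K", 1.80, 72.8)]:
    print(f"{quant}: ceiling {tg_ceiling(800, size_gb):.0f} tok/s, measured {measured} tok/s")
```

Measured TG sits well below these ceilings, and the gap widens at low bit-widths, where costs other than raw weight streaming start to dominate; the bound explains the trend, not the absolute numbers.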
| Draft Model | k | Speed Ratio | Avg Accept (%) | Avg TPS | vs Baseline |
|---|---|---|---|---|---|
| 0.8B Q8 ★ | 4 | 3.31× | 3.3% | 53.3 | +25.7% |
| 0.8B Q8 | 8 | 3.31× | 1.3% | 52.7 | +24.3% |
| 2B Q8 | 4 | 2.61× | 3.4% | 48.5 | +14.3% |
| 2B Q8 | 8 | 2.61× | 1.4% | 47.8 | +12.7% |
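The ≥2.5× rule can be checked mechanically from plain throughput measurements before pairing any two models. A minimal sketch (the helper name is ours; the numbers are the 0.8B Q8 draft's 140.2 tok/s and the target's 42.4 tok/s baseline from the tables in this post):

```python
def sd_viable(draft_tps: float, target_tps: float, min_ratio: float = 2.5) -> bool:
    """Empirical rule from the experiments above: the draft must decode
    at least min_ratio times faster than the target to be worth pairing."""
    return draft_tps / target_tps >= min_ratio

# 0.8B Q8 draft (140.2 tok/s local) against the 42.4 tok/s target baseline:
print(sd_viable(140.2, 42.4))  # ratio 3.31x -> True
```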
| Draft Backend | Avg TPS | vs Local | Draft tok/s via RPC |
|---|---|---|---|
| Local (M2 Ultra) | 53.8 | baseline | 140.2 (local) |
| M1 Max (1 Gbps RPC) | 11.2 | −79.2% | 16.0 (vs 86.6 local) |
| M2 Pro (1 Gbps RPC) | 9.4 | −82.6% | 14.2 (vs 62.9 local) |
Initial BLAS-build results showed only −2% overhead but were invalid: a BLAS/Metal backend conflict caused a silent fallback to local execution (confirmed by acceptance rates identical to the local baseline). A no-BLAS build correctly routes the draft to the remote Metal GPU. The overhead comes from per-op GGML_RPC protocol calls (~51–57 ms/token), not network bandwidth.
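The per-token overhead figure falls straight out of the table: convert each draft speed to ms/token and subtract the local figure. A quick sketch using the measured values above:

```python
def per_token_overhead_ms(local_tps: float, rpc_tps: float) -> float:
    """Extra milliseconds per draft token when the draft runs over GGML_RPC."""
    return 1000 / rpc_tps - 1000 / local_tps

# From the table: M1 Max 86.6 -> 16.0 tok/s, M2 Pro 62.9 -> 14.2 tok/s
print(round(per_token_overhead_ms(86.6, 16.0)))  # 51 ms
print(round(per_token_overhead_ms(62.9, 14.2)))  # 55 ms
```

Both values land in the ~51–57 ms/token range, and neither scales with payload size, which is what points at per-op protocol round-trips rather than bandwidth.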
| Machine | Mem BW | TG (tok/s) | PP (tok/s) | BW Util |
|---|---|---|---|---|
| M2 Ultra (192 GB) | 800 GB/s | 42.4 | 1163.9 | 64% |
| M1 Max (32 GB) | 400 GB/s | 21.8 | 483.7 | 66% |
| M2 Pro (32 GB) | 200 GB/s | 12.7 | 321.3 | 77% |
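Bandwidth utilization here is just measured TG × bytes moved per token ÷ peak bandwidth. All three rows are consistent with ~12.1 GB of traffic per token (weights plus KV-cache reads); that constant is inferred from the table, not measured separately. A sketch:

```python
def bw_utilization(tg_tps: float, bytes_per_token_gb: float, peak_gbs: float) -> float:
    """Fraction of peak memory bandwidth sustained during token generation."""
    return tg_tps * bytes_per_token_gb / peak_gbs

# ~12.1 GB/token is implied consistently by all three rows of the table:
for name, tg, bw in [("M2 Ultra", 42.4, 800), ("M1 Max", 21.8, 400), ("M2 Pro", 12.7, 200)]:
    print(f"{name}: {bw_utilization(tg, 12.1, bw):.0%}")
```

The slower machines sustain a larger fraction of their peak, which matches the pattern that smaller SoCs are easier to saturate.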
| Target | Draft | Ratio | Avg TPS | Baseline | vs Baseline |
|---|---|---|---|---|---|
| 4B Q8 | 4B Q2 | 1.35× | 40.4 | 54.0 | −25.2% |
| 4B Q8 | 4B Q3 | 1.27× | 39.7 | 54.0 | −26.5% |
| 4B Q8 | 4B Q4 | 1.08× | 40.4 | 54.0 | −25.2% |
| 4B Q8 | 4B Q6 | 1.15× | 40.1 | 54.0 | −25.7% |
| 35B MoE Q4 | 0.8B Q8 | 2.54× | 53.7 | 55.2 | −2.8% |
| 35B MoE Q4 | 4B Q4 | 1.06× | 37.7 | 55.2 | −31.7% |
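Why same-model-different-quant pairs fail is visible directly in the quantization ladder: the speed ratio is just draft TG ÷ target TG, and no quant of the same 4B model comes close to 2.5× its own Q8_0. A sketch recomputing the table's ratios from the ladder's TG figures:

```python
# TG figures from the quantization ladder above (same 4B model, M2 Ultra).
Q8_TPS = 54.0
draft_tps = {"Q2_K": 72.8, "Q3_K_M": 68.8, "Q4_K_M": 58.5, "Q6_K": 62.0}

for quant, tps in draft_tps.items():
    ratio = tps / Q8_TPS
    verdict = "viable" if ratio >= 2.5 else "below break-even"
    print(f"{quant} drafting for Q8_0: {ratio:.2f}x ({verdict})")
```

Even the most aggressive quant (Q2_K) tops out at 1.35×, so every same-model pairing loses throughput regardless of acceptance rate.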
Three experiment groups: the quantization ladder, QAT vs PTQ, and speculative-decoding variants.
Three Apple Silicon machines: one primary (M2 Ultra) plus two remote draft backends (M1 Max, M2 Pro).