📋 March 2026 · M2 Ultra · M1 Max · M2 Pro · llama.cpp + MLX

Apple Silicon LLM Inference
Quantization, Speculative Decoding & Cross-Device Benchmarks

From quantization ladders to speculative decoding: a systematic empirical study across three Apple Silicon machines.

+25.7%
Speculative decoding speedup (0.8B→9B)
0.18%
PPL degradation at Q8_0 (near-lossless)
2.5×
Minimum draft/target speed ratio for SD benefit
~79%
GGML_RPC cross-device SD overhead (RPC protocol, not network)
↓ Download Paper (PDF) · GitHub

Key Findings

Six results from systematic benchmarking across three Apple Silicon machines, with plain-language takeaways.

📊

Memory Bandwidth Drives Everything

LLM token generation is memory-bound, not compute-bound. M2 Ultra (800 GB/s) runs ~3.3× faster than M2 Pro (200 GB/s) on the same model. Quantization cuts model size → fewer bytes loaded per token → faster generation.

In plain terms
Think of the GPU as a factory and memory bandwidth as the conveyor belt feeding it. Halve the model size and you roughly halve the load time per token, regardless of how fast the factory is.
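A quick back-of-envelope check of the memory-bound claim, using only numbers from the three-machine comparison table (an illustrative sketch, not the paper's analysis):

```python
# If generation is memory-bound, per-token time is roughly
# (bytes streamed per token) / (memory bandwidth), so the cross-machine
# TG ratio should track the bandwidth ratio.
# Figures are from the hardware-comparison table (Qwen3.5-9B Q8_0).

machines = {
    # name: (memory bandwidth GB/s, measured TG tok/s)
    "M2 Ultra": (800, 42.4),
    "M1 Max":   (400, 21.8),
    "M2 Pro":   (200, 12.7),
}

bw_ratio = machines["M2 Ultra"][0] / machines["M2 Pro"][0]
tg_ratio = machines["M2 Ultra"][1] / machines["M2 Pro"][1]
print(f"bandwidth ratio {bw_ratio:.1f}x vs measured TG ratio {tg_ratio:.1f}x")
# A 4x bandwidth gap yields a ~3.3x speed gap; the shortfall is
# non-bandwidth overhead (kernel launches, attention compute, etc.).

# Implied data moved per generated token, if bandwidth were the sole limit:
for name, (bw, tg) in machines.items():
    print(f"{name}: ~{bw / tg:.1f} GB per token at 100% utilization")
```

The per-token figures come out similar across all three machines, which is what a memory-bound workload predicts.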
📈

Q6_K is Pareto-Optimal

Q8_0 is near-lossless (+0.18% PPL). Q6_K ★ dominates Q5_K_M on both speed and quality; versus F16 it is 1.68× faster and 59% smaller, with only 0.54% quality loss. Q4_K_M suits memory-constrained setups. Sub-4-bit is unusable (Q2_K: +267% PPL).

Practical guide
Q8_0 — memory plentiful, near lossless  ·  Q6_K ★ — best balance  ·  Q4_K_M — tight memory  ·  Q2_K — avoid (output degenerates)

Speed Ratio > Acceptance Rate for SD

SD benefit on Apple Silicon is governed by the draft/target speed ratio. A 3.3× ratio yields +25.7% throughput even at a 2–4% acceptance rate, thanks to Metal GPU batch-verification efficiency. Rule of thumb: the draft must run ≥2.5× faster than the target.

In plain terms
The Metal GPU verifies a batch of draft tokens almost as fast as one, so a fast draft model wins even with low acceptance. 0.8B draft (140 tok/s) + 9B target (42 tok/s) = 3.3× ratio → +25.7%.
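The batch-verification point can be made concrete with the M2 Ultra numbers from the benchmark section (a rough sketch; real per-batch cost depends on batch size and kernel behavior):

```python
# The asymmetry behind speculative decoding on Metal: batched prompt
# processing (PP) amortizes the weight streaming that dominates
# one-at-a-time token generation (TG).
# Numbers from the M2 Ultra row of the 9B Q8 benchmark table.

tg = 42.4     # tok/s, sequential generation (memory-bound)
pp = 1163.9   # tok/s, batched prompt processing

ratio = pp / tg
print(f"per-token cost in a large batch is ~{ratio:.0f}x cheaper than TG")

# A small verification batch of k+1 tokens still streams the full weight
# set once, so its wall-clock cost stays close to ONE TG step: the target
# checks k draft tokens for roughly the price of generating one token.
k = 4
approx_verify_ms = 1 / tg * 1e3
print(f"verifying {k}+1 draft tokens: ~{approx_verify_ms:.1f} ms, "
      f"same order as one generated token")
```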

Self-Speculation Fails on Unified Memory

Same-model, different-quant SD (4B Q2 drafting for Q8) achieves only a 1.35× speed ratio, causing a ~25% throughput loss. It requires ≥2.5× to break even.

In plain terms
The draft and target model share the same memory bus, so they can't outrun each other enough to make speculation worthwhile.
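A simple textbook-style cost model (an illustration, not the paper's exact accounting) shows why the speed ratio dominates:

```python
# Per speculation cycle: draft k tokens at speed S_d, then run one
# batched verify pass costing about one target step (1/S_t, per the
# memory-bound argument). To beat plain decoding, the expected tokens
# produced per cycle must cover the cycle's cost measured in baseline
# tokens: E >= k/r + 1, where r = S_d / S_t is the speed ratio.

def tokens_needed_per_cycle(k: int, ratio: float) -> float:
    """Tokens a cycle must yield to match plain autoregressive decoding."""
    return k / ratio + 1

k = 4
for label, ratio in [("0.8B draft vs 9B target", 3.3),
                     ("self-speculation 4B Q2 vs 4B Q8", 1.35)]:
    need = tokens_needed_per_cycle(k, ratio)
    # Each cycle always yields 1 verified/corrected token, so the
    # accepted-draft count required is need - 1.
    print(f"{label}: need {need:.2f} tokens/cycle "
          f"({need - 1:.2f} of {k} drafts accepted just to break even)")
```

At a 1.35× ratio the draft must land roughly 3 of 4 tokens per cycle before SD gains anything, which matches the ~25% loss observed for self-speculation.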
🆕

MoE + SD Conflict

Qwen3.5-35B-A3B MoE achieves 55.2 tok/s (only ~3B active parameters per token). Its sparse expert routing makes draft predictions unreliable; SD offers no benefit.

In plain terms
The 35B MoE already runs at small-model speeds because it activates so few parameters per token. There is no speed headroom left for a draft model to exploit.
💾

QAT Shows Smooth Gradients vs PTQ

Gemma-3-4B QAT throughput spans 96→137 tok/s from 8-bit down to 3-bit, with σ < 0.3 tok/s. PTQ Qwen3-8B shows σ ≈ 19 tok/s variance at low bit-widths (a cross-model comparison only; the architectures differ).

In plain terms
QAT (quantization-aware training) learns to tolerate quantization during training, producing stable, predictable performance. PTQ (quantizing after training) can be erratic at low bit-widths.

Cross-Device SD: GGML_RPC Not Viable

GGML_RPC cross-device SD incurs a ~79–83% throughput reduction from per-op RPC protocol overhead (~51–57 ms/token), not network bandwidth. Initial BLAS-build results (−2%) were invalid due to a silent local fallback.

In plain terms
The bottleneck is protocol design, not the network. GGML_RPC makes on the order of hundreds of round-trips per token. A streaming protocol that batches the whole forward pass into one call would fix this.
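The per-token protocol overhead can be recovered directly from the cross-device table's own numbers:

```python
# Compare the draft model's per-token latency locally vs through
# GGML_RPC, using the cross-device SD table on this page.

cases = {
    # backend: (draft tok/s via RPC, draft tok/s running locally)
    "M1 Max": (16.0, 86.6),
    "M2 Pro": (14.2, 62.9),
}

for name, (rpc_tps, local_tps) in cases.items():
    overhead_ms = (1 / rpc_tps - 1 / local_tps) * 1e3
    print(f"{name}: {overhead_ms:.1f} ms/token of added RPC latency")

# ~51 ms (M1 Max) and ~55 ms (M2 Pro) of per-token protocol overhead.
# At 1 Gbps, shipping a few KB of tokens/logits would take well under
# 1 ms, so network bandwidth cannot explain the gap; per-op round-trips
# in the RPC protocol do.
```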

Benchmark Results

Primary experiments run on the M2 Ultra 192 GB; the hardware comparison covers M2 Ultra, M1 Max, and M2 Pro.

Qwen3.5-4B: Quantization Throughput & Quality

| Quant | TG (tok/s) | PP (tok/s) | Size (GB) | TG Speedup | PPL | ΔPPL |
|---|---|---|---|---|---|---|
| F16 | 36.8 | 1931.6 | 8.42 | 1.00× | 11.055 | baseline |
| Q8_0 | 54.0 | 1373.6 | 4.48 | 1.47× | 11.075 | +0.18% |
| Q6_K ★ | 62.0 | 1358.9 | 3.46 | 1.68× | 11.115 | +0.54% |
| Q5_K_M | 56.6 | 1276.1 | 3.11 | 1.54× | 11.248 | +1.74% |
| Q4_K_M | 58.5 | 1337.6 | 2.71 | 1.59× | 11.504 | +4.07% |
| Q3_K_M | 68.8 | 1611.9 | 2.26 | 1.87× | 12.668 | +14.6% |
| Q2_K | 72.8 | 1700.9 | 1.80 | 1.98× | 40.602 | +267% |
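A minimal Pareto-frontier check over the table above (throughput vs. perplexity loss) confirms which quants are dominated:

```python
# A quant is dominated if another quant is at least as fast AND has no
# worse perplexity degradation. Data from the Qwen3.5-4B table above
# (F16 is the ΔPPL baseline, taken as 0.00).

rows = [
    # (name, TG tok/s, ΔPPL %)
    ("F16",    36.8, 0.00),
    ("Q8_0",   54.0, 0.18),
    ("Q6_K",   62.0, 0.54),
    ("Q5_K_M", 56.6, 1.74),
    ("Q4_K_M", 58.5, 4.07),
    ("Q3_K_M", 68.8, 14.6),
    ("Q2_K",   72.8, 267.0),
]

def dominated(a, others):
    return any(o[1] >= a[1] and o[2] <= a[2] and o != a for o in others)

frontier = [r[0] for r in rows if not dominated(r, rows)]
print("Pareto frontier:", frontier)
# → ['F16', 'Q8_0', 'Q6_K', 'Q3_K_M', 'Q2_K']
# Q5_K_M and Q4_K_M drop out: Q6_K is faster than both with lower ΔPPL.
# Q2_K stays only as the speed-extreme corner; at +267% PPL it is
# unusable in practice, as the findings above note.
```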

Speculative Decoding: 9B Q8 Target (baseline 42.4 tok/s)

| Draft Model | k | Speed Ratio | Avg Accept % | Avg TPS | vs Baseline |
|---|---|---|---|---|---|
| 0.8B Q8 ★ | 4 | 3.31× | 3.3% | 53.3 | +25.7% |
| 0.8B Q8 | 8 | 3.31× | 1.3% | 52.7 | +24.3% |
| 2B Q8 | 4 | 2.61× | 3.4% | 48.5 | +14.3% |
| 2B Q8 | 8 | 2.61× | 1.4% | 47.8 | +12.7% |

Cross-Device SD via GGML_RPC (9B Q8 target on M2 Ultra, 0.8B Q8 draft, k=4, no-BLAS build)

| Draft Backend | Avg TPS | vs Local | Draft tok/s via RPC |
|---|---|---|---|
| Local (M2 Ultra) | 53.8 | baseline | 140.2 (local) |
| M1 Max (1 Gbps RPC) | 11.2 | −79.2% | 16.0 (vs 86.6 local) |
| M2 Pro (1 Gbps RPC) | 9.4 | −82.6% | 14.2 (vs 62.9 local) |

Initial BLAS-build results showed −2% overhead but were invalid: a BLAS/Metal backend conflict caused a silent local fallback (confirmed by acceptance rates identical to the local baseline). The no-BLAS build correctly routes the draft to the remote Metal GPU. The overhead comes from per-op GGML_RPC protocol calls (~51–57 ms/token), not network bandwidth.

Three-Machine Hardware Comparison (Qwen3.5-9B Q8_0)

| Machine | Mem BW | TG (tok/s) | PP (tok/s) | BW Util |
|---|---|---|---|---|
| M2 Ultra (192 GB) | 800 GB/s | 42.4 | 1163.9 | 64% |
| M1 Max (32 GB) | 400 GB/s | 21.8 | 483.7 | 66% |
| M2 Pro (32 GB) | 200 GB/s | 12.7 | 321.3 | 77% |

Self-Speculative Decoding & MoE (Failure Cases)

| Target | Draft | Speed Ratio | Avg TPS | Baseline | vs Baseline |
|---|---|---|---|---|---|
| 4B Q8 | 4B Q2 | 1.35× | 40.4 | 54.0 | −25.2% |
| 4B Q8 | 4B Q3 | 1.27× | 39.7 | 54.0 | −26.5% |
| 4B Q8 | 4B Q4 | 1.08× | 40.4 | 54.0 | −25.2% |
| 4B Q8 | 4B Q6 | 1.15× | 40.1 | 54.0 | −25.7% |
| 35B MoE Q4 | 0.8B Q8 | 2.54× | 53.7 | 55.2 | −2.8% |
| 35B MoE Q4 | 4B Q4 | 1.06× | 37.7 | 55.2 | −31.7% |

Experimental Methodology

Three experiments: quantization ladder, QAT vs PTQ, and speculative decoding variants.

Quantization Pipeline (llama.cpp)

1. Source: HuggingFace safetensors (Qwen3.5-4B bf16)
2. Convert: convert_hf_to_gguf.py → GGUF F16
3. Quantize: llama-quantize → Q2_K … Q8_0
4. Benchmark: llama-bench (N_GEN=128, 3 runs) + llama-perplexity (WikiText-2)

Speculative Decoding Setup

1. Tool: llama-speculative (rebuilt with GGML_RPC=ON)
2. Config: N_PREDICT=200, T=0 (greedy), k ∈ {4, 8}
3. Prompts: 3 types (code / math / text)
4. Metric: Effective TPS = tokens / total_time (last occurrence)

Hardware

Three Apple Silicon machines: primary (M2 Ultra) plus remote draft backends (M1 Max, M2 Pro).

M2 Ultra 192 GB (Primary)

Chip: Apple M2 Ultra
Memory: 192 GB unified
Memory BW: 800 GB/s
GPU cores: 76
CPU cores: 24
9B Q8 TG: 42.4 tok/s

M1 Max 32 GB (RPC Backend)

Chip: Apple M1 Max
Memory: 32 GB unified
Memory BW: 400 GB/s
GPU cores: 32
CPU cores: 10
9B Q8 TG: 21.8 tok/s

M2 Pro 32 GB (RPC Backend)

Chip: Apple M2 Pro
Memory: 32 GB unified
Memory BW: 200 GB/s
GPU cores: 19
CPU cores: 12
9B Q8 TG: 12.7 tok/s