📋 March 2026 · M2 Ultra · M1 Max · M2 Pro · llama.cpp + MLX

Apple Silicon LLM Inference
Quantization, Speculative Decoding & Cross-Device Benchmarks

From quantization ladders to speculative decoding: a systematic empirical study across three Apple Silicon machines.

+25.7%
Speculative decoding speedup (0.8B→9B)
0.18%
PPL degradation at Q8_0 (near-lossless)
2.5×
Minimum draft/target speed ratio for SD benefit
~79%
GGML_RPC cross-device SD overhead (RPC protocol, not network)
↓ Download Paper (PDF) · GitHub

Key Findings

Six results from systematic benchmarking across three Apple Silicon machines, with plain-language takeaways.

📊

Memory Bandwidth Drives Everything

LLM token generation is memory-bound, not compute-bound. M2 Ultra (800 GB/s) runs ~3.3× faster than M2 Pro (200 GB/s) on the same model. Quantization cuts model size → fewer bytes loaded per token → faster generation.

In plain terms
Think of the GPU as a factory and memory bandwidth as the conveyor belt feeding it. Halve the model size and you roughly halve the load time per token, regardless of how fast the factory is.
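A quick back-of-envelope check of the memory-bound claim, using only numbers from the three-machine comparison table (an illustrative sketch, not the paper's analysis):

```python
# If generation is memory-bound, per-token time is roughly
# (bytes streamed per token) / (memory bandwidth), so the cross-machine
# TG ratio should track the bandwidth ratio.
# Figures are from the hardware-comparison table (Qwen3.5-9B Q8_0).

machines = {
    # name: (memory bandwidth GB/s, measured TG tok/s)
    "M2 Ultra": (800, 42.4),
    "M1 Max":   (400, 21.8),
    "M2 Pro":   (200, 12.7),
}

bw_ratio = machines["M2 Ultra"][0] / machines["M2 Pro"][0]
tg_ratio = machines["M2 Ultra"][1] / machines["M2 Pro"][1]
print(f"bandwidth ratio {bw_ratio:.1f}x vs measured TG ratio {tg_ratio:.1f}x")
# A 4x bandwidth gap yields a ~3.3x speed gap; the shortfall is
# non-bandwidth overhead (kernel launches, attention compute, etc.).

# Implied data moved per generated token, if bandwidth were the sole limit:
for name, (bw, tg) in machines.items():
    print(f"{name}: ~{bw / tg:.1f} GB per token at 100% utilization")
```

The per-token figures come out similar across all three machines, which is what a memory-bound workload predicts.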
📈

Q6_K is Pareto-Optimal

Q8_0 is near-lossless (+0.18% PPL). Q6_K ★ dominates Q5_K_M on both speed and quality; versus F16 it is 1.68× faster and 59% smaller, with only 0.54% quality loss. Q4_K_M suits memory-constrained setups. Sub-4-bit is unusable (Q2_K: +267% PPL).

Practical guide
Q8_0 — memory plentiful, near lossless  ·  Q6_K ★ — best balance  ·  Q4_K_M — tight memory  ·  Q2_K — avoid (output degenerates)

Speed Ratio > Acceptance Rate for SD

SD benefit on Apple Silicon is governed by the draft/target speed ratio. A 3.3× ratio yields +25.7% throughput even at a 2–4% acceptance rate, thanks to Metal GPU batch-verification efficiency. Rule of thumb: the draft must run ≥2.5× faster than the target.

In plain terms
The Metal GPU verifies a batch of draft tokens almost as fast as one, so a fast draft model wins even with low acceptance. 0.8B draft (140 tok/s) + 9B target (42 tok/s) = 3.3× ratio → +25.7%.
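The batch-verification point can be made concrete with the M2 Ultra numbers from the benchmark section (a rough sketch; real per-batch cost depends on batch size and kernel behavior):

```python
# The asymmetry behind speculative decoding on Metal: batched prompt
# processing (PP) amortizes the weight streaming that dominates
# one-at-a-time token generation (TG).
# Numbers from the M2 Ultra row of the 9B Q8 benchmark table.

tg = 42.4     # tok/s, sequential generation (memory-bound)
pp = 1163.9   # tok/s, batched prompt processing

ratio = pp / tg
print(f"per-token cost in a large batch is ~{ratio:.0f}x cheaper than TG")

# A small verification batch of k+1 tokens still streams the full weight
# set once, so its wall-clock cost stays close to ONE TG step: the target
# checks k draft tokens for roughly the price of generating one token.
k = 4
approx_verify_ms = 1 / tg * 1e3
print(f"verifying {k}+1 draft tokens: ~{approx_verify_ms:.1f} ms, "
      f"same order as one generated token")
```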

Self-Speculation Fails on Unified Memory

Same-model, different-quant SD (4B Q2 drafting for Q8) achieves only a 1.35× speed ratio, causing a ~25% throughput loss. It requires ≥2.5× to break even.

In plain terms
The draft and target model share the same memory bus, so they can't outrun each other enough to make speculation worthwhile.
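A simple textbook-style cost model (an illustration, not the paper's exact accounting) shows why the speed ratio dominates:

```python
# Per speculation cycle: draft k tokens at speed S_d, then run one
# batched verify pass costing about one target step (1/S_t, per the
# memory-bound argument). To beat plain decoding, the expected tokens
# produced per cycle must cover the cycle's cost measured in baseline
# tokens: E >= k/r + 1, where r = S_d / S_t is the speed ratio.

def tokens_needed_per_cycle(k: int, ratio: float) -> float:
    """Tokens a cycle must yield to match plain autoregressive decoding."""
    return k / ratio + 1

k = 4
for label, ratio in [("0.8B draft vs 9B target", 3.3),
                     ("self-speculation 4B Q2 vs 4B Q8", 1.35)]:
    need = tokens_needed_per_cycle(k, ratio)
    # Each cycle always yields 1 verified/corrected token, so the
    # accepted-draft count required is need - 1.
    print(f"{label}: need {need:.2f} tokens/cycle "
          f"({need - 1:.2f} of {k} drafts accepted just to break even)")
```

At a 1.35× ratio the draft must land roughly 3 of 4 tokens per cycle before SD gains anything, which matches the ~25% loss observed for self-speculation.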
🆕

MoE + SD Conflict

Qwen3.5-35B-A3B MoE achieves 55.2 tok/s (only ~3B active parameters per token). Its sparse expert routing makes draft predictions unreliable; SD offers no benefit.

In plain terms
The 35B MoE already runs at small-model speeds because it activates so few parameters per token. There is no speed headroom left for a draft model to exploit.
💾

QAT Shows Smooth Gradients vs PTQ

Gemma-3-4B QAT throughput spans 96→137 tok/s from 8-bit down to 3-bit, with σ < 0.3 tok/s. PTQ Qwen3-8B shows σ ≈ 19 tok/s variance at low bit-widths (a cross-model comparison only; the architectures differ).

In plain terms
QAT (quantization-aware training) learns to tolerate quantization during training, producing stable, predictable performance. PTQ (quantizing after training) can be erratic at low bit-widths.

Cross-Device SD: GGML_RPC Not Viable

GGML_RPC cross-device SD incurs a ~79–83% throughput reduction from per-op RPC protocol overhead (~51–57 ms/token), not network bandwidth. Initial BLAS-build results (−2%) were invalid due to a silent local fallback.

In plain terms
The bottleneck is protocol design, not the network. GGML_RPC makes on the order of hundreds of round-trips per token. A streaming protocol that batches the whole forward pass into one call would fix this.
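The per-token protocol overhead can be recovered directly from the cross-device table's own numbers:

```python
# Compare the draft model's per-token latency locally vs through
# GGML_RPC, using the cross-device SD table on this page.

cases = {
    # backend: (draft tok/s via RPC, draft tok/s running locally)
    "M1 Max": (16.0, 86.6),
    "M2 Pro": (14.2, 62.9),
}

for name, (rpc_tps, local_tps) in cases.items():
    overhead_ms = (1 / rpc_tps - 1 / local_tps) * 1e3
    print(f"{name}: {overhead_ms:.1f} ms/token of added RPC latency")

# ~51 ms (M1 Max) and ~55 ms (M2 Pro) of per-token protocol overhead.
# At 1 Gbps, shipping a few KB of tokens/logits would take well under
# 1 ms, so network bandwidth cannot explain the gap; per-op round-trips
# in the RPC protocol do.
```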

Benchmark Results

Primary experiments run on the M2 Ultra 192 GB; the hardware comparison covers M2 Ultra, M1 Max, and M2 Pro.

Qwen3.5-4B: Quantization Throughput & Quality

| Quant | TG (tok/s) | PP (tok/s) | Size (GB) | TG Speedup | PPL | ΔPPL |
|---|---|---|---|---|---|---|
| F16 | 36.8 | 1931.6 | 8.42 | 1.00× | 11.055 | baseline |
| Q8_0 | 54.0 | 1373.6 | 4.48 | 1.47× | 11.075 | +0.18% |
| Q6_K ★ | 62.0 | 1358.9 | 3.46 | 1.68× | 11.115 | +0.54% |
| Q5_K_M | 56.6 | 1276.1 | 3.11 | 1.54× | 11.248 | +1.74% |
| Q4_K_M | 58.5 | 1337.6 | 2.71 | 1.59× | 11.504 | +4.07% |
| Q3_K_M | 68.8 | 1611.9 | 2.26 | 1.87× | 12.668 | +14.6% |
| Q2_K | 72.8 | 1700.9 | 1.80 | 1.98× | 40.602 | +267% |
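A minimal Pareto-frontier check over the table above (throughput vs. perplexity loss) confirms which quants are dominated:

```python
# A quant is dominated if another quant is at least as fast AND has no
# worse perplexity degradation. Data from the Qwen3.5-4B table above
# (F16 is the ΔPPL baseline, taken as 0.00).

rows = [
    # (name, TG tok/s, ΔPPL %)
    ("F16",    36.8, 0.00),
    ("Q8_0",   54.0, 0.18),
    ("Q6_K",   62.0, 0.54),
    ("Q5_K_M", 56.6, 1.74),
    ("Q4_K_M", 58.5, 4.07),
    ("Q3_K_M", 68.8, 14.6),
    ("Q2_K",   72.8, 267.0),
]

def dominated(a, others):
    return any(o[1] >= a[1] and o[2] <= a[2] and o != a for o in others)

frontier = [r[0] for r in rows if not dominated(r, rows)]
print("Pareto frontier:", frontier)
# → ['F16', 'Q8_0', 'Q6_K', 'Q3_K_M', 'Q2_K']
# Q5_K_M and Q4_K_M drop out: Q6_K is faster than both with lower ΔPPL.
# Q2_K stays only as the speed-extreme corner; at +267% PPL it is
# unusable in practice, as the findings above note.
```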

Speculative Decoding: 9B Q8 Target (baseline 42.4 tok/s)

| Draft Model | k | Speed Ratio | Avg Accept % | Avg TPS | vs Baseline |
|---|---|---|---|---|---|
| 0.8B Q8 ★ | 4 | 3.31× | 3.3% | 53.3 | +25.7% |
| 0.8B Q8 | 8 | 3.31× | 1.3% | 52.7 | +24.3% |
| 2B Q8 | 4 | 2.61× | 3.4% | 48.5 | +14.3% |
| 2B Q8 | 8 | 2.61× | 1.4% | 47.8 | +12.7% |

Cross-Device SD via GGML_RPC (9B Q8 target on M2 Ultra, 0.8B Q8 draft, k=4, no-BLAS build)

| Draft Backend | Avg TPS | vs Local | Draft tok/s via RPC |
|---|---|---|---|
| Local (M2 Ultra) | 53.8 | baseline | 140.2 (local) |
| M1 Max (1 Gbps RPC) | 11.2 | −79.2% | 16.0 (vs 86.6 local) |
| M2 Pro (1 Gbps RPC) | 9.4 | −82.6% | 14.2 (vs 62.9 local) |

Initial BLAS-build results showed −2% overhead but were invalid: a BLAS/Metal backend conflict caused a silent local fallback (confirmed by acceptance rates identical to the local baseline). The no-BLAS build correctly routes the draft to the remote Metal GPU. The overhead comes from per-op GGML_RPC protocol calls (~51–57 ms/token), not network bandwidth.

Three-Machine Hardware Comparison (Qwen3.5-9B Q8_0)

| Machine | Mem BW | TG (tok/s) | PP (tok/s) | BW Util |
|---|---|---|---|---|
| M2 Ultra (192 GB) | 800 GB/s | 42.4 | 1163.9 | 64% |
| M1 Max (32 GB) | 400 GB/s | 21.8 | 483.7 | 66% |
| M2 Pro (32 GB) | 200 GB/s | 12.7 | 321.3 | 77% |

Self-Speculative Decoding & MoE (Failure Cases)

| Target | Draft | Speed Ratio | Avg TPS | Baseline | vs Baseline |
|---|---|---|---|---|---|
| 4B Q8 | 4B Q2 | 1.35× | 40.4 | 54.0 | −25.2% |
| 4B Q8 | 4B Q3 | 1.27× | 39.7 | 54.0 | −26.5% |
| 4B Q8 | 4B Q4 | 1.08× | 40.4 | 54.0 | −25.2% |
| 4B Q8 | 4B Q6 | 1.15× | 40.1 | 54.0 | −25.7% |
| 35B MoE Q4 | 0.8B Q8 | 2.54× | 53.7 | 55.2 | −2.8% |
| 35B MoE Q4 | 4B Q4 | 1.06× | 37.7 | 55.2 | −31.7% |

Experimental Methodology

Three experiments: quantization ladder, QAT vs PTQ, and speculative decoding variants.

Quantization Pipeline (llama.cpp)

1. Source: HuggingFace safetensors (Qwen3.5-4B bf16)
2. Convert: convert_hf_to_gguf.py → GGUF F16
3. Quantize: llama-quantize → Q2_K … Q8_0
4. Benchmark: llama-bench (N_GEN=128, 3 runs) + llama-perplexity (WikiText-2)

Speculative Decoding Setup

1. Tool: llama-speculative (rebuilt with GGML_RPC=ON)
2. Config: N_PREDICT=200, T=0 (greedy), k ∈ {4, 8}
3. Prompts: 3 types (code / math / text)
4. Metric: Effective TPS = tokens / total_time (last occurrence)

Hardware

Three Apple Silicon machines: primary (M2 Ultra) plus remote draft backends (M1 Max, M2 Pro).

M2 Ultra 192 GB (Primary)

Chip: Apple M2 Ultra
Memory: 192 GB unified
Memory BW: 800 GB/s
GPU cores: 76
CPU cores: 24
9B Q8 TG: 42.4 tok/s

M1 Max 32 GB (RPC Backend)

Chip: Apple M1 Max
Memory: 32 GB unified
Memory BW: 400 GB/s
GPU cores: 32
CPU cores: 10
9B Q8 TG: 21.8 tok/s

M2 Pro 32 GB (RPC Backend)

Chip: Apple M2 Pro
Memory: 32 GB unified
Memory BW: 200 GB/s
GPU cores: 19
CPU cores: 12
9B Q8 TG: 12.7 tok/s