Disaggregated LLM Inference on Apple Silicon
We compare five strategies for Qwen3.5 inference across Apple Silicon's compute units —
GPU-only (MLX), hybrid CoreML+GPU (CoreML prefill + MLX decode), direct ANE private API (ANE-LM),
ANE-LM Hybrid (sequential ANE prefill + MLX GPU decode), and ANE-LM Batch (batched ANE dispatch + MLX GPU decode).
Key finding: CoreML compute_units=ALL on macOS 26.3 routes to GPU, not ANE.
Apple Silicon exposes three compute units — GPU, Neural Engine (ANE), and CPU — on a unified
memory bus. We ask: can disaggregating LLM inference phases across different units improve
latency or throughput?
MLX GPU Baseline
Pure GPU inference via MLX/Metal. Dynamic shapes, lazy evaluation. Optimal for
autoregressive decode (bandwidth-bound). Four model variants benchmarked.
CoreML Hybrid
Batched prefill via CoreML, decode via MLX GPU. Requires a custom
KV-cache bridge for Qwen3.5's hybrid DeltaNet + full-attention architecture.
Note: On macOS 26.3, compute_units=ALL routes to GPU, not ANE (ANE power ≈ 0 W).
ANE-LM Hybrid
Sequential ANE prefill + MLX GPU decode via a binary cache bridge. Achieves decode
parity with GPU baseline (67–70 tok/s) while reducing prefill GPU power by
282× (62.05 W → 0.22 W).
TTFT is limited by the private API's sequential dispatch (~42 ms/token), not ANE hardware —
CoreML batched prefill on the same ANE matches GPU speed.
CoreML + MLX-Swift both supported
MLX-Swift supported
† macOS 26.3: CPU_AND_NE triggers ANE IPC daemon deadlock; compute_units=ALL routes to GPU, not ANE (measured ANE power ≈ 0W across all CoreML hybrid runs).
‡ ANE power (~0.22 W GPU + ~1.58 W ANE) measured via ANE-LM private API prefill, which correctly dispatches to ANE hardware. CoreML does NOT use ANE on macOS 26.3.
🔴 Cache bridge (primary overhead): CoreML numpy → MLX array conversion. Zero-copy under unified memory, but DeltaNet state layout alignment incurs a CPU compute cost on every prefill.
⚠️ Fixed seq_len waste: A 32-token prompt still pads to 64, wasting ANE compute. Prompts over 512 tokens need larger model variants.
🔴 GPU prefill saturation (primary bottleneck): With long prompts (128+ tokens), prefill is compute-bound with the GPU at full load. TTFT scales linearly with prompt length.
🔴 Power and thermal (measured, M2 Ultra): GPU prefill draws 62.05 W, 282× more than ANE prefill (0.22 W GPU + 1.58 W ANE). On mobile (TDP ~3–8 W), this would immediately trigger thermal throttling.
⚠️ Decode is also memory-bound: Per-token generation is bandwidth-limited. M1 Max (400 GB/s) becomes a bottleneck for large models (9B+).
✅ Simple architecture, no overhead: No cache bridge, no padding waste, native dynamic-shape support. Lowest overall latency for short prompts.
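The fixed-shape constraint above is easy to quantify. This sketch picks the smallest seq_len variant that fits a prompt and reports the padded (wasted) tokens; the variant set {64, 256, 512} matches the seq_len column in the results tables, while the helper name is illustrative.

```python
# Sketch: pick the smallest fixed-seq_len CoreML variant that fits a prompt
# and report how many padded tokens the ANE computes for nothing.
VARIANT_SEQ_LENS = (64, 256, 512)

def pick_variant(prompt_tokens: int, variants=VARIANT_SEQ_LENS):
    """Return (seq_len, wasted_tokens) for the smallest variant that fits."""
    for seq_len in sorted(variants):
        if prompt_tokens <= seq_len:
            return seq_len, seq_len - prompt_tokens
    raise ValueError(f"prompt of {prompt_tokens} tokens exceeds largest variant")

# A 32-token prompt pads to 64, so half the ANE compute is wasted.
print(pick_variant(32))   # (64, 32)
print(pick_variant(133))  # (256, 123)
```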
Results
Hardware: Apple M2 Ultra, 192 GB, 800 GB/s · Model: Qwen3.5 family · Greedy decoding, 200 tokens generated
Note: bars use log-scale-equivalent visual widths for readability. All values are measured end-to-end. ANE-LM Hybrid's 176× slower TTFT vs CoreML Hybrid is due to sequential dispatch (a private-API limitation), not ANE hardware — both use the same Neural Engine.
CoreML batched prefill + MLX GPU decode for the 2B BF16 model
| Prompt | Tokens | seq_len | TTFT (ms) | Decode (tok/s) | Peak Mem (GB) |
|--------|--------|---------|-----------|----------------|---------------|
| short  | 6      | 64      | 22        | 104.2          | 4.16          |
| medium | 133    | 256     | 54        | 102.3          | 4.37          |
| long   | 410    | 512     | 122       | 100.7          | 4.67          |
2B-BF16: Hybrid vs Baseline TTFT Comparison
| Prompt | Tokens | Baseline TTFT (ms) | Hybrid TTFT (ms) | Ratio         |
|--------|--------|--------------------|------------------|---------------|
| short  | 6      | 22                 | 22               | 1.0× (equal)  |
| medium | 133    | 54                 | 54               | 1.0× (equal)  |
| long   | 410    | 123                | 122              | 0.99× (equal) |
Unlike the 0.8B model, where CoreML dispatch overhead added 200–340 ms, the 2B BF16 model shows an exact TTFT match with the GPU baseline at all prompt lengths. This is consistent with CoreML routing to GPU on macOS 26.3 (ANE power ≈ 0 W): both paths use the same GPU hardware. Decode throughput (100–104 tok/s) matches the baseline (100–101 tok/s).
9B: Hybrid vs Baseline TTFT Comparison
| Prompt | Tokens | Baseline GPU (ms) | Hybrid ANE (ms) | Ratio       |
|--------|--------|-------------------|-----------------|-------------|
| short  | 6      | 39                | 319             | 8.2× slower |
| medium | 133    | 265               | 672             | 2.5× slower |
| long   | 410    | 625               | 1,265           | 2.0× slower |
Unlike 0.8B and 2B, the 9B hybrid approach shows no crossover point — it is always slower than GPU baseline. The 4-chunk CoreML dispatch (vs 1–2 chunks for smaller models) multiplies IPC overhead. Additionally, the mixed-precision cache bridge (FP16 CoreML → 8-bit MLX) causes 11–16% decode throughput degradation (47.6–50.0 vs baseline 56.1–56.5 tok/s).
Key result: ANE prefill saves 60.25 W GPU power (282× reduction).
On mobile devices with 3–8 W total TDP, GPU prefill at 62 W would immediately trigger thermal throttling;
ANE prefill at ~1.8 W fits well within thermal budget.
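The headline figures follow directly from the measured per-phase numbers. A worked check (note the 282× compares GPU draw against residual GPU draw during ANE prefill, not total power):

```python
# Arithmetic behind the headline power numbers (measured values from the text).
GPU_PREFILL_W = 62.05    # MLX GPU prefill, M2 Ultra
ANE_PATH_GPU_W = 0.22    # residual GPU draw during ANE-LM prefill
ANE_PATH_ANE_W = 1.58    # ANE draw during ANE-LM prefill

savings = GPU_PREFILL_W - (ANE_PATH_GPU_W + ANE_PATH_ANE_W)
gpu_reduction = GPU_PREFILL_W / ANE_PATH_GPU_W

print(f"saved per request: {savings:.2f} W")         # 60.25 W
print(f"GPU power reduction: {gpu_reduction:.0f}x")  # 282x
```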
Crossover Point — GPU Core Count vs. ANE Benefit
The prompt length where ANE prefill matches GPU prefill speed scales with GPU core count.
Fewer GPU cores → lower crossover threshold → broader range of prompts benefit from ANE offloading.
Note: On macOS 26.3, these crossover points reflect CoreML GPU vs MLX GPU (not ANE vs GPU).
For genuine ANE prefill on mobile, use the ANE-LM private API batch dispatch which achieves 268 tok/s while drawing only 1.8W total.
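The crossover behavior can be captured by a toy model: a fixed per-call dispatch overhead plus throughput-limited compute. The overhead value and the GPU rates below are illustrative assumptions, not measurements; only the seq512 CoreML figure (4,128 tok/s) appears in the text.

```python
# Toy crossover model: the offloaded path pays a fixed dispatch overhead,
# then processes the prompt at its own rate. The crossover is the prompt
# length where both paths reach equal prefill time.
def crossover_tokens(gpu_rate, offload_rate, overhead_s):
    """Prompt length n where overhead + n/offload_rate == n/gpu_rate.
    A crossover exists only when the offloaded path has higher raw rate."""
    if offload_rate <= gpu_rate:
        return float("inf")  # offloaded path never catches up
    return overhead_s / (1.0 / gpu_rate - 1.0 / offload_rate)

OFFLOAD_RATE = 4128.0  # tok/s, CoreML prefill at seq512 (from the text)
OVERHEAD = 0.30        # s, ASSUMED framework dispatch cost

# Fewer GPU cores -> lower GPU prefill rate -> lower crossover threshold,
# i.e. a broader range of prompts benefits from offloading.
for gpu_rate in (3000.0, 2000.0, 1000.0):
    print(gpu_rate, round(crossover_tokens(gpu_rate, OFFLOAD_RATE, OVERHEAD)))
```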
Full Inference Power Consumption — Baseline MLX (GPU Only)
Measured via powermetrics/asitop during full inference (prefill + decode), 4 runs each. All models on M2 Ultra.
Full Inference Power Consumption — Hybrid ANE (CoreML Prefill + MLX Decode)
| Model     | Prompt | CPU (W) | GPU (W) | ANE (W) | Total (W) |
|-----------|--------|---------|---------|---------|-----------|
| 0.8B FP16 | short  | 7.4     | 14.6    | 0.024   | 22.0      |
| 0.8B FP16 | medium | 10.5    | 5.2     | 0.002   | 15.7      |
| 0.8B FP16 | long   | 10.4    | 6.2     | 0.017   | 16.6      |
| 9B 8-bit  | short  | 8.1     | 31.3    | 0       | 39.4      |
| 9B 8-bit  | medium | 9.5     | 21.5    | 0       | 31.0      |
| 9B 8-bit  | long   | 11.1    | 11.4    | 0       | 22.5      |
⚠️ Key Finding: ANE Power is 0 W
ANE power is essentially 0 W across all hybrid runs — despite using compute_units=ALL,
CoreML routes computation through GPU, not ANE. The "ANE prefill" is a misnomer: CoreML is performing
GPU-based prefill with optimized kernels.
The hybrid pipeline's power savings come from CoreML's more efficient GPU kernel utilization,
not from ANE offloading:
• 0.8B long prompt: 16.6 W (hybrid) vs 25.7 W (baseline) — 35% reduction
• 9B long prompt: 22.5 W (hybrid) vs 53.2 W (baseline) — 58% reduction
• ANE power never exceeds 0.024 W in any hybrid configuration
This revises our earlier per-phase power data (measured via ANE-LM private API), which showed genuine ANE
utilization at 1.58 W. The private API dispatches directly to ANE hardware, whereas CoreML's
compute_units=ALL on macOS 26.3 appears to prefer GPU execution even when ANE is nominally available.
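The reductions quoted in the bullets above can be recomputed directly from the measured wall power (baseline figures are the 25.7 W and 53.2 W long-prompt values cited there):

```python
# Recompute the hybrid-vs-baseline power reductions from measured values.
baseline = {"0.8B long": 25.7, "9B long": 53.2}  # W, MLX GPU baseline
hybrid   = {"0.8B long": 16.6, "9B long": 22.5}  # W, CoreML hybrid

for k in baseline:
    cut = 1 - hybrid[k] / baseline[k]
    print(f"{k}: {cut:.0%} reduction")  # 35% and 58%
```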
CoreML Routes to GPU on macOS 26.3
compute_units=ALL on macOS 26.3 routes computation to GPU, not ANE (ANE power ≈ 0W across all runs).
The 2B BF16 "zero overhead" finding is explained by both CoreML and MLX using the same GPU hardware.
For 0.8B: CoreML framework dispatch overhead (250–400 ms) still exists at short prompts;
at seq512, CoreML GPU matches MLX GPU throughput (4,128 tok/s).
ANE-LM sequential dispatch: ~24 tok/s (42 ms/token). Batch dispatch (32 tok/call):
268 tok/s (0.8B) and 173 tok/s (2B) — 11.3× and 7.3× speedup.
This proves ANE hardware can achieve high throughput when given batched input.
The sequential bottleneck is an API dispatch pattern, not hardware.
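The quoted speedups follow from the per-token dispatch latency; a quick check:

```python
# Sequential private-API dispatch costs ~42 ms/token, i.e. ~23.8 tok/s.
# Batched dispatch amortizes that cost over 32 tokens per call.
SEQ_MS_PER_TOKEN = 42.0
seq_rate = 1000.0 / SEQ_MS_PER_TOKEN  # ~23.8 tok/s

print(round(268.0 / seq_rate, 1))  # 11.3 (0.8B batch dispatch speedup)
print(round(173.0 / seq_rate, 1))  # 7.3  (2B batch dispatch speedup)
```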
Hybrid decode speed (69–73 tok/s) matches or slightly exceeds baseline (69–71 tok/s),
confirming the cache bridge correctly transfers both DeltaNet recurrent states and
full-attention KV caches from CoreML to MLX.
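A minimal sketch of what such a bridge must do, with HYPOTHETICAL tensor layouts (the real DeltaNet/KV layouts are not specified in this document). Under unified memory the array handoff itself can be zero-copy, but any layout or dtype realignment forces a CPU-side copy, which is the per-prefill overhead noted earlier. The sketch uses numpy only, so it stays self-contained.

```python
import numpy as np

def bridge_kv(coreml_kv: np.ndarray, target_dtype=np.float16) -> np.ndarray:
    """Illustrative cache bridge step. ASSUMES CoreML emits
    (batch, seq, heads, head_dim) and the decode side expects
    (batch, heads, seq, head_dim) -- both layouts are hypothetical.
    The transpose + contiguity fix is the CPU cost; the dtype cast
    is a no-op when dtypes already match."""
    aligned = np.ascontiguousarray(coreml_kv.transpose(0, 2, 1, 3))
    return aligned.astype(target_dtype, copy=False)

kv = np.zeros((1, 64, 16, 128), dtype=np.float32)  # fake prefill output
out = bridge_kv(kv)
print(out.shape, out.dtype)  # (1, 16, 64, 128) float16
```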
First-time CoreML model load triggers on-device Metal+ANE kernel compilation:
seq64 ~103 s, seq256 ~50 min, seq512 ~97 min. Results are cached — subsequent
loads take seconds. Plan for one-time compilation cost per machine.
ANE prefill (via private API) measured at 0.22 W GPU + 1.58 W ANE = 1.8 W total, vs GPU prefill at
62.05 W — a 282× reduction. Saves 60.25 W per request. On mobile (TDP ~3–8 W),
GPU prefill would immediately trigger thermal throttling; ANE prefill fits within budget.
Note: This power data is from ANE-LM private API, not CoreML (which routes to GPU on macOS 26.3).
On M5, MLX accesses Neural Accelerators directly via Metal 4 Tensor API —
no CoreML dispatch layer needed. This achieves up to 4× TTFT improvement for
14B models vs. M4 [Apple ML Research, 2025], validating the prefill-compute hypothesis.
Decode Throughput vs. Bandwidth — MLX GPU Baseline
All models, long prompt (410 tokens), M2 Ultra 800 GB/s
| Model     | Weights | Decode (tok/s) |
|-----------|---------|----------------|
| 2B 8-bit  | 2.0 GB  | 138.9          |
| 2B BF16   | 4.0 GB  | 99.7           |
| 0.8B FP16 | 1.6 GB  | 69.0           |
| 9B 8-bit  | 9.5 GB  | 56.5           |
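Since each generated token must stream the full weight set, a simple bandwidth roofline bounds decode throughput at bandwidth / weight_bytes. This sketch compares that bound against the measured numbers above; the bound ignores KV-cache traffic, and small models fall well short of it because per-token launch overhead dominates.

```python
# Bandwidth roofline for decode on M2 Ultra (800 GB/s).
BANDWIDTH_GBS = 800.0
models = {                # weights (GB), measured tok/s (long prompt)
    "2B 8-bit":  (2.0, 138.9),
    "2B BF16":   (4.0, 99.7),
    "0.8B FP16": (1.6, 69.0),
    "9B 8-bit":  (9.5, 56.5),
}
for name, (gb, measured) in models.items():
    bound = BANDWIDTH_GBS / gb  # tok/s upper bound: stream all weights once
    print(f"{name}: bound {bound:.0f} tok/s, measured {measured} "
          f"({measured / bound:.0%} of roofline)")
```

The 9B model runs closest to its roofline (~67%), consistent with the claim that large-model decode is bandwidth-bound.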
Hardware Comparison
M1 Max → M2 Ultra: impact of doubling memory bandwidth