Hybrid ANE-MLX-Bench — Disaggregated LLM Inference on Apple Silicon

Overview概述

Apple Silicon exposes three compute units — GPU, Neural Engine (ANE), and CPU — on a unified memory bus. We ask: can disaggregating LLM inference phases across different units improve latency or throughput?

Apple Silicon 在统一内存总线上集成了 GPU、神经引擎（ANE）和 CPU 三个计算单元。我们研究：将 LLM 推理的预填充与解码阶段分布到不同计算单元，能否改善延迟或吞吐量？

🖥️

MLX GPU BaselineMLX GPU 基准

Pure GPU inference via MLX/Metal. Dynamic shapes, lazy evaluation. Optimal for autoregressive decode (bandwidth-bound). Four model variants benchmarked.

通过 MLX/Metal 实现纯 GPU 推理。动态形状，延迟求值。最适合自回归解码（带宽受限）。测试了四个模型变体。

⚡

Hybrid CoreML + MLX混合 CoreML + MLX

Batched prefill via CoreML (targeting ANE), decode via MLX GPU. Requires a custom KV-cache bridge for Qwen3.5's hybrid DeltaNet + full-attention architecture.

通过 CoreML（面向 ANE）进行批量预填充，通过 MLX GPU 解码。需要为 Qwen3.5 的混合 DeltaNet + 全注意力架构实现自定义 KV 缓存桥接。

🔬

ANE-LM (Private API)ANE-LM（私有 API）

Sequential per-token ANE dispatch via AppleNeuralEngine.framework private APIs. Establishes the single-token ANE dispatch latency floor (~42 ms/token).

通过 AppleNeuralEngine.framework 私有 API 进行逐 token ANE 调度。建立单 token ANE 调度延迟下限（~42 ms/token）。

🔋

ANE-LM HybridANE-LM 混合推理

Sequential ANE prefill + MLX GPU decode via a binary cache bridge. Achieves decode parity with GPU baseline (67–70 tok/s) while reducing prefill GPU power by 282× (62.05 W → 0.22 W). TTFT is limited by the private API's sequential dispatch (~42 ms/token), not ANE hardware — CoreML batched prefill on the same ANE matches GPU speed.

通过二进制缓存桥接实现 ANE 顺序预填充 + MLX GPU 解码。解码速度与 GPU 基准持平（67–70 tok/s），同时将预填充 GPU 功耗降低 282×（62.05 W → 0.22 W）。 TTFT 受限于私有 API 的逐 token 调度（~42 ms/token），而非 ANE 硬件——CoreML 批量预填充在同一 ANE 上可匹敌 GPU 速度。

Approach技术方案

Architecture of the four inference pipelines

四种推理管线的架构

Pipeline 1 — MLX GPU (Baseline)管线 1 — MLX GPU（基准）

Prefill (GPU)预填充（GPU）

Batched forward pass on all prompt tokens via Metal. Dynamic shapes, lazy eval.

所有 prompt token 通过 Metal 批量前向传播。动态形状，延迟求值。

→

Decode (GPU)解码（GPU）

Autoregressive token generation. Bandwidth-bound. KV cache grows dynamically.

自回归 token 生成。带宽受限。KV 缓存动态增长。

Pipeline 2 — Hybrid CoreML + MLX管线 2 — 混合 CoreML + MLX

Prefill (ANE via CoreML)预填充（ANE via CoreML）

Batched forward pass. Fixed seq_len (64/256/512). Left-padded. Outputs cache as numpy arrays.

批量前向传播。固定 seq_len（64/256/512）。左填充。以 numpy 数组输出缓存。

→

Cache Bridge缓存桥接

DeltaNet → ArraysCache (transpose + trim). Full-attn → KVCache (offset = prompt_len).

DeltaNet → ArraysCache（转置 + 裁剪）。全注意力 → KVCache（offset = prompt_len）。

→

Decode (GPU via MLX)解码（GPU via MLX）

Same GPU decode path as baseline. Cache pre-populated by ANE prefill.

与基准相同的 GPU 解码路径。缓存由 ANE 预填充预先填充。

Pipeline 3 — ANE-LM (Private API)管线 3 — ANE-LM（私有 API）

Sequential Prefill (ANE)顺序预填充（ANE）

One token per ANE dispatch call. ~42 ms/token. No batching. TTFT = N_tokens × 42 ms.

每次 ANE 调度处理一个 token。~42 ms/token。无批处理。TTFT = N_tokens × 42 ms。

→

Sequential Decode (ANE + CPU)顺序解码（ANE + CPU）

Matmuls on ANE, attention/norm/sampling on CPU. ~23–24 tok/s (3× slower than GPU).

矩阵乘法在 ANE，注意力/归一化/采样在 CPU。~23–24 tok/s（比 GPU 慢 3×）。

Pipeline 4 — ANE-LM Hybrid管线 4 — ANE-LM 混合推理

Sequential Prefill (ANE)顺序预填充（ANE）

Same as ANE-LM: one token per ANE dispatch call. ~42 ms/token. TTFT identical to Pipeline 3. GPU power: 0.22 W vs 62.05 W for GPU prefill.

与 ANE-LM 相同：每次 ANE 调度处理一个 token。~42 ms/token。TTFT 与管线 3 相同。GPU 功耗：0.22 W vs GPU 预填充的 62.05 W。

→

Cache Bridge (CPU)缓存桥接（CPU）

Binary cache file from ANE-LM subprocess. Reads KV caches and DeltaNet recurrent states, converts layout to MLX format (zero-copy, unified memory).

ANE-LM 子进程生成的二进制缓存文件。读取 KV 缓存和 DeltaNet 循环状态，转换为 MLX 格式（零拷贝，统一内存）。

→

Decode (GPU via MLX)解码（GPU via MLX）

Full MLX GPU decode path. Measured: 67–70 tok/s, matching GPU baseline. 3× faster than ANE-LM pure decode.

完整的 MLX GPU 解码路径。实测：67–70 tok/s，与 GPU 基准持平。比 ANE-LM 纯解码快 3×。

Architecture Comparison架构对比

Data-flow side-by-side: Hybrid ANE+MLX vs pure MLX GPU baseline

数据流并排对比：混合 ANE+MLX 与纯 MLX GPU 基准

🔀 Hybrid Inference (ANE + MLX) 🔀 混合推理（ANE + MLX）

Input输入

Prompt Text提示词

Tokenizer → input_ids + attention_mask
Padded to fixed seq_len [64 / 256 / 512]

Tokenizer → input_ids + attention_mask
填充到固定 seq_len [64 / 256 / 512]

⚠ Must pad to fixed seq_len ⚠ 必须填充到固定 seq_len

ANE · CoreML

Prefill — compute-bound预填充 — 计算受限

compute_units = ALL (macOS 26: CPU_AND_NE causes ANE IPC deadlock)
All prompt tokens processed in parallel
Output: logits + hybrid KV / DeltaNet state

compute_units = ALL（macOS 26：CPU_AND_NE 会导致 ANE IPC 死锁）
所有 prompt token 并行处理
输出：logits + 混合 KV / DeltaNet 状态

✓ Low power — ANE hardware draws ~1–2 W vs GPU's 62 W† ✓ 低功耗 — ANE 约 1–2 W，GPU 高达 62 W†

cache bridge

CPU · Format ConversionCPU · 格式转换

Cache Bridge — core overhead缓存桥接 — 核心开销

CoreML → np.ndarray → mx.array (zero-copy, unified memory)
full_attn KV: offset = prompt_len
DeltaNet state: layout alignment for mlx_lm

CoreML → np.ndarray → mx.array（零拷贝，统一内存）
全注意力 KV：offset = prompt_len
DeltaNet 状态：布局对齐至 mlx_lm

🔴 Overhead on every prefill 🔴 每次预填充均有转换开销

GPU · MLX

Decode — memory-bandwidth-bound解码 — 内存带宽受限

Autoregressive token generation
Dynamic shapes, no padding · Lazy evaluation, Metal GPU

自回归 token 生成
动态形状，无需填充 · 延迟求值，Metal GPU

✓ Low latency, minimal dispatch overhead ✓ 低延迟，调度开销极小

Output输出

Generated Text生成文本

Tokenizer.decode(generated_ids)

⚡ Pure MLX (Baseline) ⚡ 纯 MLX（基准）

Input输入

Prompt Text提示词

Tokenizer → input_ids
Dynamic length, no padding required

Tokenizer → input_ids
动态长度，无需填充

✓ No padding overhead ✓ 无填充开销

GPU · MLX

Prefill — compute-bound · bottleneck预填充 — 计算受限 · 瓶颈

All prompt tokens sent to GPU in parallel
Long prompts: GPU fully loaded at 62.05 W (measured)
Triggers thermal throttling on mobile devices

所有 prompt token 并行送入 GPU
长 prompt：GPU 满负载 62.05 W（实测）
移动端会立即触发散热降频

🔴 High TTFT + high power at long prompts 🔴 长 prompt 首字时延高 + 高功耗

GPU · MLX

Decode — memory-bandwidth-bound解码 — 内存带宽受限

Autoregressive token generation
Dynamic shapes, lazy evaluation
KV cache updated in-place

自回归 token 生成
动态形状，延迟求值
KV 缓存原地更新

✓ No format conversion overhead ✓ 无格式转换开销

Output输出

Generated Text生成文本

Tokenizer.decode(generated_ids)

📊 Phase-by-Phase Comparison 📊 阶段对比分析

Phase / Metric阶段 / 指标	Hybrid Inference (ANE + MLX)混合推理（ANE + MLX）	Pure MLX纯 MLX
Prefill hardware预填充硬件	ANE (CoreML, compute_units=ALL†)ANE（CoreML，compute_units=ALL†）	GPU (Metal)GPU（Metal）
TTFT (long prompt)首字时延（长 prompt）	Lower — ANE parallel compute更低 — ANE 并行算力	Higher — GPU compute-bound更高 — GPU 计算受限
Prefill power (measured‡)预填充功耗（实测‡）	GPU ~0.22 W + ANE ~1.58 W ≈ 1.8 W	GPU 62.05 W (282×)
Decode hardware解码硬件	GPU (MLX)	GPU (MLX)
Decode speed解码速度	= Identical (same GPU path)= 相同（同一 GPU 路径）	= Identical= 相同
Prompt length limitPrompt 长度限制	Must pad to fixed slots: 64 / 256 / 512⚠ 必须填充到固定档位：64 / 256 / 512	Dynamic, unlimited动态，无限制
Cache bridge缓存桥接	CPU format conversion (DeltaNet complex)CPU 格式转换（DeltaNet 较复杂）	None required无需转换
CoreML cold startCoreML 冷启动	seq64: ~103 s · seq256: ~50 min · seq512: ~97 min	None无
Architecture complexity架构复杂度	High — DeltaNet hybrid cache adaptation高 — DeltaNet 混合缓存适配	Low — mlx_lm native support低 — mlx_lm 原生支持
Storage overhead存储开销	HF 4.5 GB + CoreML 5.8 GB = 10.3 GB totalHF 4.5 GB + CoreML 5.8 GB = 总计 10.3 GB	HF model only: 4.5 GB仅 HF 模型：4.5 GB
iOS/iPadOS supportiOS/iPadOS 支持	CoreML + MLX-Swift both supportedCoreML + MLX-Swift 均支持	MLX-Swift supportedMLX-Swift 支持

† macOS 26.3: CPU_AND_NE triggers ANE IPC daemon deadlock; compute_units=ALL must be used instead.
‡ ANE power (~0.22 W GPU + ~1.58 W ANE) measured via ANE-LM private API prefill. CoreML hybrid ANE power not measured separately but expected to be similar as both dispatch to the same ANE hardware. † macOS 26.3：CPU_AND_NE 会触发 ANE IPC 守护进程死锁，必须改用 compute_units=ALL。
‡ ANE 功耗（~0.22 W GPU + ~1.58 W ANE）由 ANE-LM 私有 API 预填充实测。CoreML 混合推理的 ANE 功耗未单独测量，但因两者均调度至同一 ANE 硬件，预计相近。

🔀 Hybrid Inference — Bottleneck Analysis🔀 混合推理 — 瓶颈分析

🔴 Cache Bridge (primary overhead)
CoreML numpy → MLX array conversion. Zero-copy under unified memory, but DeltaNet state layout alignment has CPU compute cost on every prefill. 缓存桥接（主要开销）
CoreML numpy → MLX array 格式转换。统一内存下零拷贝，但 DeltaNet 状态布局对齐在每次预填充后有额外 CPU 计算开销。

⚠️ Fixed seq_len waste
A 32-token prompt still pads to 64, wasting ANE compute. Prompts over 512 tokens need larger model variants. 固定 seq_len 浪费
32 token 的 prompt 也要填充到 64，浪费 ANE 算力。超过 512 token 的 prompt 需要更大模型变体。

⚠️ CoreML cold start
First load triggers on-device Metal+ANE kernel compilation (seq64: ~103 s, seq256: ~50 min, seq512: ~97 min). Subsequent loads take seconds. CoreML 冷启动
首次加载触发设备端 Metal+ANE 核编译（seq64：~103 秒，seq256：~50 分钟，seq512：~97 分钟）。后续加载仅需数秒。

✅ Decode unaffected (verified)
After switching to MLX, decode path is identical to pure MLX. Measured: baseline GPU decode 14.17 W, Hybrid 12.37 W. Throughput equal: 70.0 vs 70.2 tok/s. 解码无影响（已验证）
切换到 MLX 后，解码路径与纯 MLX 完全相同。实测：基准 GPU 解码 14.17 W，混合推理 12.37 W。吞吐量等价：70.0 vs 70.2 tok/s。

⚠️ Storage overhead
Requires HF model (4.5 GB) + CoreML .mlpackage files (seq64 1.9 GB + seq256 1.9 GB + seq512 2.0 GB) = 10.3 GB total, +128% vs pure MLX. 存储开销
需同时保留 HF 模型（4.5 GB）和 CoreML .mlpackage 文件（seq64 1.9 GB + seq256 1.9 GB + seq512 2.0 GB）= 总计 10.3 GB，比纯 MLX 多 +128%。

⚡ Pure MLX — Bottleneck Analysis⚡ 纯 MLX — 瓶颈分析

🔴 GPU prefill saturation (primary bottleneck)
With long prompts (128+ tokens), prefill is compute-bound with GPU at full load. TTFT scales linearly with prompt length. GPU 预填充饱和（主要瓶颈）
长 prompt（128+ token）时，预填充计算受限，GPU 满负载运行。首字时延随 prompt 长度线性增长。

🔴 Power and thermal (measured M2 Ultra)
GPU prefill: 62.05 W — 282× more than ANE prefill (0.22 W GPU + 1.58 W ANE). On mobile (TDP ~3–8 W), immediately triggers thermal throttling. 功耗与散热（M2 Ultra 实测）
GPU 预填充：62.05 W，是 ANE 预填充（0.22 W GPU + 1.58 W ANE）的 282 倍。移动端（TDP ~3–8 W）会立即触发散热降频。

⚠️ Decode is also memory-bound
Per-token generation is bandwidth-limited. M1 Max (400 GB/s) becomes a bottleneck for large models (9B+). 解码同样受内存带宽限制
逐 token 生成受带宽约束。M1 Max（400 GB/s）对大模型（9B+）会成为瓶颈。

✅ Simple architecture, no overhead
No cache bridge, no padding waste, native dynamic shape support. Lowest overall latency for short prompts. 架构简单，无额外开销
无缓存桥接，无填充浪费，原生动态形状支持。短 prompt 场景整体延迟最低。

Results性能数据

Hardware: Apple M2 Ultra, 192 GB, 800 GB/s · Model: Qwen3.5 family · Greedy decoding, 200 tokens generated

硬件：Apple M2 Ultra，192 GB，800 GB/s · 模型：Qwen3.5 系列 · 贪心解码，生成 200 token

Decode Throughput — MLX GPU Baseline (all models)解码吞吐量 — MLX GPU 基准（全部模型）

Model模型	Quant量化	Prompt提示词	TokensToken 数	TTFT (ms)	Decode (tok/s)解码 (tok/s)	Mem (GB)内存 (GB)
Qwen3.5-0.8B	FP16	short短	6	56	71.5	3.42
Qwen3.5-0.8B	FP16	medium中	133	69	70.2	3.58
Qwen3.5-0.8B	FP16	long长	410	96	69.0	4.18
Qwen3.5-2B	8-bit	short短	6	14	141.8	2.51
Qwen3.5-2B	8-bit	medium中	133	73	141.0	2.74
Qwen3.5-2B	8-bit	long长	410	162	138.9	3.23
Qwen3.5-2B	BF16	short短	6	22	101.3	4.16
Qwen3.5-2B	BF16	medium中	133	54	100.6	4.37
Qwen3.5-2B	BF16	long长	410	123	99.7	4.67
Qwen3.5-9B	8-bit	short短	6	39	56.4	9.76
Qwen3.5-9B	8-bit	medium中	133	265	56.1	10.00
Qwen3.5-9B	8-bit	long长	410	625	56.5	10.43

TTFT Comparison — All Four Pipelines (Qwen3.5-0.8B FP16)首字时延对比 — 全部四种管线（Qwen3.5-0.8B FP16）

Prompt提示词	Tokens	MLX GPU (ms)	CoreML Hybrid (ms)	ANE-LM (ms)	ANE-LM Hybrid (ms)	Best最优
short短	6 (18 w/template)	56	274	769	767	GPU
medium中	133 (145 w/ template)	69	411	5,867	6,060	GPU
long长	410 (422 w/template)	96	100	17,831	17,601	GPU ≈ CoreML Hybrid

* ANE-LM uses Qwen3.5 chat template, adding ~12 system-prompt tokens to each input.
* ANE-LM Hybrid: ANE-LM sequential prefill + MLX GPU decode, bridged via binary cache file. All numbers end-to-end measured.
* ANE-LM Hybrid's slow TTFT is caused by the private API's sequential dispatch (~42 ms/token), not ANE hardware speed. CoreML batched prefill achieves 4,128 tok/s on the same ANE — proving the hardware can match GPU throughput when given batched input.

* ANE-LM 使用 Qwen3.5 对话模板，每个输入额外增加 ~12 个系统提示 token。
* ANE-LM Hybrid：ANE-LM 顺序预填充 + MLX GPU 解码，通过二进制缓存文件桥接。所有数据均为端到端实测。
* ANE-LM Hybrid 的高 TTFT 源于私有 API 的逐 token 调度（~42 ms/token），而非 ANE 硬件速度。CoreML 批量预填充在同一 ANE 上达到 4,128 tok/s——证明硬件在批量输入下可匹敌 GPU 吞吐。

Decode Throughput — 0.8B FP16 (all backends)解码吞吐量 — 0.8B FP16（全部后端）

Backend后端	Prompt提示词	Decode (tok/s)解码 (tok/s)	vs. GPUvs. GPU
MLX GPU	short短	71.5	—
MLX GPU	medium中	70.2	—
MLX GPU	long长	69.0	—
CoreML Hybrid	short短	69.2	0.97×
CoreML Hybrid	medium中	71.3	1.02×
CoreML Hybrid	long长	73.3	1.06×
ANE-LM	short短	24.3	0.34×
ANE-LM	medium中	23.8	0.34×
ANE-LM	long长	22.8	0.33×
ANE-LM Hybrid	short短	66.6	0.93×
ANE-LM Hybrid	medium中	70.0	1.00×
ANE-LM Hybrid	long长	69.7	1.01×

Decode Speed — Visual (Qwen3.5-0.8B, long prompt)解码速度可视化（Qwen3.5-0.8B，长 prompt）

MLX GPU BaselineMLX GPU 基准

69.0 tok/s

CoreML Hybrid (ANE prefill + GPU decode)CoreML 混合推理（ANE 预填充 + GPU 解码）

73.3 tok/s

ANE-LM Private APIANE-LM 私有 API

22.8 tok/s

ANE-LM Hybrid (cache bridge, MLX GPU decode)ANE-LM 混合推理（缓存桥接，MLX GPU 解码）

69.7 tok/s

All decode speeds are end-to-end measured. ANE-LM Hybrid uses a binary cache bridge to transfer prefill state from ANE-LM to MLX GPU decode.

所有解码速度均为端到端实测。ANE-LM Hybrid 通过二进制缓存桥接将预填充状态从 ANE-LM 转移至 MLX GPU 解码。

TTFT Comparison — Visual (long prompt, 410 tokens)首字时延可视化（长 prompt，410 token）

MLX GPU — 96 ms

96 ms

CoreML Hybrid — 100 ms

100 ms

ANE-LM Private API — 17,831 ms

17,831 ms

ANE-LM Hybrid (cache bridge) — 17,601 ms

17,601 ms

Note: bars use log-scale equivalent visual widths for readability. All values end-to-end measured. ANE-LM Hybrid's 176× slower TTFT vs CoreML Hybrid is due to sequential dispatch (private API limitation), not ANE hardware — both use the same Neural Engine.

注：条形宽度使用对数等效视觉宽度以便阅读。所有数据均为端到端实测。ANE-LM Hybrid 比 CoreML Hybrid 慢 176× 源于逐 token 调度（私有 API 限制），而非 ANE 硬件——两者使用相同的神经引擎。

Prompt提示词	Tokens	seq_len	TTFT (ms)	Decode (tok/s)解码 (tok/s)	Peak Mem (GB)峰值内存 (GB)
short短	6	64	22	104.2	4.16
medium中	133	256	54	102.3	4.37
long长	410	512	122	100.7	4.67

Prompt提示词	Tokens	Baseline TTFT (ms)基准 TTFT (ms)	Hybrid TTFT (ms)混合 TTFT (ms)	Ratio比率
short短	6	22	22	1.0× (equal)（持平）
medium中	133	54	54	1.0× (equal)（持平）
long长	410	123	122	0.99× (equal)（持平）

Prompt提示	TokensToken 数	seq_len	TTFT (ms)	Decode (tok/s)解码速度	Peak Mem峰值内存
short短	6	64	319	50.0	10.12 GB
medium中	133	256	672	49.7	10.13 GB
long长	410	512	1,265	47.6	10.14 GB

Prompt提示	TokensToken 数	Baseline GPU基准 GPU	Hybrid ANE混合 ANE	Ratio比值
short短	6	39	319	8.2× slower更慢
medium中	133	265	672	2.5× slower更慢
long长	410	625	1,265	2.0× slower更慢

Power Efficiency功耗分析

Measured with powermetrics on M2 Ultra · Qwen3.5-0.8B FP16 · 100 ms sampling interval

使用 powermetrics 在 M2 Ultra 上测量 · Qwen3.5-0.8B FP16 · 100 ms 采样间隔

Phase阶段	GPU PowerGPU 功耗	ANE PowerANE 功耗	CPU PowerCPU 功耗	Notes备注
MLX GPU prefill (sustained loop)MLX GPU 预填充（持续循环）	62.05 W	0.00 W	2.34 W	332 iterations × 96 ms over 15 s15 秒内循环 332 次 × 96 ms
MLX GPU decodeMLX GPU 解码	14.17 W	0.00 W	2.98 W	200 tokens generated生成 200 token
ANE-LM prefill (private API)ANE-LM 预填充（私有 API）	0.22 W	1.58 W	3.77 W	17.6 s naturally sustained自然持续 17.6 秒
ANE-LM Hybrid decode (MLX GPU)ANE-LM 混合推理解码（MLX GPU）	12.37 W	0.00 W	5.85 W	Same GPU decode path as baseline与基准相同的 GPU 解码路径
ANE-LM pure decodeANE-LM 纯解码	0.16 W	1.42 W	5.47 W	ANE+CPU decode, ~3× slowerANE+CPU 解码，慢约 3×

Key result: ANE prefill saves 60.25 W GPU power (282× reduction). On mobile devices with 3–8 W total TDP, GPU prefill at 62 W would immediately trigger thermal throttling; ANE prefill at ~1.8 W fits well within thermal budget.

核心结论：ANE 预填充节省 60.25 W GPU 功耗（降低 282×）。移动端 TDP 仅 3–8 W，GPU 预填充的 62 W 会立即触发散热降频；ANE 预填充的 ~1.8 W 完全在热功耗预算以内。

Crossover Point — GPU Core Count vs. ANE Benefit交叉阈值 — GPU 核心数与 ANE 收益

The prompt length where ANE prefill matches GPU prefill speed scales with GPU core count. Fewer GPU cores → lower crossover threshold → broader range of prompts benefit from ANE offloading.

ANE 预填充速度追上 GPU 的 prompt 长度阈值随 GPU 核心数缩放。核心数越少 → 阈值越低 → 更多 prompt 可受益于 ANE 卸载。

Chip芯片	GPU CoresGPU 核心数	Est. GPU Prefill (tok/s)GPU 预填充估算 (tok/s)	Crossover Length交叉阈值	Source来源
M2 Ultra	76	~4,128	~410 tokens	Measured实测
M1 Max	32	~2,100	~200 tokens	Estimated估算
M1	8	~500	~50 tokens	Estimated估算
A17 Pro	6	~400	~40 tokens	Estimated估算

On devices with 6–8 GPU cores (most iPhones), ANE prefill is faster for virtually all practical prompt lengths while consuming orders of magnitude less power.

在 6–8 GPU 核心的设备（大多数 iPhone）上，ANE 预填充对几乎所有实际 prompt 长度都更快，同时功耗低数个数量级。

Full Inference Power Consumption — Baseline MLX (GPU Only)完整推理功耗 — 基准 MLX（纯 GPU）

Measured via powermetrics/asitop during full inference (prefill + decode), 4 runs each. All models on M2 Ultra.

使用 powermetrics/asitop 在完整推理（预填充 + 解码）期间测量，每项 4 次运行。全部模型在 M2 Ultra 上。

Model模型	Prompt提示词	CPU (W)	GPU (W)	Total (W)总计 (W)
0.8B FP16	short短	9.5	6.7	16.2
0.8B FP16	medium中	7.0	18.0	25.0
0.8B FP16	long长	6.7	19.0	25.7
2B 8-bit	short短	9.0	21.2	30.2
2B 8-bit	medium中	8.4	25.8	34.2
2B 8-bit	long长	8.7	30.9	39.6
2B BF16	short短	8.8	19.3	28.1
2B BF16	medium中	8.5	21.3	29.8
2B BF16	long长	7.9	23.7	31.6
9B 8-bit	short短	6.6	36.5	43.1
9B 8-bit	medium中	6.2	41.6	47.8
9B 8-bit	long长	6.3	46.9	53.2

Full Inference Power Consumption — Hybrid ANE (CoreML Prefill + MLX Decode)完整推理功耗 — 混合 ANE（CoreML 预填充 + MLX 解码）

Model模型	Prompt提示词	CPU (W)	GPU (W)	ANE (W)	Total (W)总计 (W)
0.8B FP16	short短	7.4	14.6	0.024	22.0
0.8B FP16	medium中	10.5	5.2	0.002	15.7
0.8B FP16	long长	10.4	6.2	0.017	16.6
9B 8-bit	short短	8.1	31.3	0	39.4
9B 8-bit	medium中	9.5	21.5	0	31.0
9B 8-bit	long长	11.1	11.4	0	22.5

⚠️

Key Finding: ANE Power is 0 W 关键发现：ANE 功耗为 0 W

ANE power is essentially 0 W across all hybrid runs — despite using compute_units=ALL, CoreML routes computation through GPU, not ANE. The "ANE prefill" is a misnomer: CoreML is performing GPU-based prefill with optimized kernels.

The hybrid pipeline's power savings come from CoreML's more efficient GPU kernel utilization, not from ANE offloading:
• 0.8B long prompt: 16.6 W (hybrid) vs 25.7 W (baseline) — 35% reduction
• 9B long prompt: 22.5 W (hybrid) vs 53.2 W (baseline) — 58% reduction
• ANE power never exceeds 0.024 W in any hybrid configuration

所有混合推理运行中 ANE 功耗实质上为 0 W — 尽管使用了 compute_units=ALL， CoreML 将计算路由到 GPU 而非 ANE。"ANE 预填充"是误称：CoreML 实际执行的是基于 GPU 的优化内核预填充。

混合管线的功耗节省来自 CoreML 更高效的 GPU 内核利用，而非 ANE 卸载：
• 0.8B 长 prompt：16.6 W（混合）vs 25.7 W（基准）— 降低 35%
• 9B 长 prompt：22.5 W（混合）vs 53.2 W（基准）— 降低 58%
• 任何混合配置中 ANE 功耗均未超过 0.024 W

This revises our earlier per-phase power data (measured via ANE-LM private API), which showed genuine ANE utilization at 1.58 W. The private API dispatches directly to ANE hardware, whereas CoreML's compute_units=ALL on macOS 26.3 appears to prefer GPU execution even when ANE is nominally available.

这修正了我们早期的分阶段功耗数据（通过 ANE-LM 私有 API 测量），后者显示真正的 ANE 利用为 1.58 W。私有 API 直接调度到 ANE 硬件，而 CoreML 在 macOS 26.3 上的 compute_units=ALL 即使 ANE 名义上可用，也似乎更倾向于 GPU 执行。

Key Findings核心发现

What we learned about Apple Silicon LLM inference

我们从 Apple Silicon LLM 推理中得出的结论

⚡

Decode is Bandwidth-Bound解码受内存带宽限制

Decode throughput tracks memory bandwidth, not FLOP count: 2B 8-bit (141 tok/s) > 2B BF16 (101 tok/s) > 0.8B FP16 (71 tok/s) > 9B 8-bit (56 tok/s). M2 Ultra (800 GB/s) is ~1.9× faster than M1 Max (400 GB/s) for all models.

解码吞吐量与内存带宽正相关，而非 FLOP 数：2B 8-bit（141 tok/s）> 2B BF16（101 tok/s）> 0.8B FP16（71 tok/s）> 9B 8-bit（56 tok/s）。M2 Ultra（800 GB/s）对所有模型约比 M1 Max（400 GB/s）快 1.9×。

🎯

ANE Crossover Depends on Model SizeANE 交叉阈值取决于模型大小

For 0.8B: CoreML dispatch overhead (250–400 ms) dominates below ~410 tokens — hybrid only matches GPU at seq512 (4,128 tok/s).
For 2B BF16: zero dispatch overhead at all prompt lengths — hybrid TTFT equals GPU baseline even at 6 tokens. Larger models amortize CoreML dispatch cost better.

0.8B 模型：CoreML 调度开销（250–400 ms）在低于 ~410 token 时主导——混合方案仅在 seq512 时与 GPU 持平（4,128 tok/s）。
2B BF16 模型：所有 prompt 长度零调度开销 — 混合 TTFT 即使在 6 token 也与 GPU 基准持平。更大的模型更好地摊销 CoreML 调度成本。

🔍

Sequential ANE ≠ Batched ANE顺序 ANE ≠ 批量 ANE

ANE-LM's per-token dispatch via private API yields ~24 tok/s — 3× slower than GPU. Each ANE kernel call takes ~42 ms, making TTFT proportional to prompt length. Batched CoreML processing is fundamentally different from sequential dispatch.

ANE-LM 的私有 API 逐 token 调度约 24 tok/s——比 GPU 慢 3×。每次 ANE 核调用约 43 ms，首字时延与 prompt 长度成正比。批量 CoreML 处理与顺序调度有本质区别。

✅

Cache Bridge Works Correctly缓存桥接正确有效

Hybrid decode speed (69–73 tok/s) matches or slightly exceeds baseline (69–71 tok/s), confirming the cache bridge correctly transfers both DeltaNet recurrent states and full-attention KV caches from CoreML to MLX.

混合推理解码速度（69–73 tok/s）与基准（69–71 tok/s）持平甚至略高，证实缓存桥接正确将 DeltaNet 循环状态和全注意力 KV 缓存从 CoreML 无损转移至 MLX。

⏱️

CoreML First-Load CompilationCoreML 首次加载编译

First-time CoreML model load triggers on-device Metal+ANE kernel compilation: seq64 ~103 s, seq256 ~50 min, seq512 ~97 min. Results are cached — subsequent loads take seconds. Plan for one-time compilation cost per machine.

首次加载 CoreML 模型会触发设备端 Metal+ANE 核编译：seq64 约 103 秒，seq256 约 50 分钟，seq512 约 97 分钟。编译结果缓存后续加载仅需数秒。每台机器需规划一次性编译时间。

🔋

282× GPU Power ReductionGPU 功耗降低 282×

ANE prefill measured at 0.22 W GPU + 1.58 W ANE = 1.8 W total, vs GPU prefill at 62.05 W — a 282× reduction. Saves 60.25 W per request. On mobile (TDP ~3–8 W), GPU prefill would immediately trigger thermal throttling; ANE prefill fits within budget.

ANE 预填充实测：GPU 0.22 W + ANE 1.58 W = 总计 1.8 W，对比 GPU 预填充 62.05 W——降低 282×。每次请求节省 60.25 W。移动端（TDP ~3–8 W）GPU 预填充会立即触发散热降频；ANE 预填充完全在热功耗预算以内。

📈

Larger Models + Longer Prompts = Best Hybrid越大模型 + 越长 Prompt = 混合推理越有利

Model size ↑ → dispatch overhead ↓: 0.8B (hidden=1024) has 250 ms overhead at seq64; 2B (hidden=1536) has zero overhead. Larger hidden dims increase compute-per-dispatch, amortizing fixed IPC cost.
Prompt length ↑ → prefill throughput ↑: 0.8B goes from 22 tok/s (seq64) to 4,128 tok/s (seq512). Less padding waste = higher ANE utilization.
Tradeoffs: conversion time grows super-linearly (2B seq512: ~120 min vs 0.8B: ~22 min); first-load ANE compilation takes 15+ min for 2B; storage doubles (MLX weights + CoreML packages).
On mobile (A-series, 6–8 GPU cores), crossover falls below 64 tokens — nearly all prompts benefit from ANE prefill.

模型越大 → 调度开销越小：0.8B（hidden=1024）在 seq64 有 250 ms 开销；2B（hidden=1536）零开销。更大的隐层维度增加每次调度的计算量，摊销固定 IPC 成本。
Prompt 越长 → 预填充吞吐越高：0.8B 从 22 tok/s（seq64）到 4,128 tok/s（seq512）。更少的填充浪费 = 更高的 ANE 利用率。
反面约束：转换时间超线性增长（2B seq512: ~120 min vs 0.8B: ~22 min）；首次 ANE 编译 2B 需 15+ min；存储翻倍（MLX 权重 + CoreML 模型）。
在移动设备（A 系列，6–8 GPU 核心）上，交叉阈值降至 64 token 以下——几乎所有 prompt 都受益于 ANE 预填充。

🔮

M5: Native ANE via Metal 4M5：通过 Metal 4 原生访问 ANE

On M5, MLX accesses Neural Accelerators directly via Metal 4 Tensor API — no CoreML dispatch layer needed. This achieves up to 4× TTFT improvement for 14B models vs. M4 [Apple ML Research, 2025], validating the prefill-compute hypothesis.

在 M5 上，MLX 通过 Metal 4 Tensor API 直接访问神经加速器，无需 CoreML 调度层。相比 M4 对 14B 模型可实现最高 4× 首字时延改善 [Apple ML Research, 2025]，验证了预填充算力的核心假设。

Decode Throughput vs. Bandwidth — MLX GPU Baseline解码吞吐量 vs. 内存带宽 — MLX GPU 基准

All models, long prompt (410 tokens), M2 Ultra 800 GB/s

所有模型，长 prompt（410 token），M2 Ultra 800 GB/s

2B 8-bit — 2.0 GB weights2B 8-bit — 权重 2.0 GB

138.9 tok/s

2B BF16 — 4.0 GB weights2B BF16 — 权重 4.0 GB

99.7 tok/s

0.8B FP16 — 1.6 GB weights0.8B FP16 — 权重 1.6 GB

69.0 tok/s

9B 8-bit — 9.5 GB weights9B 8-bit — 权重 9.5 GB

56.5 tok/s

Spec规格	M1 Max	M2 Ultra	Ratio比值
CPU coresCPU 核心数	10 (8P+2E)	24 (16P+8E)	2.4×
GPU coresGPU 核心数	32	76	2.4×
ANE coresANE 核心数	16	32	2.0×
ANE TOPS	15.8	31.6	2.0×
Unified memory统一内存	32 GB	192 GB	6.0×
Memory bandwidth内存带宽	400 GB/s	800 GB/s	2.0×
0.8B decode0.8B 解码速度	~37 tok/s	71 tok/s	1.9×
2B 8-bit decode2B 8-bit 解码速度	~95 tok/s	141 tok/s	1.5×
9B 8-bit decode9B 8-bit 解码速度	~30 tok/s	56 tok/s	1.9×

Disaggregated LLM Inference
on Apple Silicon Apple Silicon 异构 LLM 推理

Overview概述

Approach技术方案

Pipeline 1 — MLX GPU (Baseline)管线 1 — MLX GPU（基准）

Pipeline 2 — Hybrid CoreML + MLX管线 2 — 混合 CoreML + MLX

Pipeline 3 — ANE-LM (Private API)管线 3 — ANE-LM（私有 API）

Pipeline 4 — ANE-LM Hybrid管线 4 — ANE-LM 混合推理

Architecture Comparison架构对比

🔀 Hybrid Inference — Bottleneck Analysis🔀 混合推理 — 瓶颈分析

⚡ Pure MLX — Bottleneck Analysis⚡ 纯 MLX — 瓶颈分析

Results性能数据

Decode Throughput — MLX GPU Baseline (all models)解码吞吐量 — MLX GPU 基准（全部模型）

TTFT Comparison — All Four Pipelines (Qwen3.5-0.8B FP16)首字时延对比 — 全部四种管线（Qwen3.5-0.8B FP16）

Decode Throughput — 0.8B FP16 (all backends)解码吞吐量 — 0.8B FP16（全部后端）

Decode Speed — Visual (Qwen3.5-0.8B, long prompt)解码速度可视化（Qwen3.5-0.8B，长 prompt）

TTFT Comparison — Visual (long prompt, 410 tokens)首字时延可视化（长 prompt，410 token）

Qwen3.5-2B BF16 — Hybrid CoreML + MLXQwen3.5-2B BF16 — 混合 CoreML + MLX

2B-BF16: Hybrid vs Baseline TTFT Comparison2B-BF16：混合推理 vs 基准 TTFT 对比

Qwen3.5-9B 8-bit — Hybrid CoreML + MLXQwen3.5-9B 8-bit — 混合 CoreML + MLX

9B: Hybrid ANE Results9B：混合 ANE 结果

9B: Hybrid vs Baseline TTFT Comparison9B：混合推理 vs 基准 TTFT 对比

Power Efficiency功耗分析

Crossover Point — GPU Core Count vs. ANE Benefit交叉阈值 — GPU 核心数与 ANE 收益

Full Inference Power Consumption — Baseline MLX (GPU Only)完整推理功耗 — 基准 MLX（纯 GPU）

Full Inference Power Consumption — Hybrid ANE (CoreML Prefill + MLX Decode)完整推理功耗 — 混合 ANE（CoreML 预填充 + MLX 解码）

Key Findings核心发现

Decode Throughput vs. Bandwidth — MLX GPU Baseline解码吞吐量 vs. 内存带宽 — MLX GPU 基准

Hardware Comparison硬件对比

Disaggregated LLM Inferenceon Apple Silicon Apple Silicon 异构 LLM 推理

Overview概述

Approach技术方案

Pipeline 1 — MLX GPU (Baseline)管线 1 — MLX GPU（基准）

Pipeline 2 — Hybrid CoreML + MLX管线 2 — 混合 CoreML + MLX

Pipeline 3 — ANE-LM (Private API)管线 3 — ANE-LM（私有 API）

Pipeline 4 — ANE-LM Hybrid管线 4 — ANE-LM 混合推理

Architecture Comparison架构对比

🔀 Hybrid Inference — Bottleneck Analysis🔀 混合推理 — 瓶颈分析

⚡ Pure MLX — Bottleneck Analysis⚡ 纯 MLX — 瓶颈分析

Results性能数据

Decode Throughput — MLX GPU Baseline (all models)解码吞吐量 — MLX GPU 基准（全部模型）

TTFT Comparison — All Four Pipelines (Qwen3.5-0.8B FP16)首字时延对比 — 全部四种管线（Qwen3.5-0.8B FP16）

Decode Throughput — 0.8B FP16 (all backends)解码吞吐量 — 0.8B FP16（全部后端）

Decode Speed — Visual (Qwen3.5-0.8B, long prompt)解码速度可视化（Qwen3.5-0.8B，长 prompt）

TTFT Comparison — Visual (long prompt, 410 tokens)首字时延可视化（长 prompt，410 token）

Qwen3.5-2B BF16 — Hybrid CoreML + MLXQwen3.5-2B BF16 — 混合 CoreML + MLX

2B-BF16: Hybrid vs Baseline TTFT Comparison2B-BF16：混合推理 vs 基准 TTFT 对比

Qwen3.5-9B 8-bit — Hybrid CoreML + MLXQwen3.5-9B 8-bit — 混合 CoreML + MLX

9B: Hybrid ANE Results9B：混合 ANE 结果

9B: Hybrid vs Baseline TTFT Comparison9B：混合推理 vs 基准 TTFT 对比

Power Efficiency功耗分析

Crossover Point — GPU Core Count vs. ANE Benefit交叉阈值 — GPU 核心数与 ANE 收益

Full Inference Power Consumption — Baseline MLX (GPU Only)完整推理功耗 — 基准 MLX（纯 GPU）

Full Inference Power Consumption — Hybrid ANE (CoreML Prefill + MLX Decode)完整推理功耗 — 混合 ANE（CoreML 预填充 + MLX 解码）

Key Findings核心发现

Decode Throughput vs. Bandwidth — MLX GPU Baseline解码吞吐量 vs. 内存带宽 — MLX GPU 基准

Hardware Comparison硬件对比

Disaggregated LLM Inference
on Apple Silicon Apple Silicon 异构 LLM 推理