ANE Batch Prefill for On-Device Parallel LLM Inference

Enabling concurrent ANE prefill and GPU decode on Apple Silicon via batched ANE dispatch with the private AppleNeuralEngine framework.

M2 Ultra · 76-core GPU · 32-core ANE · Qwen3.5-0.8B / 2B · ANE_SPATIAL = 32 · Companion to hybrid-ane-mlx-bench
11.3×    prefill speedup vs sequential dispatch
268      tok/s batch prefill (Qwen3.5-0.8B)
<30 ms   state transfer overhead
282×     GPU power reduction during ANE prefill

Key Findings

11.3× Prefill Speedup

ANE batch prefill (32 tokens per dispatch) reaches 268 tok/s, versus 23.7 tok/s for the sequential dispatch of our prior work, making private-API prefill practical for interactive use.
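The gain is consistent with a fixed per-dispatch cost being amortized across the batch. The sketch below back-solves that cost from the two measured throughputs; the implied per-dispatch overhead and per-token compute are inferences from the published numbers, not direct measurements:

```python
# Measured prefill throughputs (tok/s) for Qwen3.5-0.8B, 74-token prompt:
seq_tps = 23.7     # sequential dispatch, 1 token per ANE call (prior work)
batch_tps = 268.0  # batched dispatch, 32 tokens per ANE call (this work)
B = 32

# Model: each dispatch costs a fixed overhead h plus c per token (both in ms):
#   1-token dispatch:  h + c     = 1000 / seq_tps
#   B-token dispatch:  h + B * c = B * 1000 / batch_tps
t1 = 1000 / seq_tps
tB = B * 1000 / batch_tps
c = (tB - t1) / (B - 1)  # implied per-token compute, ms
h = t1 - c               # implied fixed dispatch overhead, ms

print(f"speedup: {batch_tps / seq_tps:.1f}x")  # 11.3x
print(f"implied overhead ~{h:.0f} ms/dispatch, ~{c:.1f} ms/token")
```

Under this model the fixed dispatch overhead dominates sequential prefill, which is exactly why batching 32 tokens per call recovers an order of magnitude.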

ANE Beats GPU on Short Prompts

At 13 tokens, ANE batch prefill (149 tok/s) outperforms GPU prefill (138 tok/s) for the 0.8B model. This short-prompt advantage matters for chatbots and code completion.
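One practical consequence is a length-based router that sends short prompts to the ANE and longer ones to the GPU. A minimal sketch using the Table 1 throughputs (the crossover point will differ per model and chip):

```python
# Prefill throughput (tok/s) by prompt length, 0.8B, from Table 1:
ane = {13: 148.7, 28: 237.5, 74: 268.0}
mlx = {13: 138.2, 28: 290.0, 74: 736.2}

# Route each prompt length to whichever backend prefills it faster:
route = {n: ("ANE" if ane[n] > mlx[n] else "GPU") for n in ane}
print(route)  # ANE wins only at the shortest prompt
```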

Concurrent Pipeline Feasibility

ANE prefill draws only 0.22 W of GPU power, leaving the GPU free to simultaneously decode another request at ~90 tok/s, with zero GPU contention between prefill and decode.
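The zero-contention claim means the two phases behave like independent execution streams. A toy sketch with `time.sleep` stand-ins (the function bodies and durations are placeholders, not the real engines) shows wall-clock time approaching max(prefill, decode) rather than their sum:

```python
import threading
import time

def ane_prefill():
    """Placeholder for ANE batch prefill of request k+1."""
    time.sleep(0.2)

def gpu_decode():
    """Placeholder for MLX GPU decode of request k."""
    time.sleep(0.3)

start = time.perf_counter()
threads = [threading.Thread(target=ane_prefill),
           threading.Thread(target=gpu_decode)]
for t in threads:
    t.start()
for t in threads:
    t.join()
wall = time.perf_counter() - start

# With no shared resource, total time is ~max(0.2, 0.3) s, not 0.5 s.
print(f"wall time: {wall:.2f} s")
```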

Negligible Transfer Overhead

State serialization (save + read + cache construction) totals under 30 ms for both the 0.8B and 2B models; the binary format does not bottleneck the hybrid pipeline.

Benchmark Results

All values are averaged over 10 runs, temperature = 0, max 20 decode tokens. M2 Ultra, 192 GB.
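The averaging protocol can be expressed as a small timing harness; the `run` callable below is a placeholder for one full prefill-plus-decode pass, and the warmup count is an assumption rather than the benchmark's documented setting:

```python
import statistics
import time

def bench(run, n=10, warmup=1):
    """Mean wall-clock seconds of run() over n timed iterations after warmup."""
    for _ in range(warmup):
        run()
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        run()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times)

# Demo with a trivial workload standing in for an inference pass:
mean_s = bench(lambda: sum(range(1000)), n=10)
print(f"mean: {mean_s * 1e6:.1f} us")
```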

Qwen3.5-0.8B FP16

Table 1. Throughput comparison: Pure ANE, Pure MLX (GPU), and Hybrid (ANE prefill + MLX decode).

Prompt   Mode    Tokens  Prefill tok/s  Decode tok/s  Total ms
Short    ANE     13      148.7          17.6          2,931
Short    MLX     13      138.2          61.2          3,679
Short    Hybrid  13      148.3          39.4          6,050
Medium   ANE     28      237.5          16.4          3,645
Medium   MLX     28      290.0          64.5          3,836
Medium   Hybrid  28      237.9          53.9          6,280
Long     ANE     74      268.0          17.1          3,774
Long     MLX     74      736.2          64.1          3,846
Long     Hybrid  74      271.1          54.0          6,335

Qwen3.5-2B BF16

Table 2. Throughput comparison for the larger 2B model.

Prompt   Mode    Tokens  Prefill tok/s  Decode tok/s  Total ms
Short    ANE     13      91.7           12.4          5,271
Short    MLX     13      179.6          90.8          4,076
Short    Hybrid  13      92.8           22.6          8,878
Medium   ANE     28      160.4          12.5          6,189
Medium   MLX     28      358.0          94.1          4,212
Medium   Hybrid  28      159.3          29.0          9,281
Long     ANE     74      172.7          12.7          6,455
Long     MLX     74      828.6          92.4          4,274
Long     Hybrid  74      173.1          29.2          9,518

Prefill Throughput: 0.8B

[Bar chart: prefill tok/s by prompt length, same values as Table 1 — ANE 148.7 / 237.5 / 268.0, MLX 138.2 / 290.0 / 736.2, Hybrid 148.3 / 237.9 / 271.1 for Short / Medium / Long.]

Prefill Throughput: 2B

[Bar chart: prefill tok/s by prompt length, same values as Table 2 — ANE 91.7 / 160.4 / 172.7, MLX 179.6 / 358.0 / 828.6, Hybrid 92.8 / 159.3 / 173.1 for Short / Medium / Long.]

Batch vs Sequential Dispatch (74 tokens)

Model  Dispatch mode                Prefill tok/s  Speedup  Source
0.8B   Sequential (1 tok/dispatch)  23.7           1.0×     Prior work
0.8B   Batch (32 tok/dispatch)      268.0          11.3×    This work
0.8B   GPU baseline (MLX)           736.2          31.1×    This work
2B     Sequential (1 tok/dispatch)  23.7           1.0×     Prior work
2B     Batch (32 tok/dispatch)      172.7          7.3×     This work
2B     GPU baseline (MLX)           828.6          35.0×    This work

State Transfer Overhead

Model  Save   Read  Cache build  Total
0.8B   12 ms  9 ms  6 ms         27 ms
2B     11 ms  9 ms  6 ms         26 ms

Architecture

Pure ANE

Prompt tokens → ANE Batch Prefill (32 tok/dispatch; CPU: RMSNorm, RoPE, SSM, GQA) → ANE Sequential Decode (~12–18 tok/s)

Pure MLX (GPU)

Prompt tokens → GPU Prefill (Metal) → GPU Decode (Metal, ~61–94 tok/s)

Hybrid (This Work)

Prompt tokens → ANE Batch Prefill (32 tok/dispatch) → State Serialize (<30 ms) → MLX GPU Decode (~23–54 tok/s)

Concurrent Pipeline

Since ANE prefill draws only 0.22 W of GPU power, the GPU can simultaneously decode another request:

Request k+1:  ANE batch prefill · 149–268 tok/s · ~1.8 W
Request k:    GPU decode · ~90 tok/s · ~14 W

Total ~15.8 W, versus 76 W if both phases ran on the GPU: a 79% overall power reduction, all of it from moving prefill off the GPU.
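The arithmetic behind that figure, using the per-phase draws quoted in this section:

```python
# Power budget (W) for the concurrent pipeline, from the measurements above:
gpu_prefill = 62.0  # GPU prefill draw
gpu_decode = 14.0   # GPU decode draw
ane_prefill = 1.8   # ANE prefill draw

gpu_only = gpu_prefill + gpu_decode  # both phases on GPU: 76 W
hybrid = ane_prefill + gpu_decode    # prefill moved to ANE: 15.8 W
reduction = (gpu_only - hybrid) / gpu_only
print(f"{gpu_only:.1f} W -> {hybrid:.1f} W ({reduction:.0%} reduction)")
```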

Seamless Conversation Pipeline

In a single conversation, the ANE prefills the user's new input while the GPU is still decoding the previous response. The moment decode finishes, the next decode can begin immediately, with zero wait time between turns.

Time →
Turn 1: user sends "Hello"

ANE:   prefill "Hello"  |  idle               |  prefill Turn 2 input
GPU:   idle             |  decode response 1  |  decode response 2
User:  sends            |  reading response 1 |  sends Turn 2 input, reading...

While the GPU decodes Turn 1, the ANE has already prefilled Turn 2, so the next decode starts instantly.

Key insight: In GPU-only mode, the user waits for prefill + decode sequentially on every turn. With the hybrid pipeline, the ANE prefills the next input during the user's reading time, so TTFT for turn 2 onward is just the state transfer (~27 ms).

Multi-Turn Timing Comparison

Scenario: a 3-turn conversation with a 74-token input and a 100-token response per turn. Qwen3.5-0.8B on M2 Ultra.

User-perceived latency per turn. TTFT = time from send to first token. Hybrid rows overlap ANE prefill with the user's reading time.

Mode                            Prefill   Transfer  Decode (100 tok)  TTFT    Total/turn
Pure MLX (GPU only, every turn) 101 ms    —         1,562 ms          101 ms  1,663 ms
Hybrid, turn 1 (cold)           276 ms    27 ms     1,562 ms          303 ms  1,865 ms
Hybrid, turn 2+ (warm)          hidden*   27 ms     1,562 ms          27 ms   1,589 ms

*Prefill for turn N+1 runs on the ANE while the user reads the turn N response; by the time the user sends the next message, prefill is already complete. GPU decode runs at pure-MLX speed since it executes independently.

3-turn total (Pure MLX): 3 × 1,663 = 4,989 ms; TTFT every turn: 101 ms
3-turn total (Hybrid): 1,865 + 2 × 1,589 = 5,043 ms; TTFT turn 2+: 27 ms (3.7× faster)
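The per-turn totals can be reproduced directly from the component latencies in the timing table (all values in ms):

```python
# Component latencies (ms) from the multi-turn timing table:
mlx_prefill, transfer, decode = 101, 27, 1562
hybrid_prefill_cold = 276

pure_turn = mlx_prefill + decode                     # 1,663 ms every turn
cold_turn = hybrid_prefill_cold + transfer + decode  # 1,865 ms, turn 1 only
warm_turn = transfer + decode                        # 1,589 ms, turn 2+ (prefill hidden)

pure_3 = 3 * pure_turn                # 4,989 ms
hybrid_3 = cold_turn + 2 * warm_turn  # 5,043 ms
ttft_speedup = mlx_prefill / transfer
print(pure_3, hybrid_3, f"{ttft_speedup:.1f}x")
```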

Total time is comparable, but the hybrid delivers 3.7× faster TTFT from turn 2 onward: the user sees the first token almost instantly. Meanwhile, the GPU's thermal budget is preserved, since the ANE handles prefill at only ~1.8 W instead of 62 W.

Power Analysis

Power data are from our prior work on the same M2 Ultra, measured via powermetrics at 100 ms sampling intervals.

Table 3. Per-phase power consumption (Qwen3.5-0.8B FP16).

Phase                         GPU (W)  ANE (W)  CPU (W)  Total
GPU prefill (compute-bound)   62.05    0.00     2.34     64.39 W
GPU decode (bandwidth-bound)  14.17    0.00     2.98     17.15 W
ANE prefill (private API)     0.22     1.58     3.77     5.57 W
ANE pure decode               0.16     1.42     5.47     7.05 W

Mobile implication: on A-series devices (TDP ~3–8 W), GPU prefill at 62 W-class intensity would immediately trigger thermal throttling. ANE prefill at ~1.8 W fits within the mobile thermal envelope, enabling sustained inference without clock-speed degradation.
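A back-of-envelope energy-per-token comparison, derived by dividing the Table 3 totals by the long-prompt prefill throughputs (an inference from the published numbers, not a separate measurement):

```python
# Energy per prefill token (mJ/tok) = total watts / (tokens/s) * 1000.
# Power from Table 3, throughput from Table 1 (0.8B, 74-token prompt).
gpu_mj = 64.39 / 736.2 * 1000  # GPU prefill
ane_mj = 5.57 / 268.0 * 1000   # ANE batch prefill
print(f"GPU: {gpu_mj:.1f} mJ/tok, ANE: {ane_mj:.1f} mJ/tok "
      f"({gpu_mj / ane_mj:.1f}x less energy on ANE)")
```

So even though the GPU prefills faster in absolute terms, the ANE is roughly 4× more energy-efficient per token, which is the quantity that matters under a mobile thermal budget.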

Test Hardware

Machine    Mac Studio (2023)
Chip       Apple M2 Ultra
CPU        24 cores (16P + 8E)
GPU        76 cores
ANE        32 cores (31.6 TOPS)
Memory     192 GB unified
Bandwidth  800 GB/s
Metal      Metal 4
OS         macOS 26 (Tahoe)