ANE Batch Prefill for On-Device Parallel LLM Inference

Enabling concurrent ANE prefill and GPU decode on Apple Silicon via batched ANE dispatch with the private AppleNeuralEngine framework.

M2 Ultra · 76-core GPU · 32-core ANE · Qwen3.5-0.8B / 2B · ANE_SPATIAL = 32 · Companion to hybrid-ane-mlx-bench
11.3×    prefill speedup vs sequential dispatch
268      tok/s batch prefill (Qwen3.5-0.8B)
<30 ms   state transfer overhead
282×     GPU power reduction during ANE prefill

Key Findings

11.3× Prefill Speedup

ANE batch prefill (32 tokens per dispatch) reaches 268 tok/s, versus 23.7 tok/s for the sequential dispatch of our prior work, making private-API prefill practical for interactive use.
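The gain is consistent with a fixed per-dispatch cost being amortized across the batch. The sketch below back-solves that cost from the two measured throughputs; the implied per-dispatch overhead and per-token compute are inferences from the published numbers, not direct measurements:

```python
# Measured prefill throughputs (tok/s) for Qwen3.5-0.8B, 74-token prompt:
seq_tps = 23.7     # sequential dispatch, 1 token per ANE call (prior work)
batch_tps = 268.0  # batched dispatch, 32 tokens per ANE call (this work)
B = 32

# Model: each dispatch costs a fixed overhead h plus c per token (both in ms):
#   1-token dispatch:  h + c     = 1000 / seq_tps
#   B-token dispatch:  h + B * c = B * 1000 / batch_tps
t1 = 1000 / seq_tps
tB = B * 1000 / batch_tps
c = (tB - t1) / (B - 1)  # implied per-token compute, ms
h = t1 - c               # implied fixed dispatch overhead, ms

print(f"speedup: {batch_tps / seq_tps:.1f}x")  # 11.3x
print(f"implied overhead ~{h:.0f} ms/dispatch, ~{c:.1f} ms/token")
```

Under this model the fixed dispatch overhead dominates sequential prefill, which is exactly why batching 32 tokens per call recovers an order of magnitude.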

ANE Beats GPU on Short Prompts

At 13 tokens, ANE batch prefill (149 tok/s) outperforms GPU prefill (138 tok/s) for the 0.8B model. This short-prompt advantage matters for chatbots and code completion.
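One practical consequence is a length-based router that sends short prompts to the ANE and longer ones to the GPU. A minimal sketch using the Table 1 throughputs (the crossover point will differ per model and chip):

```python
# Prefill throughput (tok/s) by prompt length, 0.8B, from Table 1:
ane = {13: 148.7, 28: 237.5, 74: 268.0}
mlx = {13: 138.2, 28: 290.0, 74: 736.2}

# Route each prompt length to whichever backend prefills it faster:
route = {n: ("ANE" if ane[n] > mlx[n] else "GPU") for n in ane}
print(route)  # ANE wins only at the shortest prompt
```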

Concurrent Pipeline Feasibility

ANE prefill draws only 0.22 W of GPU power, leaving the GPU free to simultaneously decode another request at ~90 tok/s, with zero GPU contention between prefill and decode.
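The zero-contention claim means the two phases behave like independent execution streams. A toy sketch with `time.sleep` stand-ins (the function bodies and durations are placeholders, not the real engines) shows wall-clock time approaching max(prefill, decode) rather than their sum:

```python
import threading
import time

def ane_prefill():
    """Placeholder for ANE batch prefill of request k+1."""
    time.sleep(0.2)

def gpu_decode():
    """Placeholder for MLX GPU decode of request k."""
    time.sleep(0.3)

start = time.perf_counter()
threads = [threading.Thread(target=ane_prefill),
           threading.Thread(target=gpu_decode)]
for t in threads:
    t.start()
for t in threads:
    t.join()
wall = time.perf_counter() - start

# With no shared resource, total time is ~max(0.2, 0.3) s, not 0.5 s.
print(f"wall time: {wall:.2f} s")
```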

Negligible Transfer Overhead

State serialization (save + read + cache construction) totals under 30 ms for both the 0.8B and 2B models; the binary format does not bottleneck the hybrid pipeline.

Benchmark Results

All values are averaged over 10 runs, temperature = 0, max 20 decode tokens. M2 Ultra, 192 GB.
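The averaging protocol can be expressed as a small timing harness; the `run` callable below is a placeholder for one full prefill-plus-decode pass, and the warmup count is an assumption rather than the benchmark's documented setting:

```python
import statistics
import time

def bench(run, n=10, warmup=1):
    """Mean wall-clock seconds of run() over n timed iterations after warmup."""
    for _ in range(warmup):
        run()
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        run()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times)

# Demo with a trivial workload standing in for an inference pass:
mean_s = bench(lambda: sum(range(1000)), n=10)
print(f"mean: {mean_s * 1e6:.1f} us")
```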

Qwen3.5-0.8B FP16

Table 1. Throughput comparison: Pure ANE, Pure MLX (GPU), and Hybrid (ANE prefill + MLX decode).

Prompt   Mode    Tokens  Prefill tok/s  Decode tok/s  Total ms
Short    ANE     13      148.7          17.6          2,931
Short    MLX     13      138.2          61.2          3,679
Short    Hybrid  13      148.3          39.4          6,050
Medium   ANE     28      237.5          16.4          3,645
Medium   MLX     28      290.0          64.5          3,836
Medium   Hybrid  28      237.9          53.9          6,280
Long     ANE     74      268.0          17.1          3,774
Long     MLX     74      736.2          64.1          3,846
Long     Hybrid  74      271.1          54.0          6,335

Qwen3.5-2B BF16

Table 2. Throughput comparison for the larger 2B model.

Prompt   Mode    Tokens  Prefill tok/s  Decode tok/s  Total ms
Short    ANE     13      91.7           12.4          5,271
Short    MLX     13      179.6          90.8          4,076
Short    Hybrid  13      92.8           22.6          8,878
Medium   ANE     28      160.4          12.5          6,189
Medium   MLX     28      358.0          94.1          4,212
Medium   Hybrid  28      159.3          29.0          9,281
Long     ANE     74      172.7          12.7          6,455
Long     MLX     74      828.6          92.4          4,274
Long     Hybrid  74      173.1          29.2          9,518

Prefill Throughput: 0.8B

[Bar chart: prefill tok/s by prompt length, same values as Table 1 — ANE 148.7 / 237.5 / 268.0, MLX 138.2 / 290.0 / 736.2, Hybrid 148.3 / 237.9 / 271.1 for Short / Medium / Long.]

Prefill Throughput: 2B

[Bar chart: prefill tok/s by prompt length, same values as Table 2 — ANE 91.7 / 160.4 / 172.7, MLX 179.6 / 358.0 / 828.6, Hybrid 92.8 / 159.3 / 173.1 for Short / Medium / Long.]

Batch vs Sequential Dispatch (74 tokens)

Model  Dispatch mode                Prefill tok/s  Speedup  Source
0.8B   Sequential (1 tok/dispatch)  23.7           1.0×     Prior work
0.8B   Batch (32 tok/dispatch)      268.0          11.3×    This work
0.8B   GPU baseline (MLX)           736.2          31.1×    This work
2B     Sequential (1 tok/dispatch)  23.7           1.0×     Prior work
2B     Batch (32 tok/dispatch)      172.7          7.3×     This work
2B     GPU baseline (MLX)           828.6          35.0×    This work

State Transfer Overhead

Model  Save   Read  Cache build  Total
0.8B   12 ms  9 ms  6 ms         27 ms
2B     11 ms  9 ms  6 ms         26 ms

Architecture

Pure ANE

Prompt tokens → ANE Batch Prefill (32 tok/dispatch; CPU: RMSNorm, RoPE, SSM, GQA) → ANE Sequential Decode (~12–18 tok/s)

Pure MLX (GPU)

Prompt tokens → GPU Prefill (Metal) → GPU Decode (Metal, ~61–94 tok/s)

Hybrid (This Work)

Prompt tokens → ANE Batch Prefill (32 tok/dispatch) → State Serialize (<30 ms) → MLX GPU Decode (~23–54 tok/s)

Concurrent Pipeline

Since ANE prefill draws only 0.22 W of GPU power, the GPU can simultaneously decode another request:

Request k+1:  ANE batch prefill · 149–268 tok/s · ~1.8 W
Request k:    GPU decode · ~90 tok/s · ~14 W

Total ~15.8 W, versus 76 W if both phases ran on the GPU: a 79% overall power reduction, all of it from moving prefill off the GPU.
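The arithmetic behind that figure, using the per-phase draws quoted in this section:

```python
# Power budget (W) for the concurrent pipeline, from the measurements above:
gpu_prefill = 62.0  # GPU prefill draw
gpu_decode = 14.0   # GPU decode draw
ane_prefill = 1.8   # ANE prefill draw

gpu_only = gpu_prefill + gpu_decode  # both phases on GPU: 76 W
hybrid = ane_prefill + gpu_decode    # prefill moved to ANE: 15.8 W
reduction = (gpu_only - hybrid) / gpu_only
print(f"{gpu_only:.1f} W -> {hybrid:.1f} W ({reduction:.0%} reduction)")
```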

Seamless Conversation Pipeline

In a single conversation, the ANE prefills the user's new input while the GPU is still decoding the previous response. The moment decode finishes, the next decode can begin immediately, with zero wait time between turns.

Time →
Turn 1: user sends "Hello"

ANE:   prefill "Hello"  |  idle               |  prefill Turn 2 input
GPU:   idle             |  decode response 1  |  decode response 2
User:  sends            |  reading response 1 |  sends Turn 2 input, reading...

While the GPU decodes Turn 1, the ANE has already prefilled Turn 2, so the next decode starts instantly.

Key insight: In GPU-only mode, the user waits for prefill + decode sequentially on every turn. With the hybrid pipeline, the ANE prefills the next input during the user's reading time, so TTFT for turn 2 onward is just the state transfer (~27 ms).

Multi-Turn Timing Comparison

Scenario: a 3-turn conversation with a 74-token input and a 100-token response per turn. Qwen3.5-0.8B on M2 Ultra.

User-perceived latency per turn. TTFT = time from send to first token. Hybrid rows overlap ANE prefill with the user's reading time.

Mode                            Prefill   Transfer  Decode (100 tok)  TTFT    Total/turn
Pure MLX (GPU only, every turn) 101 ms    —         1,562 ms          101 ms  1,663 ms
Hybrid, turn 1 (cold)           276 ms    27 ms     1,562 ms          303 ms  1,865 ms
Hybrid, turn 2+ (warm)          hidden*   27 ms     1,562 ms          27 ms   1,589 ms

*Prefill for turn N+1 runs on the ANE while the user reads the turn N response; by the time the user sends the next message, prefill is already complete. GPU decode runs at pure-MLX speed since it executes independently.

3-turn total (Pure MLX): 3 × 1,663 = 4,989 ms; TTFT every turn: 101 ms
3-turn total (Hybrid): 1,865 + 2 × 1,589 = 5,043 ms; TTFT turn 2+: 27 ms (3.7× faster)
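The per-turn totals can be reproduced directly from the component latencies in the timing table (all values in ms):

```python
# Component latencies (ms) from the multi-turn timing table:
mlx_prefill, transfer, decode = 101, 27, 1562
hybrid_prefill_cold = 276

pure_turn = mlx_prefill + decode                     # 1,663 ms every turn
cold_turn = hybrid_prefill_cold + transfer + decode  # 1,865 ms, turn 1 only
warm_turn = transfer + decode                        # 1,589 ms, turn 2+ (prefill hidden)

pure_3 = 3 * pure_turn                # 4,989 ms
hybrid_3 = cold_turn + 2 * warm_turn  # 5,043 ms
ttft_speedup = mlx_prefill / transfer
print(pure_3, hybrid_3, f"{ttft_speedup:.1f}x")
```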

Total time is comparable, but the hybrid delivers 3.7× faster TTFT from turn 2 onward: the user sees the first token almost instantly. Meanwhile, the GPU's thermal budget is preserved, since the ANE handles prefill at only ~1.8 W instead of 62 W.

Power Analysis

Power data are from our prior work on the same M2 Ultra, measured via powermetrics at 100 ms sampling intervals.

Table 3. Per-phase power consumption (Qwen3.5-0.8B FP16).

Phase                         GPU (W)  ANE (W)  CPU (W)  Total
GPU prefill (compute-bound)   62.05    0.00     2.34     64.39 W
GPU decode (bandwidth-bound)  14.17    0.00     2.98     17.15 W
ANE prefill (private API)     0.22     1.58     3.77     5.57 W
ANE pure decode               0.16     1.42     5.47     7.05 W

Mobile implication: on A-series devices (TDP ~3–8 W), GPU prefill at 62 W-class intensity would immediately trigger thermal throttling. ANE prefill at ~1.8 W fits within the mobile thermal envelope, enabling sustained inference without clock-speed degradation.
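A back-of-envelope energy-per-token comparison, derived by dividing the Table 3 totals by the long-prompt prefill throughputs (an inference from the published numbers, not a separate measurement):

```python
# Energy per prefill token (mJ/tok) = total watts / (tokens/s) * 1000.
# Power from Table 3, throughput from Table 1 (0.8B, 74-token prompt).
gpu_mj = 64.39 / 736.2 * 1000  # GPU prefill
ane_mj = 5.57 / 268.0 * 1000   # ANE batch prefill
print(f"GPU: {gpu_mj:.1f} mJ/tok, ANE: {ane_mj:.1f} mJ/tok "
      f"({gpu_mj / ane_mj:.1f}x less energy on ANE)")
```

So even though the GPU prefills faster in absolute terms, the ANE is roughly 4× more energy-efficient per token, which is the quantity that matters under a mobile thermal budget.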

Test Hardware

Machine    Mac Studio (2023)
Chip       Apple M2 Ultra
CPU        24 cores (16P + 8E)
GPU        76 cores
ANE        32 cores (31.6 TOPS)
Memory     192 GB unified
Bandwidth  800 GB/s
Metal      Metal 4
OS         macOS 26 (Tahoe)