Enabling concurrent ANE prefill and GPU decode on Apple Silicon via batched ANE dispatch with the private AppleNeuralEngine framework.
ANE batch prefill (32 tok/dispatch) reaches 268 tok/s, versus 23.7 tok/s for the sequential dispatch in our prior work, which makes prefill through the private API practical for interactive use.
At 13 tokens, ANE batch prefill (149 tok/s) outperforms GPU prefill (138 tok/s) on the 0.8B model. This short-prompt advantage matters for chatbots and code completion.
ANE prefill draws only 0.22 W of GPU power, leaving the GPU free to decode another request at ~90 tok/s at the same time, with no GPU contention between prefill and decode.
State serialization (save + read + cache construction) totals under 30 ms for both the 0.8B and 2B models, so the binary format does not bottleneck the hybrid pipeline.
All values are averaged over 10 runs with temperature = 0 and at most 20 decode tokens, on an M2 Ultra with 192 GB.
0.8B model:

| Prompt | Mode | Tokens | Prefill tok/s | Decode tok/s | Total ms |
|---|---|---|---|---|---|
| Short | ANE | 13 | 148.7 | 17.6 | 2,931 |
| Short | MLX | 13 | 138.2 | 61.2 | 3,679 |
| Short | Hybrid | 13 | 148.3 | 39.4 | 6,050 |
| Medium | ANE | 28 | 237.5 | 16.4 | 3,645 |
| Medium | MLX | 28 | 290.0 | 64.5 | 3,836 |
| Medium | Hybrid | 28 | 237.9 | 53.9 | 6,280 |
| Long | ANE | 74 | 268.0 | 17.1 | 3,774 |
| Long | MLX | 74 | 736.2 | 64.1 | 3,846 |
| Long | Hybrid | 74 | 271.1 | 54.0 | 6,335 |

2B model:

| Prompt | Mode | Tokens | Prefill tok/s | Decode tok/s | Total ms |
|---|---|---|---|---|---|
| Short | ANE | 13 | 91.7 | 12.4 | 5,271 |
| Short | MLX | 13 | 179.6 | 90.8 | 4,076 |
| Short | Hybrid | 13 | 92.8 | 22.6 | 8,878 |
| Medium | ANE | 28 | 160.4 | 12.5 | 6,189 |
| Medium | MLX | 28 | 358.0 | 94.1 | 4,212 |
| Medium | Hybrid | 28 | 159.3 | 29.0 | 9,281 |
| Long | ANE | 74 | 172.7 | 12.7 | 6,455 |
| Long | MLX | 74 | 828.6 | 92.4 | 4,274 |
| Long | Hybrid | 74 | 173.1 | 29.2 | 9,518 |

| Model | Dispatch mode | Prefill tok/s | Speedup | Source |
|---|---|---|---|---|
| 0.8B | Sequential (1 tok/dispatch) | 23.7 | 1.0× | Prior work |
| 0.8B | Batch (32 tok/dispatch) | 268.0 | 11.3× | This work |
| 0.8B | GPU baseline (MLX) | 736.2 | 31.1× | This work |
| 2B | Sequential (1 tok/dispatch) | 23.7 | 1.0× | Prior work |
| 2B | Batch (32 tok/dispatch) | 172.7 | 7.3× | This work |
| 2B | GPU baseline (MLX) | 828.6 | 35.0× | This work |
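
The speedup column and the number of ANE dispatches per prompt can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a partially filled final batch still costs one full dispatch, which the source does not state.

```python
import math

BATCH_TOKENS = 32          # tokens per ANE dispatch, from the table above
SEQUENTIAL_TOK_S = 23.7    # sequential-dispatch baseline from prior work


def dispatches(prompt_tokens: int, batch: int = BATCH_TOKENS) -> int:
    """Number of batched dispatches needed to cover a prompt.

    Assumption: a partial final batch counts as one dispatch.
    """
    return math.ceil(prompt_tokens / batch)


def speedup(tok_s: float, baseline: float = SEQUENTIAL_TOK_S) -> float:
    """Throughput relative to sequential dispatch, rounded as in the table."""
    return round(tok_s / baseline, 1)


if __name__ == "__main__":
    for tokens in (13, 28, 74):  # the short, medium, and long prompts
        print(tokens, "tokens ->", dispatches(tokens), "dispatch(es)")
    for label, tok_s in (("0.8B batch", 268.0), ("2B batch", 172.7)):
        print(label, speedup(tok_s), "x")
```

Under that assumption, the 13- and 28-token prompts fit in a single dispatch while the 74-token prompt needs three.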

| Model | Save | Read | Cache construction | Total |
|---|---|---|---|---|
| 0.8B | 12 ms | 9 ms | 6 ms | 27 ms |
| 2B | 11 ms | 9 ms | 6 ms | 26 ms |
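
To make the three measured steps (save, read, cache construction) concrete, here is a minimal sketch of a flat binary state file. The layout, the `ANES` magic tag, and every function name are assumptions made for illustration. The source does not describe its actual binary format, and this is not it.

```python
import os
import struct
import tempfile
from array import array

MAGIC = b"ANES"  # hypothetical 4-byte tag, not the project's real format


def save_state(path, layers):
    """Save: write magic, layer count, then (length, float32 data) per layer."""
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<I", len(layers)))
        for layer in layers:
            f.write(struct.pack("<I", len(layer)))
            # array("f") uses native byte order (little-endian on Apple Silicon).
            f.write(array("f", layer).tobytes())


def read_state(path):
    """Read: load every layer back as an array of float32 values."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a state file")
        (n_layers,) = struct.unpack("<I", f.read(4))
        layers = []
        for _ in range(n_layers):
            (n,) = struct.unpack("<I", f.read(4))
            buf = array("f")
            buf.frombytes(f.read(buf.itemsize * n))
            layers.append(buf)
    return layers


def build_cache(layers):
    """Cache construction: index the loaded layers for a (hypothetical) decoder."""
    return {index: list(layer) for index, layer in enumerate(layers)}


def roundtrip_demo():
    """Save, read, and rebuild a tiny two-layer state via a temporary file."""
    state = [[0.5, 1.0, -2.25], [4.0]]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        save_state(path, state)
        return build_cache(read_state(path))
    finally:
        os.remove(path)
```

A length-prefixed flat layout like this needs no parsing beyond fixed-size headers, which is one plausible reason a binary format would stay in the tens of milliseconds.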
Because ANE prefill draws only 0.22 W of GPU power, the GPU can decode another request at the same time.
Running ANE prefill alongside GPU decode draws about 15.8 W in total, versus about 76 W if both phases ran on the GPU. That is a 79% power reduction, all of it from moving the prefill component off the GPU.
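
One way to reproduce these figures from the per-phase power table in this section is shown below. It is a sketch under an assumption: that the ~15.8 W total sums the ANE rail during ANE prefill with the GPU rail during GPU decode, and that the ~76 W total sums the GPU rail for both GPU phases. The source does not spell out the breakdown.

```python
# Per-rail power in watts, copied from the powermetrics table in this section.
GPU_PREFILL_GPU_W = 62.05   # GPU rail during GPU prefill
GPU_DECODE_GPU_W = 14.17    # GPU rail during GPU decode
ANE_PREFILL_ANE_W = 1.58    # ANE rail during ANE prefill


def concurrent_power():
    """Return (hybrid watts, GPU-only watts, fractional reduction).

    Assumed breakdown: hybrid = ANE prefill + GPU decode,
    GPU-only = GPU prefill + GPU decode.
    """
    hybrid = ANE_PREFILL_ANE_W + GPU_DECODE_GPU_W
    gpu_only = GPU_PREFILL_GPU_W + GPU_DECODE_GPU_W
    reduction = 1 - hybrid / gpu_only
    return hybrid, gpu_only, reduction


if __name__ == "__main__":
    hybrid, gpu_only, reduction = concurrent_power()
    print(f"hybrid {hybrid:.2f} W, GPU-only {gpu_only:.2f} W, "
          f"reduction {reduction:.1%}")
```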
In a single conversation, the ANE prefills the user's new input while the GPU is still decoding the previous response. The moment that decode finishes, the next decode can begin immediately, with no wait between turns.
Key insight: In GPU-only mode, the user must wait for prefill and then decode, sequentially, on every turn. With the hybrid pipeline, the ANE prefills the next input during the user's reading time, so TTFT for turn 2 onward is just the state transfer (~27 ms).
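
The overlap described above can be sketched with a single background worker standing in for the ANE. Everything here is a stand-in: `ane_prefill` and `gpu_decode` are hypothetical placeholders that sleep instead of calling the private framework or MLX, and the sketch simplifies by treating each turn's prompt as known in advance and independent of the previous response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ane_prefill(prompt_tokens):
    """Hypothetical stand-in for batched ANE prefill; returns an opaque state."""
    time.sleep(0.02)  # pretend prefill work
    return {"n_prompt": len(prompt_tokens)}


def gpu_decode(state, max_tokens):
    """Hypothetical stand-in for GPU decode; emits dummy token ids."""
    time.sleep(0.05)  # pretend decode work
    return [state["n_prompt"]] * max_tokens


def run_conversation(turns, max_tokens=4):
    """Decode turn N on the 'GPU' while the 'ANE' worker prefills turn N+1."""
    responses = []
    with ThreadPoolExecutor(max_workers=1) as ane:
        pending = ane.submit(ane_prefill, turns[0])  # cold start: must wait
        for index in range(len(turns)):
            state = pending.result()  # already finished for warm turns
            if index + 1 < len(turns):
                # Start the next prefill before decoding the current turn.
                pending = ane.submit(ane_prefill, turns[index + 1])
            responses.append(gpu_decode(state, max_tokens))
    return responses
```

Because the stand-in decode takes longer than the stand-in prefill, every `pending.result()` after the first returns without blocking, which mirrors the warm-turn behavior described in the text.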
Scenario: a 3-turn conversation, each turn with a 74-token input and a 100-token response, running Qwen3.5-0.8B on an M2 Ultra.

| Mode | Turn | Prefill | Transfer | Decode (100 tok) | TTFT | Total/turn |
|---|---|---|---|---|---|---|
| Pure MLX (GPU only, sequential) | Every turn | 101 ms | n/a | 1,562 ms | 101 ms | 1,663 ms |
| Hybrid concurrent (ANE prefill overlapped with user reading) | Turn 1 (cold) | 276 ms | 27 ms | 1,562 ms | 303 ms | 1,865 ms |
| Hybrid concurrent (ANE prefill overlapped with user reading) | Turn 2+ (warm) | hidden* | 27 ms | 1,562 ms | 27 ms | 1,589 ms |

*Prefill for turn N+1 runs on the ANE while the user reads the response to turn N, so it is complete by the time the user sends the next message. GPU decode runs at pure-MLX speed because it runs independently.*
3-turn total (Pure MLX):
3 × 1,663 = 4,989 ms
TTFT every turn: 101 ms
3-turn total (Hybrid):
1,865 + 2 × 1,589 = 5,043 ms
TTFT turn 2+: 27 ms (3.7× faster)
Total time is comparable (the hybrid pipeline is 54 ms slower over three turns), but hybrid delivers a 3.7× faster TTFT from turn 2 onward, so the user sees the first token almost instantly. The GPU's thermal budget is also preserved, since the ANE handles prefill at only about 1.8 W instead of 62 W.
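
The per-turn and 3-turn figures follow from simple addition over the scenario table. The sketch below reproduces them. The constants are copied from that table, while the function names and the rule that warm-turn prefill contributes zero visible latency are a simplified model, not the authors' measurement code.

```python
# Per-phase latencies in milliseconds, copied from the scenario table above.
GPU_PREFILL_MS = 101
ANE_PREFILL_MS = 276
TRANSFER_MS = 27
DECODE_MS = 1562


def pure_mlx(turns):
    """GPU-only: every turn pays prefill, then decode, sequentially."""
    ttft = [GPU_PREFILL_MS] * turns
    total = turns * (GPU_PREFILL_MS + DECODE_MS)
    return ttft, total


def hybrid(turns):
    """Hybrid: only the first (cold) turn exposes ANE prefill latency."""
    ttft = []
    total = 0
    for turn in range(turns):
        visible_prefill = ANE_PREFILL_MS if turn == 0 else 0  # hidden when warm
        first_token = visible_prefill + TRANSFER_MS
        ttft.append(first_token)
        total += first_token + DECODE_MS
    return ttft, total


if __name__ == "__main__":
    print("pure MLX:", pure_mlx(3))
    print("hybrid:  ", hybrid(3))
    print("warm TTFT speedup:", round(GPU_PREFILL_MS / TRANSFER_MS, 1))
```

The model makes the trade-off explicit: the hybrid pipeline pays a slower cold start once, then exposes only the transfer cost on every later turn.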
Power data come from our prior work on the same M2 Ultra, measured via powermetrics at 100 ms sampling intervals.

| Phase | GPU (W) | ANE (W) | CPU (W) | Total (W) |
|---|---|---|---|---|
| GPU prefill (compute-bound) | 62.05 | 0.00 | 2.34 | 64.39 |
| GPU decode (bandwidth-bound) | 14.17 | 0.00 | 2.98 | 17.15 |
| ANE prefill (private API) | 0.22 | 1.58 | 3.77 | 5.57 |
| ANE pure decode | 0.16 | 1.42 | 5.47 | 7.05 |

Mobile implication: On A-series devices (TDP ~3–8 W), a GPU prefill load like the 62 W measured here would immediately trigger thermal throttling. ANE prefill at ~1.8 W fits within the mobile thermal envelope, enabling sustained inference without clock-speed degradation.

| Component | Specification |
|---|---|
| Machine | Mac Studio (2023) |
| Chip | Apple M2 Ultra |
| CPU | 24 cores (16P + 8E) |
| GPU | 76 cores |
| ANE | 32 cores (31.6 TOPS) |
| Memory | 192 GB unified |
| Bandwidth | 800 GB/s |
| Metal | Metal 4 |
| OS | macOS 26 (Tahoe) |