Apple M2 Ultra · MLX · Metal 4 · March 2026

H₂O Attention-Score KV Cache Eviction
for On-Device LLM Inference

Zero-overhead attention-score export from fused Metal kernels, with 2–3× lower quality degradation than a sliding window.

+0.9%
Best PPL increase
(H₂O-heavy, 8B)
+2.7%
Rotating PPL increase
(sliding window, 8B)
<0.3%
Speed overhead
(desktop, 200 tok)
16×
Memory compression
(H₂O + 8-bit KV quant)

Key Findings

In plain language: When an AI model generates text, it remembers all previous words in a "KV cache." On phones and laptops memory is limited, so the cache can't hold everything. The old approach (a sliding window) simply forgets everything except the most recent words. Our approach is smarter: it tracks which words the model actually pays attention to, keeps those important words, and forgets the less useful ones. The result? The AI stays almost as smart as with unlimited memory, while using far less.
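To make the selection rule concrete, here is a minimal pure-Python sketch of an H₂O-style keep set: attention sinks, plus a recent window, plus the top-scoring "heavy hitters" in between. The function name and the tiny budget sizes are illustrative, not the mlx-lm implementation.

```python
# Illustrative sketch of H₂O-style KV cache eviction (not the actual
# mlx-lm code). `scores` holds one accumulated attention score per
# cached token; we keep sinks + recent window + top-scoring middle tokens.
def h2o_keep_indices(scores, sink=4, heavy=2, recent=2):
    n = len(scores)
    keep = set(range(min(sink, n)))            # attention sinks (first tokens)
    keep |= set(range(max(0, n - recent), n))  # recent window
    middle = [i for i in range(n) if i not in keep]
    middle.sort(key=lambda i: scores[i], reverse=True)
    keep |= set(middle[:heavy])                # heavy hitters by score
    return sorted(keep)

# Tokens 5 and 7 carry high accumulated attention, so they survive even
# though a pure sliding window would have evicted them.
print(h2o_keep_indices([9.0, 8.0, 7.0, 6.0, 0.1, 5.0, 0.2, 4.0, 0.3, 0.4]))
# → [0, 1, 2, 3, 5, 7, 8, 9]
```

A sliding window with the same total budget would instead keep only the last 8 positions, discarding tokens 5 and 7 regardless of how much attention they receive.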

2–3×

Better Than Sliding Window

H₂O-heavy consistently achieves 2–3× lower quality degradation than RotatingKVCache across all Qwen3 models.

0 FLOPs

Zero Compute Overhead零计算开销

Scores are exported via a single conditional float store in the fused Metal kernel. No extra matrix multiplication.

+0.4%

Near-Lossless (20-sample)

H₂O-heavy with the optimal budget shows only a +0.4% PPL increase in the 20-sample evaluation (vs. +2.4% for Rotating).

Py + Swift

Cross-Platform

Full implementation in both Python (mlx-lm) and Swift (mlx-swift-lm). No model-specific changes needed.

Benchmark Results

What to look for: "Perplexity" (PPL) measures how confused the model is; lower is better. The "Baseline" column is the gold standard (unlimited memory), and we want H₂O's numbers as close to Baseline as possible. Notice that H₂O-heavy is always much closer to Baseline than Rotating. Speed should stay roughly the same, and it does (<0.3% difference).
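For reference, PPL is the exponential of the average negative log-likelihood per token, and the Δ% columns are plain relative increases. A small sketch (these helper names are ours, not mlx-lm's):

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(-mean log-probability): 1.0 is perfect, higher = more confused
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ppl_increase(ppl, baseline_ppl):
    # Relative increase over the unlimited-memory baseline, in percent
    return 100.0 * (ppl - baseline_ppl) / baseline_ppl

# Rotating on Qwen3-8B-4bit: 6.43 vs. baseline 6.26
print(round(ppl_increase(6.43, 6.26), 1))  # → 2.7
```

Small rounding differences are expected when Δ% is computed from the unrounded perplexities rather than the two-decimal values shown in the tables.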

3 Qwen3 models × 4 strategies × 3 runs. max_kv_size=256, 10 PPL samples, seq_len=512.

Perplexity (lower is better)

| Model | Baseline | Rotating | H₂O-balanced | H₂O-heavy |
|---|---|---|---|---|
| Qwen3-4B-4bit | 7.70 | 7.94 (+3.2%) | 7.86 (+2.1%) | 7.81 (+1.4%) |
| Qwen3-8B-4bit | 6.26 | 6.43 (+2.7%) | 6.35 (+1.4%) | 6.32 (+0.9%) |
| Qwen3-8B-3bit | 7.62 | 7.89 (+3.5%) | 7.77 (+2.0%) | 7.73 (+1.5%) |

Generation Speed (tokens/sec, 3-run avg)

| Model | Baseline | Rotating | H₂O-balanced | H₂O-heavy |
|---|---|---|---|---|
| Qwen3-4B-4bit | 102.6 | 103.2 (+0.6%) | 102.5 (−0.1%) | 102.3 (−0.3%) |
| Qwen3-8B-4bit | 113.0 | 112.9 (−0.1%) | 112.7 (−0.3%) | 112.7 (−0.3%) |
| Qwen3-8B-3bit | 112.3 | 112.3 (−0.0%) | 112.2 (−0.1%) | 112.2 (−0.1%) |

Peak Memory (GB)

| Model | Baseline | Rotating | H₂O-balanced | H₂O-heavy |
|---|---|---|---|---|
| Qwen3-4B-4bit | 2.19 | 2.17 | 2.29 | 2.29 |
| Qwen3-8B-4bit | 4.37 | 4.35 | 4.47 | 4.47 |
| Qwen3-8B-3bit | 3.43 | 3.39 | 3.52 | 3.52 |

Architecture

The trick: Modern AI chips use a "fused kernel" that computes attention in one shot for speed. The problem? It calculates how much each word matters (the "score") but immediately throws it away. We modified this kernel to save the score with just one extra instruction, like photographing a receipt before discarding it. This costs essentially nothing, yet provides exactly the information needed to decide which words to keep.

Standard SDPA Path

Q, K, V
sdpa_vector (Metal)
Attention Output
Scores discarded

H₂O Path (+1 store)

Q, K, V
sdpa_vector + scores_out
↓   ↓
Output
Scores
H₂O Eviction

Budget Allocation Ablation

The question: Given a fixed memory budget, should we keep more "important old words" or more "recent words"? Answer: keeping more important words wins. The "H₂O-heavy" row below splits the budget roughly in half between high-attention words and recent ones; it incurs only +0.4% quality loss, 6× better than the sliding window (+2.4%).
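The h/r splits in the table below follow one line of arithmetic: sink is fixed at 4 and recent = max_kv − sink − heavy. A sketch (the function name is illustrative):

```python
# Budget split used in the ablation: `sink` tokens are always kept,
# `heavy_frac` of max_kv goes to heavy hitters, the rest to recency.
def h2o_budgets(max_kv, heavy_frac=0.5, sink=4):
    heavy = int(max_kv * heavy_frac)
    return {"sink": sink, "heavy": heavy, "recent": max_kv - sink - heavy}

print(h2o_budgets(256, heavy_frac=0.5))    # H₂O-heavy    → h=128, r=124
print(h2o_budgets(256, heavy_frac=0.25))   # H₂O-balanced → h=64,  r=188
print(h2o_budgets(256, heavy_frac=0.125))  # H₂O-recent   → h=32,  r=220
```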

Qwen3-4B-4bit, 20 samples, max_kv=256, sink=4.

| Strategy | Heavy h | Recent r | PPL | ΔPPL% |
|---|---|---|---|---|
| Baseline | – | – | 9.082 | – |
| Rotating | – | 252 | 9.302 | +2.4% |
| H₂O-recent | 32 | 220 | 9.189 | +1.2% |
| H₂O-balanced | 64 | 188 | 9.184 | +1.1% |
| H₂O-heavy | 128 | 124 | 9.119 | +0.4% |

Scaling Experiments

The big picture: We tested two questions: (1) How does quality change as the model gets more or less memory? (2) How does quality hold up as conversations get longer? The answer: H₂O always beats the sliding window, and the advantage is biggest when memory is tight or conversations are long, exactly the scenarios that matter most on phones.

KV Size Sweep (Qwen3-8B-4bit, seq=2048)

| max_kv | Baseline PPL | Rotating (Δ%) | H₂O-heavy (Δ%) |
|---|---|---|---|
| 128 | 3.555 | 4.632 (+30.3%) | 4.496 (+26.5%) |
| 256 | 3.555 | 4.231 (+19.0%) | 4.028 (+13.3%) |
| 512 | 3.555 | 3.768 (+6.0%) | 3.764 (+5.9%) |
| 1024 | 3.555 | 3.617 (+1.8%) | 3.600 (+1.3%) |
| 2048 | 3.555 | 3.555 (+0.0%) | 3.555 (+0.0%) |

Quality–memory tradeoff: H₂O-heavy stays consistently below Rotating, with the biggest gap at small cache sizes.

Sequence Length Sweep (Qwen3-8B-4bit, max_kv=256)

| seq_len | Baseline PPL | Rotating (Δ%) | H₂O-heavy (Δ%) |
|---|---|---|---|
| 512 | 5.089 | 5.183 (+1.8%) | 5.092 (+0.1%) |
| 1024 | 5.579 | 6.133 (+9.9%) | 5.967 (+7.0%) |
| 2048 | 3.555 | 4.231 (+19.0%) | 4.028 (+13.3%) |
| 4096 | 3.381 | 4.223 (+24.9%) | 4.045 (+19.7%) |
| 8192 | 3.579 | 4.595 (+28.4%) | 4.458 (+24.6%) |
| 16384 | 3.449 | 4.600 (+33.4%) | 4.462 (+29.4%) |
| 32768 | 3.102 | 4.272 (+37.7%) | 4.233 (+36.5%) |

512 to 32K tokens (max_kv=256). H₂O-heavy maintains its advantage at all lengths, largest at 16K (−4pp). At 512 tokens, H₂O is near-lossless (+0.1%).

H₂O + KV Quantization

Double the savings for free: Eviction shrinks the number of cached words; quantization shrinks each word's stored representation. The two multiply: 8× fewer entries × 2× smaller per entry = 16× total compression with zero additional quality loss (8-bit). It's like decluttering your closet and then vacuum-sealing what's left.
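The compression figures multiply two independent ratios: entries kept, and bits per entry. A back-of-the-envelope sketch (the ~ in the table reflects quantization metadata such as per-group scales, which this sketch ignores):

```python
# Approximate KV compression = (entries ratio) × (bits-per-entry ratio).
def kv_compression(ctx_len, max_kv, cache_bits=16.0, base_bits=16.0):
    return (ctx_len / max_kv) * (base_bits / cache_bits)

# Average bits per entry for a mixed cache: 8-bit heavy + 4-bit recent.
def mixed_bits(heavy, recent, heavy_bits=8.0, recent_bits=4.0):
    return (heavy * heavy_bits + recent * recent_bits) / (heavy + recent)

print(kv_compression(2048, 256))                # FP16 eviction only → 8.0
print(kv_compression(2048, 256, cache_bits=8))  # + 8-bit KV         → 16.0
print(kv_compression(2048, 256, cache_bits=4))  # + 4-bit KV         → 32.0
```

With the H₂O-heavy split (h=128, r=124), the mixed 8h+4r average works out to about 6 bits per entry, giving the ~21× figure in the table.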

| Strategy | PPL | ΔPPL% | Compression |
|---|---|---|---|
| Baseline (unlimited, FP16) | 6.261 | – | – |
| Rotating (256, FP16) | 6.429 | +2.7% | ~8× |
| H₂O-heavy (256, FP16) | 6.320 | +0.9% | ~8× |
| H₂O-heavy (256, 8-bit) | 6.327 | +1.0% | ~16× |
| H₂O-heavy (256, mixed 8h+4r) | 6.420 | +2.5% | ~21× |
| H₂O-heavy (256, 4-bit) | 6.600 | +5.4% | ~32× |

Quantization Across Cache Sizes

Qwen3-8B-4bit, seq=2048, 5 samples. Baseline PPL = 3.555.

| max_kv | Rotating | H₂O FP16 | H₂O 8-bit | H₂O 4-bit |
|---|---|---|---|---|
| 256 | 4.231 (+19.0%) | 4.028 (+13.3%) | 4.042 (+13.7%) | 4.193 (+17.9%) |
| 512 | 3.768 (+6.0%) | 3.764 (+5.9%) | 3.762 (+5.8%) | 3.974 (+11.8%) |
| 1024 | 3.617 (+1.8%) | 3.600 (+1.3%) | 3.602 (+1.3%) | 3.782 (+6.4%) |
0%

8-bit Is Free At All Sizes

8-bit KV quantization tracks FP16 H₂O within 0.4pp at every cache size (256, 512, 1024). At 512 and above, 8-bit is statistically indistinguishable from FP16.

21×

Mixed Precision: Best of Both

8-bit for heavy hitters plus 4-bit for recent tokens: +2.5% PPL at ~21× compression, 24% less memory than uniform 8-bit with only +1.5pp extra degradation. On Qwen3-4B, mixed precision (+2.1%) shows 8.4× less degradation than uniform 4-bit (+17.3%).

32×

4-bit: Growing Gap

4-bit shows a widening gap vs. FP16 as the cache grows (+4.6pp at 256, +5.9pp at 512, +5.1pp at 1024), suggesting cumulative quantization error. Best reserved for extreme memory constraints.

Fused Quantized SDPA Kernel

The bottleneck wasn't attention; it was dequantization. Quantized KV caches previously required dequantizing the entire cache to FP16 every step before the attention kernel could read it. Our new Metal kernel (sdpa_vector_quantized) reads packed integer data directly and dequantizes per-thread in registers: zero intermediate buffers, zero extra kernel launches. This recovers 8.5 percentage points of throughput.
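The per-element math the kernel performs can be modeled in a few lines. This is an illustrative pure-Python model of affine quantization, not the Metal source; the real kernel reads packed integer words with per-group scales and biases, but the value reconstruction is the same scale * q + bias:

```python
# Affine quantization model: v ≈ scale * q + bias for integer q.
def quant_affine(values, bits=8):
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) for v in values], scale, lo

# What the fused kernel does per element, in registers, instead of
# materializing an FP16 copy of the whole cache first.
def dequant_affine(packed, scale, bias):
    return [scale * q + bias for q in packed]

vals = [0.0, 0.25, 0.5, 1.0]
packed, scale, bias = quant_affine(vals, bits=8)
recon = dequant_affine(packed, scale, bias)
print(max(abs(v - r) for v, r in zip(vals, recon)))  # small roundtrip error
```

Moving this reconstruction into the attention kernel is what eliminates the whole-cache FP16 staging buffer and the extra kernel launch per step.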

| Strategy | TPS | Δ vs Baseline | Δ vs H₂O Dense |
|---|---|---|---|
| Baseline (unlimited, FP16) | 112.4 | – | – |
| H₂O-heavy (256, FP16) | 112.2 | −0.2% | – |
| H₂O+8bit (fused kernel) | 109.2 | −2.9% | −2.4% |
| H₂O+4bit (fused kernel) | 108.9 | −3.2% | −2.5% |
| H₂O+8bit (dequant, old path) | 99.6 | −11.4% | −10.7% |

Qwen3-8B-4bit, max_kv=256, 200 tokens, 3-run average. All strategies measured via a unified stream_generate with prompt_cache.

−2.9%

Near-Zero Overhead

8-bit quantized KV with the fused kernel costs only −2.9% vs. baseline; the old dequantize path cost −11.4%. That's an 8.5pp recovery.

0%

PPL Unchanged

Fused-kernel PPL = 6.318 (+0.9%), matching the dequantize path (6.327, +1.0%). Same quantization, same quality.

mx.quantize

Residual Cost: Quantize-on-Append

The remaining −2.9% comes from mx.quantize() running every step in each of the 36 layers. This cost is inherent to quantized KV storage, not to the attention kernel.

On-Device Deployment

Does it actually work on a phone? Yes. We ran Qwen3-4B-4bit on a real iPhone 15 Pro Max and iPad Air M3. H₂O (dense) adds zero speed overhead on both devices (19.7 TPS on iPhone, 37.9 on iPad). H₂O+8bit (without the fused kernel) incurs −8% on iPhone and −23% on iPad at max_kv=256 due to per-step dequantization. The fused quantized SDPA kernel (validated on desktop at −2.9%) is expected to substantially reduce these overheads once ported to Swift.

| Strategy | iPhone 15 Pro Max (A17 Pro) TPS | iPhone Peak MB | iPad Air M3 TPS | iPad Peak MB |
|---|---|---|---|---|
| Baseline | 19.5 | 2222 | 37.9 | 2239 |
| Rotating(256) | 19.5 | 2206 | 38.2 | 2216 |
| H₂O(256) | 19.7 | 2341 | 37.9 | 2342 |
| H₂O+8bit(256) | 18.1 | 2333 | 29.0 | 2328 |

iPhone 15 Pro Max (A17 Pro, 8GB)


iPad Air M3 (8GB)

0%

H₂O Dense: Zero Overhead

H₂O on iPhone: 19.7 TPS (baseline 19.5). iPad: 37.9 TPS (baseline 37.9). Score export and eviction are completely invisible on both devices.

−8% / −23%

H₂O+8bit: Dequant Overhead

iPhone: 18.1 TPS (−8%). iPad: 29.0 TPS (−23%). Per-step dequantization of the 256-token cache is the bottleneck. Memory savings are modest at this cache size (−14 MB); the return improves at larger max_kv.

When to Use H₂O

The honest answer: H₂O's value is not speed; it's quality under memory pressure. On a phone with 8GB RAM, the model weights alone take ~2.2GB. As conversations grow longer, the KV cache eats into the remaining memory. Without any cache limit you'll eventually crash. With a sliding window (Rotating), you survive but the AI "forgets" important early context. With H₂O, you survive and the AI remembers what matters.

Best For

Long conversations (1K+ tokens) on memory-limited devices. Multi-turn chat, document QA, coding assistance: any task where the model needs to recall details from earlier in the conversation.

Not Needed For

Short conversations (<256 tokens) or devices with abundant memory (a Mac with 32GB+). If the cache never exceeds your budget, H₂O and Baseline produce identical results.

The Real Comparison

Don't compare H₂O to unlimited Baseline; that's not a fair fight on a phone. Compare H₂O to Rotating, the only other option that prevents OOM. At the same memory budget, H₂O incurs one third of Rotating's quality degradation.

Recommended Config

max_kv = context_length / 4, heavy_budget = max_kv / 2, sink = 4, recent = the rest. For an 8GB iPhone running Qwen3-4B-4bit: max_kv=512 supports ~2K-token conversations with <6% PPL increase.
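That rule of thumb is one line of arithmetic; a sketch (the helper name is ours):

```python
# Recommended H₂O configuration from the rule of thumb above:
# max_kv = context/4, heavy = max_kv/2, sink = 4, recent = the rest.
def recommended_config(context_length, sink=4):
    max_kv = context_length // 4
    heavy = max_kv // 2
    return {"max_kv": max_kv, "heavy": heavy,
            "sink": sink, "recent": max_kv - heavy - sink}

print(recommended_config(2048))
# → {'max_kv': 512, 'heavy': 256, 'sink': 4, 'recent': 252}
```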

Per-Layer Adaptive Budget Allocation

We profiled per-layer attention patterns (entropy, concentration, variance) across all 36 layers and allocated heavy/recent budgets proportionally. Result: no significant improvement over uniform allocation, validating that uniform h = max/2 is near-optimal for Qwen3.

| Strategy | Qwen3-8B-4bit PPL | Δ% | Qwen3-4B-4bit PPL | Δ% |
|---|---|---|---|---|
| Baseline | 6.261 | – | 7.699 | – |
| Rotating(256) | 6.429 | +2.68% | 7.945 | +3.20% |
| H₂O-uniform(256) | 6.320 | +0.94% | 7.807 | +1.40% |
| H₂O-adaptive(256) | 6.324 | +1.01% | 7.803 | +1.35% |

Why Uniform Works

Cross-layer concentration is highly uniform (0.25–0.33, std=0.02). H₂O's EMA score tracking is already inherently adaptive per layer: each layer independently selects its most important tokens within the uniform budget.
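The EMA update itself is one line per cached token; a sketch (the smoothing factor α = 0.9 is an assumption for illustration, not a value the document specifies):

```python
# Exponential moving average of per-token attention scores.
# Heavy hitters are tokens whose EMA stays high across decode steps.
# NOTE: alpha=0.9 is an illustrative choice, not the documented value.
def ema_scores(scores, attn, alpha=0.9):
    return [alpha * s + (1 - alpha) * a for s, a in zip(scores, attn)]

# A token attended to only once decays toward zero over later steps;
# a token attended to repeatedly keeps a high score.
print(ema_scores([1.0, 0.0], [0.0, 1.0], alpha=0.9))
```

Because each layer feeds its own attention weights into this update, ranking by EMA is already a per-layer decision even when the heavy budget is uniform.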

Profiling Insight

Entropy varies more across layers (0.8–3.8): shallow layers (0–6) are dispersed, deep layers (29–35) are concentrated. But this variation does not translate into actionable budget differentiation, a negative result that itself validates the uniform approach.

Future Work

On-Device Fused Kernel Deployment

The fused quantized SDPA kernel (sdpa_vector_quantized) is validated on desktop (−2.9% overhead). Porting it to mlx-swift-lm requires registering the Metal kernel in the Swift build and adding C++ dispatch. We expect the −8% (iPhone) and −23% (iPad) dequantization overheads to be substantially reduced.

Fused Quantize-on-Append

The remaining −2.9% throughput overhead comes from mx.quantize() per step per layer. A Metal kernel that writes quantized output directly during KV append would eliminate this last bottleneck.

Larger-Scale Validation

Validate mixed-precision quantization (8h+4r) across more models, larger cache sizes, and longer contexts to establish general applicability beyond Qwen3.