Layer-Streaming Offloading 按层流式卸载

Running 9B+ language models on 8GB edge devices by streaming layer weights from NVMe storage during inference. 通过推理过程中从 NVMe 存储流式加载层权重,在 8GB 边缘设备上运行 9B+ 语言模型。

88%
Memory Reduction内存减少
1.92x
BW Scaling Verified带宽缩放验证
9B
Runs on 8GB iPad在 8GB iPad 上运行
0.39
TPS (9B Adaptive)TPS(9B 自适应)

Baseline Results — Mac Studio M2 Ultra基线结果 — Mac Studio M2 Ultra

Model模型Quant量化PPLTPSMemory内存
Qwen3.5-0.8B8-bit6.185302.70.80 GB
Qwen3.5-2B6-bit5.398214.31.49 GB
Qwen3.5-2B8-bit5.361203.01.93 GB
Qwen3.5-4B4-bit4.959148.12.34 GB
Qwen3.5-4B6-bit4.759117.13.32 GB
Qwen3.5-9B6-bit4.48577.96.90 GB
Qwen3.5-9B8-bit4.45567.38.97 GB
Qwen3.5-27B4-bit4.01035.014.34 GB

Highlighted rows exceed 8GB device memory and require layer-streaming. 高亮行超过 8GB 设备内存,需要层流式加载。

Real Device Measurements真实设备测量

Bandwidth Scaling带宽缩放

iPad/iPhone TPS ratio = 1.90–1.93x across all models, matching the theoretical 2x memory bandwidth ratio (100 vs 50 GB/s). 所有模型的 iPad/iPhone TPS 比值 = 1.90–1.93x,完美匹配理论带宽比 2x(100 vs 50 GB/s)。

9B Now Runs on 8GB!9B 现在可在 8GB 上运行!

Qwen3.5-9B-6bit (6.9GB weights) crashes with standard loading, but runs at 0.39 TPS with adaptive residency on iPad Air M3. Qwen3.5-9B-6bit(6.9GB 权重)标准加载会崩溃,但通过自适应驻留以 0.39 TPS 运行于 iPad Air M3。

Device Baselines (Fully Resident)设备基线(完全驻留)

Model模型Weights权重iPad M3iPhone A17Ratio比值
0.8B-8bit0.80 GB103.2 TPS53.5 TPS1.93x
2B-6bit1.49 GB56.8 TPS
2B-8bit1.93 GB43.8 TPS23.0 TPS1.90x
4B-4bit2.34 GB35.5 TPS18.4 TPS1.93x
4B-6bit3.32 GB25.2 TPS13.2 TPS1.91x
9B-6bit6.90 GBOOM

Device Streaming Results设备流式结果

Model模型iPad M3iPhone A17Mem Savings内存节省
Baseline基线Stream流式Baseline基线Stream流式
0.8B-8bit103.2 TPS / 799 MB5.8 TPS / 293 MB53.5 TPS / 792 MB4.1 TPS / 293 MB63%
4B-4bit35.5 TPS / 2330 MB2.3 TPS / 436 MB18.4 TPS / 2317 MB1.8 TPS / 436 MB81%
9B-6bitOOM / Crash0.28 TPS / 1780 MB

Adaptive Hybrid Residency自适应混合驻留

Auto-compute optimal resident layer count based on available device memory, model size, and KV cache needs. Three priority modes trade off TPS vs context length. 基于可用设备内存、模型大小和 KV cache 需求,自动计算最优驻留层数。三种优先级模式在 TPS 和上下文长度之间权衡。

9B-6bit on iPad M3 (Memory-Constrained)9B-6bit 在 iPad M3 上(内存受限)

Model weights = 6.9 GB, device memory = 8 GB. Full model cannot fit — streaming is required. 模型权重 = 6.9 GB,设备内存 = 8 GB。模型无法完全载入 — 必须使用流式。

Mode模式Resident驻留TPSMemory内存Note备注
Full stream全流式0/320.281,780 MBBaseline基线
balanced2/320.265,603 MBMinimal benefit收益极小
maxContext22/320.395,471 MBBest: +39%最佳:+39%
maxTPS24/320.035,802 MBSwap thrashing!交换抖动!
Full resident全驻留Skipped — need 7,220 MB, only 2,820 MB available跳过 — 需要 7,220 MB,仅剩 2,820 MB

maxTPS allocated 24 resident layers (5.8 GB), leaving too little free memory. iOS swap pressure caused 14x slowdown vs full streaming. The algorithm correctly identified the issue post-hoc — memory warnings triggered emergency layer shedding. maxTPS 分配了 24 层驻留(5.8 GB),剩余内存不足。iOS 交换压力导致比全流式慢 14 倍。算法在事后正确识别了问题 — 内存警告触发了紧急层释放。

4B-4bit Sweep: Finding the Optimal Balance (Memory-Sufficient)4B-4bit 扫描:寻找最佳平衡点(内存充足)

Model weights = 2.3 GB, device memory = 8 GB. Model fits entirely — what is the cost of streaming? 模型权重 = 2.3 GB,设备内存 = 8 GB。模型可完全载入 — 流式的代价是什么?

Resident驻留iPad M3vs Baselinevs 基线iPhone A17vs Baselinevs 基线Memory内存
0/32 (full stream全流式)2.3 TPS6%1.8 TPS10%436 MB
3/32 (10%)3.1 TPS9%2.3 TPS13%~600 MB
8/32 (25%)3.9 TPS11%3.0 TPS16%916 MB
16/32 (50%)5.2 TPS15%3.8 TPS21%1,395 MB
24/32 (75%)6.5 TPS18%4.7 TPS25%1,875 MB
28/32 (90%)10.3 TPS29%6.0 TPS33%2,114 MB
32/32 (streaming 100%流式100%)22.7 TPS64%5.9 TPS32%2,293 MB
Baseline (no streaming)基线(无流式)35.5 TPS100%18.4 TPS100%2,330 MB

Even with all 32 layers resident, streaming architecture adds 36% overhead on iPad and 68% on iPhone vs baseline. The overhead comes from per-layer weight loading checks, memory pressure monitoring, and forced eval synchronization per layer. For models that fit in memory, standard loading is always better. 即使全部 32 层驻留,流式架构在 iPad 上增加 36% 开销,在 iPhone 上增加 68%。开销来自每层权重加载检查、内存压力监控和逐层强制 eval 同步。对于能放进内存的模型,标准加载永远更好。

Hybrid Residency: 9B-6bit Budget Sweep混合驻留:9B-6bit 预算扫描

Budget预算Resident驻留Streamed流式iPad M3iPhone A17Memory内存
Full stream全流式0/32320.28 TPS0.27 TPS1.78 GB
2.0 GB2/32300.30 TPS0.29 TPS2.12 GB
3.0 GB8/32240.37 TPS0.34 TPS3.12 GB
3.5 GB11/32210.39 TPS0.39 TPS3.63 GB
4.0 GB14/32180.45 TPS (+61%)0.45 TPS (+67%)4.13 GB
5.0 GB20/32120.57 TPS0.21 TPS*5.13 GB

*iPhone 5GB: iOS memory pressure causes swap, TPS drops below full streaming. Recommended budget: 4 GB. *iPhone 5GB:iOS 内存压力导致交换,TPS 降至全流式以下。推荐预算:4 GB

Key Insights: What We Learned关键洞见:我们学到了什么

We present our findings objectively — both the successes and the limitations. These results are measured on real devices, not simulated. 我们客观地呈现发现 — 包括成功和局限。这些结果均在真实设备上测量,非模拟。

CONFIRMED已验证

Layer-streaming makes impossible models possible层流式让不可能的模型变为可能

Qwen3.5-9B-6bit (6.9 GB) crashes on 8GB devices with standard loading. With layer-streaming, it runs at 0.28–0.39 TPS using only 1.78–5.5 GB. This is the core value proposition: access to models that otherwise cannot run at all. Qwen3.5-9B-6bit(6.9 GB)在 8GB 设备上标准加载会崩溃。通过层流式,仅用 1.78–5.5 GB 即可以 0.28–0.39 TPS 运行。这是核心价值:让原本完全无法运行的模型变得可用

CONFIRMED已验证

TPS scales linearly with resident ratioTPS 与驻留比例线性缩放

Across all tested configurations, TPS increases approximately linearly with the number of resident layers. This validates the core formula: TPS = 1 / (T_compute + N_streamed × T_io_per_layer). Each additional resident layer saves one NVMe read per token. 在所有测试配置中,TPS 随驻留层数近似线性增长。这验证了核心公式:TPS = 1 / (T_compute + N_streamed × T_io_per_layer)。每多驻留一层,每个 token 节省一次 NVMe 读取。

CONFIRMED已验证

Dynamic shedding prevents OOM kills动态释放防止 OOM 崩溃

iOS memory warning notifications correctly trigger emergency layer shedding. In the 9B benchmark, the system received warnings and immediately released all 22 resident layers, dropping memory from 5.3 GB to 1.6 GB. The app survived instead of being killed. iOS 内存警告通知正确触发紧急层释放。在 9B 基准测试中,系统收到警告后立即释放全部 22 层驻留层,内存从 5.3 GB 降至 1.6 GB。App 存活而未被终止。

LIMITATION局限

Streaming overhead is massive — even at 100% residency流式开销巨大 — 即使 100% 驻留

With all 32 layers resident (no actual streaming), the streaming code path is still 36% slower than baseline on iPad (22.7 vs 35.5 TPS) and 68% slower on iPhone (5.9 vs 18.4 TPS). The overhead comes from per-layer eval synchronization, weight loading checks, and memory pressure monitoring. For models that fit in memory, standard loading is always better. 即使全部 32 层驻留(无实际流式),流式代码路径仍比基线在 iPad 上慢 36%(22.7 vs 35.5 TPS)、在 iPhone 上慢 68%(5.9 vs 18.4 TPS)。开销来自逐层 eval 同步、权重加载检查和内存压力监控。对于能放进内存的模型,标准加载永远更好。

LIMITATION局限

Over-aggressive residency causes swap thrashing过度驻留导致交换抖动

The maxTPS mode for 9B allocated 24 resident layers (5.8 GB) based on os_proc_available_memory() at startup. But during generation, this left too little headroom. iOS began swap-paging, and TPS dropped to 0.03 — 14x slower than full streaming (0.28 TPS). The available_memory API reports a snapshot, not a guarantee; other processes and system services reclaim memory unpredictably. maxTPS 模式为 9B 分配了 24 层驻留(5.8 GB),基于启动时的 os_proc_available_memory()。但在生成过程中,剩余空间不足。iOS 开始交换分页,TPS 降至 0.03 — 比全流式(0.28 TPS)慢 14 倍。available_memory API 报告的是快照而非保证;其他进程和系统服务会不可预测地回收内存。

LIMITATION局限

TPS prediction model is inaccurateTPS 预测模型不准确

The bandwidth-based formula TPS = 1 / (model_size/mem_bw + N_streamed × layer_size/nvme_bw) showed prediction errors from 50% to over 4000%. Full-stream prediction was 2x off; full-resident prediction was 2x too optimistic. The model doesn't account for: safetensors parsing overhead, per-layer eval synchronization cost, GPU compute efficiency (actual vs theoretical), and iOS memory management overhead. A more accurate model needs empirical calibration coefficients. 基于带宽的公式 TPS = 1 / (model_size/mem_bw + N_streamed × layer_size/nvme_bw) 预测误差从 50% 到超过 4000%。全流式预测偏差 2 倍;全驻留预测过于乐观 2 倍。模型未考虑:safetensors 解析开销、逐层 eval 同步成本、GPU 计算效率(实际 vs 理论)和 iOS 内存管理开销。更准确的模型需要经验校准系数。

NUANCED复杂

iPad vs iPhone: same technique, very different resultsiPad vs iPhone:相同技术,结果迥异

iPad M3 (10 GPU cores) benefits dramatically from high residency: 100% resident = 22.7 TPS (64% of baseline). iPhone A17 (6 GPU cores) shows diminishing returns: 100% resident = 5.9 TPS (32% of baseline), barely better than 90% resident (6.0 TPS). The per-layer eval overhead is a larger fraction of total time on iPhone due to fewer GPU cores and lower memory bandwidth. Optimal residency strategy is device-dependent. iPad M3(10 GPU 核心)从高驻留中获益显著:100% 驻留 = 22.7 TPS(基线的 64%)。iPhone A17(6 GPU 核心)收益递减:100% 驻留 = 5.9 TPS(基线的 32%),几乎不比 90% 驻留(6.0 TPS)好。由于 GPU 核心更少和内存带宽更低,逐层 eval 开销在 iPhone 上占总时间比例更大。最优驻留策略取决于设备。

NUANCED复杂

Qwen3.5 hybrid architecture helps KV cacheQwen3.5 混合架构有助于 KV cache

Qwen3.5 uses full_attention_interval=4: only 1/4 of layers have full attention (and thus KV cache). The other 3/4 are recurrent (GatedDeltaNet) with constant-size state. This means KV cache memory is ~25% of what a pure-attention model would need, leaving more room for resident layers. Our adaptive algorithm accounts for this via KVCacheConfig, but model-specific tuning is required. Qwen3.5 使用 full_attention_interval=4:只有 1/4 的层有完整注意力(因此有 KV cache)。其余 3/4 是循环层(GatedDeltaNet),状态大小恒定。这意味着 KV cache 内存约为纯注意力模型的 25%,为驻留层留出更多空间。我们的自适应算法通过 KVCacheConfig 考虑了这一点,但需要模型特定的调优。

Streaming: Memory vs TPS (Mac Studio)流式:内存 vs TPS(Mac Studio)

Config配置Resident驻留TPSMemory内存I/O %
Qwen3.5-9B-8bit (32 layers)
Full Resident32/3268.38.97 GB0%
Stream 0%0/321.72.31 GB81%
Hybrid 50%16/323.25.73 GB77%
Hybrid 75%24/325.97.44 GB70%
Qwen3.5-27B-4bit (64 layers)
Full Resident64/6435.614.34 GB0%
Stream 0%0/640.91.73 GB80%
Hybrid 25%16/641.14.91 GB80%
Hybrid 50%32/641.78.10 GB77%