Running 9B+ language models on 8GB edge devices by streaming layer weights from NVMe storage during inference.
| Model | Quant | PPL | TPS | Memory |
|---|---|---|---|---|
| Qwen3.5-0.8B | 8-bit | 6.185 | 302.7 | 0.80 GB |
| Qwen3.5-2B | 6-bit | 5.398 | 214.3 | 1.49 GB |
| Qwen3.5-2B | 8-bit | 5.361 | 203.0 | 1.93 GB |
| Qwen3.5-4B | 4-bit | 4.959 | 148.1 | 2.34 GB |
| Qwen3.5-4B | 6-bit | 4.759 | 117.1 | 3.32 GB |
| Qwen3.5-9B | 6-bit | 4.485 | 77.9 | 6.90 GB |
| Qwen3.5-9B | 8-bit | 4.455 | 67.3 | 8.97 GB |
| Qwen3.5-27B | 4-bit | 4.010 | 35.0 | 14.34 GB |
Highlighted rows exceed the 8 GB device memory and require layer streaming.
The iPad/iPhone TPS ratio is 1.90–1.93x across all models, matching the theoretical 2x memory-bandwidth ratio (100 vs. 50 GB/s).
Qwen3.5-9B-6bit (6.9 GB of weights) crashes with standard loading, but runs at 0.39 TPS with adaptive residency on an iPad Air M3.
| Model | Weights | iPad M3 | iPhone A17 | Ratio |
|---|---|---|---|---|
| 0.8B-8bit | 0.80 GB | 103.2 TPS | 53.5 TPS | 1.93x |
| 2B-6bit | 1.49 GB | 56.8 TPS | — | — |
| 2B-8bit | 1.93 GB | 43.8 TPS | 23.0 TPS | 1.90x |
| 4B-4bit | 2.34 GB | 35.5 TPS | 18.4 TPS | 1.93x |
| 4B-6bit | 3.32 GB | 25.2 TPS | 13.2 TPS | 1.91x |
| 9B-6bit | 6.90 GB | OOM | — | — |
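The near-2x ratio above is what a purely bandwidth-bound decode model predicts; a quick numeric sketch (nominal bandwidth figures only, not a measured model):

```python
# A purely bandwidth-bound decode reads every weight once per token, so
# TPS ~ memory_bandwidth / model_bytes. Bandwidths are the nominal 100 and
# 50 GB/s figures quoted above.
def predicted_tps(model_gb: float, bw_gbps: float) -> float:
    """Upper bound on decode TPS when inference is memory-bandwidth-bound."""
    return bw_gbps / model_gb

ipad = predicted_tps(0.80, 100.0)   # Qwen3.5-0.8B-8bit on iPad M3
iphone = predicted_tps(0.80, 50.0)  # same model on iPhone A17
print(ipad, iphone, ipad / iphone)  # → 125.0 62.5 2.0 (measured ratio: 1.93x)
```

The measured TPS sits below this bound (103.2 and 53.5 TPS), but the ratio survives because both devices fall short by a similar factor.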
| Model | iPad M3 (baseline) | iPad M3 (stream) | iPhone A17 (baseline) | iPhone A17 (stream) | Mem Savings |
|---|---|---|---|---|---|
| 0.8B-8bit | 103.2 TPS / 799 MB | 5.8 TPS / 293 MB | 53.5 TPS / 792 MB | 4.1 TPS / 293 MB | 63% |
| 4B-4bit | 35.5 TPS / 2330 MB | 2.3 TPS / 436 MB | 18.4 TPS / 2317 MB | 1.8 TPS / 436 MB | 81% |
| 9B-6bit | OOM / crash | 0.28 TPS / 1780 MB | — | — | — |
Auto-compute the optimal resident layer count from available device memory, model size, and KV cache needs. Three priority modes trade off TPS against context length.
Model weights = 6.9 GB, device memory = 8 GB. The full model cannot fit, so streaming is required.
| Mode | Resident | TPS | Memory | Note |
|---|---|---|---|---|
| Full stream | 0/32 | 0.28 | 1,780 MB | Baseline |
| balanced | 2/32 | 0.26 | 5,603 MB | Minimal benefit |
| maxContext | 22/32 | 0.39 | 5,471 MB | Best: +39% |
| maxTPS | 24/32 | 0.03 | 5,802 MB | Swap thrashing! |
| Full resident | Skipped: needs 7,220 MB, only 2,820 MB available | | | |
maxTPS allocated 24 resident layers (5.8 GB), leaving too little free memory. iOS swap pressure caused a 14x slowdown versus full streaming. The algorithm identified the issue only post hoc: memory warnings triggered emergency layer shedding.
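The resident-layer computation can be sketched as follows; this is a minimal illustration, not the actual implementation, and the layer size, KV reserve, and 25% headroom value are assumptions:

```python
MB = 1024**2

def plan_residency(avail_bytes: int, n_layers: int, layer_bytes: int,
                   kv_reserve_bytes: int, headroom: float = 0.25) -> int:
    """Return how many layers to keep resident; the rest stream from NVMe.

    headroom is the fraction of reported-available memory deliberately left
    unused: os_proc_available_memory() is a snapshot, not a guarantee, and
    the maxTPS failure above shows what happens when a plan consumes
    nearly all of it.
    """
    budget = int(avail_bytes * (1.0 - headroom)) - kv_reserve_bytes
    if budget <= 0:
        return 0  # nothing fits: fall back to full streaming
    return min(n_layers, budget // layer_bytes)

# Illustrative numbers: 32 layers at ~216 MB each, 2,820 MB reported free.
print(plan_residency(2820 * MB, 32, 216 * MB, 256 * MB))  # → 8
```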
Model weights = 2.3 GB, device memory = 8 GB. The model fits entirely in memory, so what is the cost of streaming?
| Resident | iPad M3 | vs Baseline | iPhone A17 | vs Baseline | Memory |
|---|---|---|---|---|---|
| 0/32 (full stream) | 2.3 TPS | 6% | 1.8 TPS | 10% | 436 MB |
| 3/32 (10%) | 3.1 TPS | 9% | 2.3 TPS | 13% | ~600 MB |
| 8/32 (25%) | 3.9 TPS | 11% | 3.0 TPS | 16% | 916 MB |
| 16/32 (50%) | 5.2 TPS | 15% | 3.8 TPS | 21% | 1,395 MB |
| 24/32 (75%) | 6.5 TPS | 18% | 4.7 TPS | 25% | 1,875 MB |
| 28/32 (90%) | 10.3 TPS | 29% | 6.0 TPS | 33% | 2,114 MB |
| 32/32 (streaming path, 100% resident) | 22.7 TPS | 64% | 5.9 TPS | 32% | 2,293 MB |
| Baseline (no streaming) | 35.5 TPS | 100% | 18.4 TPS | 100% | 2,330 MB |
Even with all 32 layers resident, the streaming architecture adds 36% overhead on iPad and 68% on iPhone versus baseline. The overhead comes from per-layer weight-loading checks, memory-pressure monitoring, and forced eval synchronization on every layer. For models that fit in memory, standard loading is always better.
| Budget | Resident | Streamed | iPad M3 | iPhone A17 | Memory |
|---|---|---|---|---|---|
| Full stream | 0/32 | 32 | 0.28 TPS | 0.27 TPS | 1.78 GB |
| 2.0 GB | 2/32 | 30 | 0.30 TPS | 0.29 TPS | 2.12 GB |
| 3.0 GB | 8/32 | 24 | 0.37 TPS | 0.34 TPS | 3.12 GB |
| 3.5 GB | 11/32 | 21 | 0.39 TPS | 0.39 TPS | 3.63 GB |
| 4.0 GB | 14/32 | 18 | 0.45 TPS (+61%) | 0.45 TPS (+67%) | 4.13 GB |
| 5.0 GB | 20/32 | 12 | 0.57 TPS | 0.21 TPS* | 5.13 GB |
*iPhone at the 5 GB budget: iOS memory pressure causes swapping and TPS drops below full streaming. Recommended budget: 4 GB.
We present our findings objectively, both the successes and the limitations. These results were measured on real devices, not simulated.
Qwen3.5-9B-6bit (6.9 GB) crashes on 8 GB devices with standard loading. With layer streaming it runs at 0.28–0.39 TPS using only 1.78–5.5 GB. This is the core value proposition: access to models that otherwise cannot run at all.
Across all tested configurations, TPS increases approximately linearly with the number of resident layers. This validates the core formula: TPS = 1 / (T_compute + N_streamed × T_io_per_layer). Each additional resident layer saves one NVMe read per token.
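As a sanity check, the formula can be evaluated with coefficients in the right ballpark for the 9B runs; the 0.53 s and 0.094 s values below are illustrative assumptions, not measured constants:

```python
def tps(t_compute_s: float, n_streamed: int, t_io_per_layer_s: float) -> float:
    """Decode rate when n_streamed layer weights are read from NVMe per token."""
    return 1.0 / (t_compute_s + n_streamed * t_io_per_layer_s)

# Illustrative coefficients: ~0.53 s compute per token, ~0.094 s per streamed layer.
print(round(tps(0.53, 32, 0.094), 2))  # → 0.28  (all 32 layers streamed)
print(round(tps(0.53, 18, 0.094), 2))  # → 0.45  (14 resident, 18 streamed)
```

With plausible coefficients the formula reproduces the measured 0.28 and 0.45 TPS points, which is exactly the linear-in-residency behavior described above.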
iOS memory-warning notifications correctly trigger emergency layer shedding. In the 9B benchmark, the system received warnings and immediately released all 22 resident layers, dropping memory from 5.3 GB to 1.6 GB. The app survived instead of being killed.
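A minimal sketch of that emergency path (the class and hook names are illustrative, not the app's actual API; on iOS this logic would hang off the memory-warning notification):

```python
class LayerResidency:
    """Tracks which transformer layers currently hold weights in memory."""

    def __init__(self, resident_layer_ids):
        self.resident = set(resident_layer_ids)

    def on_memory_warning(self) -> int:
        """Emergency shed: release every resident layer back to streaming.

        Mirrors the 9B run above, where the warning handler released all
        22 resident layers and memory fell from 5.3 GB to 1.6 GB.
        """
        released = len(self.resident)
        self.resident.clear()  # weights are re-read from NVMe on next use
        return released

r = LayerResidency(range(22))
print(r.on_memory_warning())  # → 22
```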
With all 32 layers resident (no actual streaming), the streaming code path is still 36% slower than baseline on iPad (22.7 vs. 35.5 TPS) and 68% slower on iPhone (5.9 vs. 18.4 TPS). The overhead comes from per-layer eval synchronization, weight-loading checks, and memory-pressure monitoring. For models that fit in memory, standard loading is always better.
The maxTPS mode for 9B allocated 24 resident layers (5.8 GB) based on os_proc_available_memory() at startup. But during generation this left too little headroom: iOS began swap-paging, and TPS dropped to 0.03, 14x slower than full streaming (0.28 TPS). The available-memory API reports a snapshot, not a guarantee; other processes and system services reclaim memory unpredictably.
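One mitigation is to treat the reading as perishable and re-check it during generation, shedding layers when headroom shrinks. A sketch, with illustrative names and thresholds:

```python
MB = 1024**2

def adjust_residency(current_resident: int, layer_bytes: int,
                     avail_bytes: int, min_free_bytes: int) -> int:
    """Shed resident layers when the live available-memory reading drops.

    The available-memory figure is treated as a moving target rather than
    a startup snapshot: other processes reclaim memory unpredictably.
    """
    if avail_bytes >= min_free_bytes:
        return current_resident  # headroom intact, keep the current plan
    deficit = min_free_bytes - avail_bytes
    shed = -(-deficit // layer_bytes)  # ceiling division
    return max(0, current_resident - shed)

# Free memory fell to 300 MB against an 800 MB floor: shed 3 of 24 layers.
print(adjust_residency(24, 216 * MB, 300 * MB, 800 * MB))  # → 21
```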
The bandwidth-based formula TPS = 1 / (model_size/mem_bw + N_streamed × layer_size/nvme_bw) showed prediction errors from 50% to over 4000%. The full-stream prediction was off by 2x; the full-resident prediction was 2x too optimistic. The model does not account for safetensors parsing overhead, per-layer eval synchronization cost, actual-vs-theoretical GPU compute efficiency, or iOS memory-management overhead. A more accurate model needs empirically calibrated coefficients.
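Such calibration can be as simple as an ordinary least-squares fit of 1/TPS against N_streamed. A sketch using the iPad M3 points from the memory-budget sweep table:

```python
# Fit 1/TPS = T_compute + N_streamed * T_io by ordinary least squares,
# using the iPad M3 measurements from the budget-sweep table above.
samples = [(32, 0.28), (30, 0.30), (24, 0.37), (21, 0.39), (18, 0.45)]

xs = [n for n, _ in samples]        # streamed layers per token
ys = [1.0 / t for _, t in samples]  # seconds per token
k = len(samples)
mx, my = sum(xs) / k, sum(ys) / k
t_io = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs)
t_compute = my - t_io * mx
print(f"T_compute ~ {t_compute:.2f} s/token, T_io ~ {t_io:.3f} s/layer")
# → T_compute ~ 0.53 s/token, T_io ~ 0.094 s/layer
```

The fitted coefficients absorb the parsing, synchronization, and efficiency effects that the pure-bandwidth formula ignores, which is why a calibrated model predicts this sweep far better.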
iPad M3 (10 GPU cores) benefits dramatically from high residency: 100% resident = 22.7 TPS (64% of baseline). iPhone A17 (6 GPU cores) shows diminishing returns: 100% resident = 5.9 TPS (32% of baseline), barely better than 90% resident (6.0 TPS). The per-layer eval overhead is a larger fraction of total time on iPhone due to fewer GPU cores and lower memory bandwidth. The optimal residency strategy is device-dependent.
Qwen3.5 uses full_attention_interval=4: only 1/4 of the layers have full attention (and thus a KV cache). The other 3/4 are recurrent (GatedDeltaNet) layers with constant-size state. KV-cache memory is therefore ~25% of what a pure-attention model would need, leaving more room for resident layers. Our adaptive algorithm accounts for this via KVCacheConfig, but model-specific tuning is required.
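The resulting KV-cache saving is straightforward to compute; the head count and head dimension below are illustrative placeholders, not Qwen3.5's actual config:

```python
def kv_cache_bytes(n_layers: int, full_attention_interval: int, seq_len: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size when only every full_attention_interval-th layer is
    full attention; recurrent layers carry constant-size state instead."""
    attn_layers = n_layers // full_attention_interval
    per_layer = 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return attn_layers * per_layer

# Head count and dim are illustrative, not Qwen3.5's real configuration.
pure_attn = kv_cache_bytes(32, 1, 4096, 8, 128)  # every layer has a KV cache
hybrid = kv_cache_bytes(32, 4, 4096, 8, 128)     # only 8 of 32 layers do
print(hybrid / pure_attn)  # → 0.25
```

Whatever the head geometry, the ratio is just attention layers over total layers, which is where the ~25% figure comes from.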
| Config | Resident | TPS | Memory | I/O % |
|---|---|---|---|---|
| Qwen3.5-9B-8bit (32 layers) | ||||
| Full Resident | 32/32 | 68.3 | 8.97 GB | 0% |
| Stream 0% | 0/32 | 1.7 | 2.31 GB | 81% |
| Hybrid 50% | 16/32 | 3.2 | 5.73 GB | 77% |
| Hybrid 75% | 24/32 | 5.9 | 7.44 GB | 70% |
| Qwen3.5-27B-4bit (64 layers) | ||||
| Full Resident | 64/64 | 35.6 | 14.34 GB | 0% |
| Stream 0% | 0/64 | 0.9 | 1.73 GB | 80% |
| Hybrid 25% | 16/64 | 1.1 | 4.91 GB | 80% |
| Hybrid 50% | 32/64 | 1.7 | 8.10 GB | 77% |