Running 9B+ language models on 8GB edge devices by streaming layer weights from NVMe storage during inference.
| Model | Quant | PPL | TPS | Memory |
|---|---|---|---|---|
| Qwen3.5-0.8B | 8-bit | 6.185 | 302.7 | 0.80 GB |
| Qwen3.5-2B | 6-bit | 5.398 | 214.3 | 1.49 GB |
| Qwen3.5-2B | 8-bit | 5.361 | 203.0 | 1.93 GB |
| Qwen3.5-4B | 4-bit | 4.959 | 148.1 | 2.34 GB |
| Qwen3.5-4B | 6-bit | 4.759 | 117.1 | 3.32 GB |
| Qwen3.5-9B | 6-bit | 4.485 | 77.9 | 6.90 GB |
| Qwen3.5-9B | 8-bit | 4.455 | 67.3 | 8.97 GB |
| Qwen3.5-27B | 4-bit | 4.010 | 35.0 | 14.34 GB |
Highlighted rows exceed the 8 GB device memory and require layer streaming.
The iPad/iPhone TPS ratio is 1.90–1.93x across all models, matching the theoretical 2x memory-bandwidth ratio (100 vs. 50 GB/s).
Qwen3.5-9B-6bit (6.9 GB of weights) crashes with standard loading, but runs at 0.39 TPS with adaptive residency on an iPad Air M3.
| Model | Weights | iPad M3 | iPhone A17 | Ratio |
|---|---|---|---|---|
| 0.8B-8bit | 0.80 GB | 103.2 TPS | 53.5 TPS | 1.93x |
| 2B-6bit | 1.49 GB | 56.8 TPS | — | — |
| 2B-8bit | 1.93 GB | 43.8 TPS | 23.0 TPS | 1.90x |
| 4B-4bit | 2.34 GB | 35.5 TPS | 18.4 TPS | 1.93x |
| 4B-6bit | 3.32 GB | 25.2 TPS | 13.2 TPS | 1.91x |
| 9B-6bit | 6.90 GB | OOM | — | — |
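The near-2x ratio above is what a purely bandwidth-bound decode model predicts; a quick numeric sketch (nominal bandwidth figures only, not a measured model):

```python
# A purely bandwidth-bound decode reads every weight once per token, so
# TPS ~ memory_bandwidth / model_bytes. Bandwidths are the nominal 100 and
# 50 GB/s figures quoted above.
def predicted_tps(model_gb: float, bw_gbps: float) -> float:
    """Upper bound on decode TPS when inference is memory-bandwidth-bound."""
    return bw_gbps / model_gb

ipad = predicted_tps(0.80, 100.0)   # Qwen3.5-0.8B-8bit on iPad M3
iphone = predicted_tps(0.80, 50.0)  # same model on iPhone A17
print(ipad, iphone, ipad / iphone)  # → 125.0 62.5 2.0 (measured ratio: 1.93x)
```

The measured TPS sits below this bound (103.2 and 53.5 TPS), but the ratio survives because both devices fall short by a similar factor.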
| Model | iPad M3 (baseline) | iPad M3 (stream) | iPhone A17 (baseline) | iPhone A17 (stream) | Mem Savings |
|---|---|---|---|---|---|
| 0.8B-8bit | 103.2 TPS / 799 MB | 5.8 TPS / 293 MB | 53.5 TPS / 792 MB | 4.1 TPS / 293 MB | 63% |
| 4B-4bit | 35.5 TPS / 2330 MB | 2.3 TPS / 436 MB | 18.4 TPS / 2317 MB | 1.8 TPS / 436 MB | 81% |
| 9B-6bit | OOM / crash | 0.28 TPS / 1780 MB | — | — | — |
Auto-compute the optimal resident layer count from available device memory, model size, and KV cache needs. Three priority modes trade off TPS against context length.
Model weights = 6.9 GB, device memory = 8 GB. The full model cannot fit, so streaming is required.
| Mode | Resident | TPS | Memory | Note |
|---|---|---|---|---|
| Full stream | 0/32 | 0.28 | 1,780 MB | Baseline |
| balanced | 2/32 | 0.26 | 5,603 MB | Minimal benefit |
| maxContext | 22/32 | 0.39 | 5,471 MB | Best: +39% |
| maxTPS | 24/32 | 0.03 | 5,802 MB | Swap thrashing! |
| Full resident | Skipped: needs 7,220 MB, only 2,820 MB available | | | |
maxTPS allocated 24 resident layers (5.8 GB), leaving too little free memory. iOS swap pressure caused a 14x slowdown versus full streaming. The algorithm identified the issue only post hoc: memory warnings triggered emergency layer shedding.
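The resident-layer computation can be sketched as follows; this is a minimal illustration, not the actual implementation, and the layer size, KV reserve, and 25% headroom value are assumptions:

```python
MB = 1024**2

def plan_residency(avail_bytes: int, n_layers: int, layer_bytes: int,
                   kv_reserve_bytes: int, headroom: float = 0.25) -> int:
    """Return how many layers to keep resident; the rest stream from NVMe.

    headroom is the fraction of reported-available memory deliberately left
    unused: os_proc_available_memory() is a snapshot, not a guarantee, and
    the maxTPS failure above shows what happens when a plan consumes
    nearly all of it.
    """
    budget = int(avail_bytes * (1.0 - headroom)) - kv_reserve_bytes
    if budget <= 0:
        return 0  # nothing fits: fall back to full streaming
    return min(n_layers, budget // layer_bytes)

# Illustrative numbers: 32 layers at ~216 MB each, 2,820 MB reported free.
print(plan_residency(2820 * MB, 32, 216 * MB, 256 * MB))  # → 8
```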
Model weights = 2.3 GB, device memory = 8 GB. The model fits entirely in memory, so what is the cost of streaming?
| Resident | iPad M3 | vs Baseline | iPhone A17 | vs Baseline | Memory |
|---|---|---|---|---|---|
| 0/32 (full stream) | 2.3 TPS | 6% | 1.8 TPS | 10% | 436 MB |
| 3/32 (10%) | 3.1 TPS | 9% | 2.3 TPS | 13% | ~600 MB |
| 8/32 (25%) | 3.9 TPS | 11% | 3.0 TPS | 16% | 916 MB |
| 16/32 (50%) | 5.2 TPS | 15% | 3.8 TPS | 21% | 1,395 MB |
| 24/32 (75%) | 6.5 TPS | 18% | 4.7 TPS | 25% | 1,875 MB |
| 28/32 (90%) | 10.3 TPS | 29% | 6.0 TPS | 33% | 2,114 MB |
| 32/32 (streaming path, 100% resident) | 22.7 TPS | 64% | 5.9 TPS | 32% | 2,293 MB |
| Baseline (no streaming) | 35.5 TPS | 100% | 18.4 TPS | 100% | 2,330 MB |
Even with all 32 layers resident, the streaming architecture adds 36% overhead on iPad and 68% on iPhone versus baseline. The overhead comes from per-layer weight-loading checks, memory-pressure monitoring, and forced eval synchronization on every layer. For models that fit in memory, standard loading is always better.
| Budget | Resident | Streamed | iPad M3 | iPhone A17 | Memory |
|---|---|---|---|---|---|
| Full stream | 0/32 | 32 | 0.28 TPS | 0.27 TPS | 1.78 GB |
| 2.0 GB | 2/32 | 30 | 0.30 TPS | 0.29 TPS | 2.12 GB |
| 3.0 GB | 8/32 | 24 | 0.37 TPS | 0.34 TPS | 3.12 GB |
| 3.5 GB | 11/32 | 21 | 0.39 TPS | 0.39 TPS | 3.63 GB |
| 4.0 GB | 14/32 | 18 | 0.45 TPS (+61%) | 0.45 TPS (+67%) | 4.13 GB |
| 5.0 GB | 20/32 | 12 | 0.57 TPS | 0.21 TPS* | 5.13 GB |
*iPhone at the 5 GB budget: iOS memory pressure causes swapping and TPS drops below full streaming. Recommended budget: 4 GB.
We present our findings objectively, both the successes and the limitations. These results were measured on real devices, not simulated.
Qwen3.5-9B-6bit (6.9 GB) crashes on 8 GB devices with standard loading. With layer streaming it runs at 0.28–0.39 TPS using only 1.78–5.5 GB. This is the core value proposition: access to models that otherwise cannot run at all.
Across all tested configurations, TPS increases approximately linearly with the number of resident layers. This validates the core formula: TPS = 1 / (T_compute + N_streamed × T_io_per_layer). Each additional resident layer saves one NVMe read per token.
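As a sanity check, the formula can be evaluated with coefficients in the right ballpark for the 9B runs; the 0.53 s and 0.094 s values below are illustrative assumptions, not measured constants:

```python
def tps(t_compute_s: float, n_streamed: int, t_io_per_layer_s: float) -> float:
    """Decode rate when n_streamed layer weights are read from NVMe per token."""
    return 1.0 / (t_compute_s + n_streamed * t_io_per_layer_s)

# Illustrative coefficients: ~0.53 s compute per token, ~0.094 s per streamed layer.
print(round(tps(0.53, 32, 0.094), 2))  # → 0.28  (all 32 layers streamed)
print(round(tps(0.53, 18, 0.094), 2))  # → 0.45  (14 resident, 18 streamed)
```

With plausible coefficients the formula reproduces the measured 0.28 and 0.45 TPS points, which is exactly the linear-in-residency behavior described above.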
iOS memory-warning notifications correctly trigger emergency layer shedding. In the 9B benchmark, the system received warnings and immediately released all 22 resident layers, dropping memory from 5.3 GB to 1.6 GB. The app survived instead of being killed.
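A minimal sketch of that emergency path (the class and hook names are illustrative, not the app's actual API; on iOS this logic would hang off the memory-warning notification):

```python
class LayerResidency:
    """Tracks which transformer layers currently hold weights in memory."""

    def __init__(self, resident_layer_ids):
        self.resident = set(resident_layer_ids)

    def on_memory_warning(self) -> int:
        """Emergency shed: release every resident layer back to streaming.

        Mirrors the 9B run above, where the warning handler released all
        22 resident layers and memory fell from 5.3 GB to 1.6 GB.
        """
        released = len(self.resident)
        self.resident.clear()  # weights are re-read from NVMe on next use
        return released

r = LayerResidency(range(22))
print(r.on_memory_warning())  # → 22
```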
With all 32 layers resident (no actual streaming), the streaming code path is still 36% slower than baseline on iPad (22.7 vs. 35.5 TPS) and 68% slower on iPhone (5.9 vs. 18.4 TPS). The overhead comes from per-layer eval synchronization, weight-loading checks, and memory-pressure monitoring. For models that fit in memory, standard loading is always better.
The maxTPS mode for 9B allocated 24 resident layers (5.8 GB) based on os_proc_available_memory() at startup. But during generation this left too little headroom: iOS began swap-paging, and TPS dropped to 0.03, 14x slower than full streaming (0.28 TPS). The available-memory API reports a snapshot, not a guarantee; other processes and system services reclaim memory unpredictably.
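One mitigation is to treat the reading as perishable and re-check it during generation, shedding layers when headroom shrinks. A sketch, with illustrative names and thresholds:

```python
MB = 1024**2

def adjust_residency(current_resident: int, layer_bytes: int,
                     avail_bytes: int, min_free_bytes: int) -> int:
    """Shed resident layers when the live available-memory reading drops.

    The available-memory figure is treated as a moving target rather than
    a startup snapshot: other processes reclaim memory unpredictably.
    """
    if avail_bytes >= min_free_bytes:
        return current_resident  # headroom intact, keep the current plan
    deficit = min_free_bytes - avail_bytes
    shed = -(-deficit // layer_bytes)  # ceiling division
    return max(0, current_resident - shed)

# Free memory fell to 300 MB against an 800 MB floor: shed 3 of 24 layers.
print(adjust_residency(24, 216 * MB, 300 * MB, 800 * MB))  # → 21
```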
The bandwidth-based formula TPS = 1 / (model_size/mem_bw + N_streamed × layer_size/nvme_bw) showed prediction errors from 50% to over 4000%. The full-stream prediction was off by 2x; the full-resident prediction was 2x too optimistic. The model does not account for safetensors parsing overhead, per-layer eval synchronization cost, actual-vs-theoretical GPU compute efficiency, or iOS memory-management overhead. A more accurate model needs empirically calibrated coefficients.
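Such calibration can be as simple as an ordinary least-squares fit of 1/TPS against N_streamed. A sketch using the iPad M3 points from the memory-budget sweep table:

```python
# Fit 1/TPS = T_compute + N_streamed * T_io by ordinary least squares,
# using the iPad M3 measurements from the budget-sweep table above.
samples = [(32, 0.28), (30, 0.30), (24, 0.37), (21, 0.39), (18, 0.45)]

xs = [n for n, _ in samples]        # streamed layers per token
ys = [1.0 / t for _, t in samples]  # seconds per token
k = len(samples)
mx, my = sum(xs) / k, sum(ys) / k
t_io = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs)
t_compute = my - t_io * mx
print(f"T_compute ~ {t_compute:.2f} s/token, T_io ~ {t_io:.3f} s/layer")
# → T_compute ~ 0.53 s/token, T_io ~ 0.094 s/layer
```

The fitted coefficients absorb the parsing, synchronization, and efficiency effects that the pure-bandwidth formula ignores, which is why a calibrated model predicts this sweep far better.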
iPad M3 (10 GPU cores) benefits dramatically from high residency: 100% resident = 22.7 TPS (64% of baseline). iPhone A17 (6 GPU cores) shows diminishing returns: 100% resident = 5.9 TPS (32% of baseline), barely better than 90% resident (6.0 TPS). The per-layer eval overhead is a larger fraction of total time on iPhone due to fewer GPU cores and lower memory bandwidth. The optimal residency strategy is device-dependent.
Qwen3.5 uses full_attention_interval=4: only 1/4 of the layers have full attention (and thus a KV cache). The other 3/4 are recurrent (GatedDeltaNet) layers with constant-size state. KV-cache memory is therefore ~25% of what a pure-attention model would need, leaving more room for resident layers. Our adaptive algorithm accounts for this via KVCacheConfig, but model-specific tuning is required.
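The resulting KV-cache saving is straightforward to compute; the head count and head dimension below are illustrative placeholders, not Qwen3.5's actual config:

```python
def kv_cache_bytes(n_layers: int, full_attention_interval: int, seq_len: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size when only every full_attention_interval-th layer is
    full attention; recurrent layers carry constant-size state instead."""
    attn_layers = n_layers // full_attention_interval
    per_layer = 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return attn_layers * per_layer

# Head count and dim are illustrative, not Qwen3.5's real configuration.
pure_attn = kv_cache_bytes(32, 1, 4096, 8, 128)  # every layer has a KV cache
hybrid = kv_cache_bytes(32, 4, 4096, 8, 128)     # only 8 of 32 layers do
print(hybrid / pure_attn)  # → 0.25
```

Whatever the head geometry, the ratio is just attention layers over total layers, which is where the ~25% figure comes from.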
| Config | Resident | TPS | Memory | I/O % |
|---|---|---|---|---|
| Qwen3.5-9B-8bit (32 layers) | ||||
| Full Resident | 32/32 | 68.3 | 8.97 GB | 0% |
| Stream 0% | 0/32 | 1.7 | 2.31 GB | 81% |
| Hybrid 50% | 16/32 | 3.2 | 5.73 GB | 77% |
| Hybrid 75% | 24/32 | 5.9 | 7.44 GB | 70% |
| Qwen3.5-27B-4bit (64 layers) | ||||
| Full Resident | 64/64 | 35.6 | 14.34 GB | 0% |
| Stream 0% | 0/64 | 0.9 | 1.73 GB | 80% |
| Hybrid 25% | 16/64 | 1.1 | 4.91 GB | 80% |
| Hybrid 50% | 32/64 | 1.7 | 8.10 GB | 77% |