AtomGradient

Bringing AI to the Edge

We focus on deploying AI models directly on consumer devices, making AI faster, more private, and freely accessible to everyone.

Research

MLXLayerStream

Layer-Streaming Offloading: Running 9B+ LLMs on 8GB Edge Devices

Per-layer weight streaming from NVMe storage enables models exceeding device memory to run on iPad and iPhone, with an 88% peak memory reduction and verified bandwidth scaling; the streaming loop is sketched after the highlights below.

  • 60–88% memory reduction: 27B model runs with 1.7 GB peak
  • 9B-6bit OOMs on an 8GB iPad, proving streaming is necessary
  • iPad/iPhone TPS ratio of 1.92× matches the 2× bandwidth ratio
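
The core loop, as a minimal illustrative sketch: weights for one layer at a time are mapped from storage, used, and released. The .npy shard layout and the toy layer math below are assumptions for illustration, not MLXLayerStream's actual format.

```python
# Illustrative per-layer weight streaming (not MLXLayerStream's code).
# Assumes each layer's weights were exported to their own shard, e.g.
# layers/layer_00.npy; the tanh "layer" is a stand-in for a real forward.
import numpy as np

def stream_forward(x: np.ndarray, n_layers: int, shard_dir: str = "layers") -> np.ndarray:
    for i in range(n_layers):
        # mmap the shard: the OS pages weights in from NVMe on demand,
        # keeping peak resident memory near a single layer's working set.
        w = np.load(f"{shard_dir}/layer_{i:02d}.npy", mmap_mode="r")
        x = np.tanh(x @ w)  # stand-in for the real layer computation
        del w               # drop the mapping so its pages can be evicted
    return x
```
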
speculative-moe-research

Does Speculative Decoding Help Mixture-of-Experts?

A 306-run empirical study of Qwen3.5-35B-A3B MoE on an Apple M2 Ultra. SD delivers a 1.18–1.30× speedup despite <4% acceptance, because batch verification amortizes memory-bandwidth cost; the draft-and-verify loop is sketched after the highlights below.

  • 1.30× MoE speedup (0.8B draft, γ=16, <0.2% acceptance)
  • Speedup scales with total params, not active params
  • New mechanism: batch verification amortization
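
In outline, the mechanism looks like the toy greedy loop below: draft γ tokens with a small model, then score all of them in a single batched target pass. `draft` and `target_batch` are hypothetical callables standing in for the real models, and real SD uses probabilistic acceptance rather than this greedy comparison.

```python
# Toy greedy speculative-decoding step (illustration only; the study's
# implementation and acceptance rule are more involved).
def speculative_step(prefix, draft, target_batch, gamma=16):
    # 1) Draft gamma tokens autoregressively with the cheap model.
    ctx, proposals = list(prefix), []
    for _ in range(gamma):
        tok = draft(ctx)
        proposals.append(tok)
        ctx.append(tok)
    # 2) One batched target forward scores all gamma positions at once;
    #    for a large MoE this reuses each loaded expert across positions,
    #    amortizing the memory traffic that dominates decoding.
    verified = target_batch(prefix, proposals)
    # 3) Keep the matching prefix; on the first mismatch take the
    #    target's token and stop. At least one token lands per pass.
    accepted = []
    for prop, ver in zip(proposals, verified):
        accepted.append(ver)
        if prop != ver:
            break
    return accepted
```
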
apple-silicon-llm-inference

Efficient On-Device LLM Inference on Apple Silicon: From Quantization to Speculative Decoding

Systematic benchmarking of 7 GGUF quantization levels and speculative decoding for Qwen3.5 across three Apple Silicon machines, establishing Q6_K as the Pareto-optimal quantization and a ≥2.5× draft/target speed ratio as the rule for SD viability; a back-of-envelope speedup model follows the highlights below.

  • Q6_K Pareto-optimal: 1.68× faster, 59% smaller, 0.54% PPL loss
  • +25.7% throughput via speculative decoding (0.8B→9B, k=4)
  • GGML_RPC cross-device SD: 79% overhead — not production-viable
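
For intuition on the viability rule, the standard analytical speedup model below (an illustration, not the study's measurements) shows how the draft/target cost ratio gates any gain. `alpha`, `gamma`, and `c` are the acceptance rate, draft length, and relative per-token draft cost.

```python
# Back-of-envelope SD speedup (standard analytical form, illustration
# only): expected accepted tokens per verify pass divided by the cost of
# gamma draft steps plus one batched verify, with the verify priced as a
# single target decode step.
def sd_speedup(alpha: float, gamma: int, c: float) -> float:
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_pass = gamma * c + 1
    return expected_tokens / cost_per_pass

print(sd_speedup(alpha=0.8, gamma=4, c=0.4))  # ~1.29x: a 2.5x-faster draft pays off
print(sd_speedup(alpha=0.8, gamma=4, c=1.0))  # ~0.67x: an equal-speed draft never can
```
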
Prism

Cross-Domain Personal Data Integration on Consumer Hardware

Integrating finance, diet, mood, and reading data entirely on consumer Apple Silicon, producing emergent cross-domain insights with zero data leakage; a toy local-only join is sketched after the highlights below.

  • 1.48× cross-domain insight emergence (IIR)
  • 125.5× federation compression, zero data leakage
  • 49.9 TPS real-time inference (35B on M2 Ultra)
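
A toy illustration of the local-only join at Prism's core (the schema and data below are invented for the example, not Prism's real pipeline): records from two domains are merged entirely in-process, so nothing leaves the device.

```python
# Toy cross-domain join, entirely in-process (illustration; not Prism's
# schema). Daily spend and mood logs are merged by date and correlated.
from statistics import correlation  # Python 3.10+

spend = {"2024-05-01": 42.0, "2024-05-02": 18.5, "2024-05-03": 77.0}
mood  = {"2024-05-01": 6,    "2024-05-02": 8,    "2024-05-03": 4}

days = sorted(spend.keys() & mood.keys())
print(correlation([spend[d] for d in days], [mood[d] for d in days]))
```
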
hybrid-batch-prefill-on-ane

ANE Batch Prefill for On-Device Parallel LLM Inference

Fused matrix-vector kernels enabling concurrent ANE batch prefill + GPU decode on Apple Silicon for Qwen3.5 models; the overlap pattern is sketched after the highlights below.

  • 11.3× ANE batch prefill speedup (268 tok/s)
  • 79% power reduction for prefill component
  • <30 ms state transfer overhead
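
The overlap pattern, in a minimal sketch: while one request decodes on the GPU, the next one prefills on the ANE. `ane_prefill` and `gpu_decode` are hypothetical placeholders for the CoreML and MLX sides.

```python
# Pipelined serving sketch (illustration; the real system dispatches
# prefill to the ANE via CoreML and decode to the GPU via MLX).
from concurrent.futures import ThreadPoolExecutor

def serve(requests, ane_prefill, gpu_decode):
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = pool.submit(ane_prefill, requests[0])
        for nxt in requests[1:]:
            kv_cache = pending.result()              # hand off the prefilled state
            pending = pool.submit(ane_prefill, nxt)  # next prompt prefills on ANE...
            yield gpu_decode(kv_cache)               # ...while this one decodes on GPU
        yield gpu_decode(pending.result())
```
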
hybrid-ane-mlx-bench

Disaggregated LLM Inference on Apple Silicon

Benchmarking CoreML ANE prefill + MLX GPU decode for Qwen3.5 on Apple Silicon, comparing four inference strategies; a minimal timing harness is sketched after the highlights below.

  • ANE prefill matches GPU at ~410 tokens
  • 282× GPU power reduction during prefill
  • 4 inference pipelines benchmarked
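
A minimal harness in the spirit of the comparison: each strategy pairs a prefill engine with a decode engine, and the two phases are timed separately. The engine callables are stand-ins, not the project's API.

```python
# Time each (prefill, decode) pairing on one prompt (illustration only).
import time

def bench(strategies: dict, prompt: str, n_new: int = 128) -> dict:
    results = {}
    for name, (prefill, decode) in strategies.items():
        t0 = time.perf_counter()
        state = prefill(prompt)          # e.g. ANE or GPU prefill
        t1 = time.perf_counter()
        for _ in range(n_new):
            state, _tok = decode(state)  # e.g. MLX GPU decode
        t2 = time.perf_counter()
        results[name] = {"prefill_s": t1 - t0,
                         "decode_tps": n_new / (t2 - t1)}
    return results
```
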
swift-qwen3-tts

On-Device Text-to-Speech

Native Swift implementation of Qwen3 TTS 0.6B for real-time, on-device speech synthesis; the real-time-factor metric is illustrated after the highlights below.

  • 67% model compression (2.35 GB → 808 MB)
  • Real-time synthesis (RTF 0.68×)
  • 12 languages supported
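
The headline latency metric is the real-time factor: synthesis wall time divided by the duration of the audio produced, so values below 1 mean faster than real time.

```python
# Real-time factor (RTF): synthesis time / audio duration. RTF 0.68
# means ~5 s of speech is produced in ~3.4 s of compute.
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

print(rtf(synthesis_seconds=3.4, audio_seconds=5.0))  # 0.68
```
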
Gemma-Prune

On-Device Vision Language Model

Multi-stage compression pipeline for deploying the Gemma 3 4B VLM on consumer hardware; one generic pruning stage is sketched after the highlights below.

  • 25% model compression (2.8 GB → 2.1 GB)
  • 110 tok/s text generation
  • 3.4× image processing speedup
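
For intuition, one generic stage of such a pipeline, magnitude pruning, is sketched below; Gemma-Prune's actual multi-stage, VLM-specific recipe is not shown.

```python
# Magnitude pruning of one weight matrix (generic illustration, not
# Gemma-Prune's recipe): zero the smallest-magnitude fraction of weights.
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.25) -> np.ndarray:
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w), k, axis=None)[k]  # k-th smallest |w|
    pruned = w.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned
```
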
OptMLX

MLX Memory Optimization Research

Exploring memory optimization techniques for the MLX framework on Apple Silicon; the mmap loading idea is illustrated after the highlights below.

  • Up to 20× faster mmap loading
  • Zero-copy model loading
  • Comprehensive benchmarks
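
The core mmap idea, in a generic numpy illustration (OptMLX's own benchmarks target MLX loading paths, not numpy): mapping a weight file makes "loading" near-instant because pages are faulted in only when touched.

```python
# mmap-backed loading (generic numpy illustration of the technique):
# the file's pages are mapped, not bulk-copied, so the load returns
# immediately and memory is shared with the OS page cache.
import numpy as np

np.save("weights.npy", np.random.rand(1024, 1024).astype(np.float32))
w = np.load("weights.npy", mmap_mode="r")  # maps the file; no bulk read
print(w[0, :4])                            # touching data faults pages in
```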

About

AtomGradient is an independent research institution building the future of on-device AI. We conduct novel research in model compression, hardware-aware inference, and personal data integration — then ship those breakthroughs as free products that run entirely on your devices.

We believe intelligence belongs at the edge. Every model we build runs locally. Every product we ship is free. Every byte of your data stays on your device.

Edge AI · Privacy · Open Research