AtomGradient

Bringing AI to the Edge

We focus on deploying AI models directly on consumer devices, making AI faster, more private, and freely accessible to everyone.

Research

MLXLayerStream

Layer-Streaming Offloading: Running 9B+ LLMs on 8GB Edge Devices

Per-layer weight streaming from NVMe storage enables models exceeding device memory to run on iPad and iPhone, with an 88% peak memory reduction and verified bandwidth scaling; the streaming loop is sketched after the highlights below.

  • 60–88% memory reduction: 27B model runs with 1.7 GB peak
  • 9B-6bit OOMs on an 8GB iPad, proving streaming is necessary
  • iPad/iPhone TPS ratio of 1.92× matches the 2× bandwidth ratio
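
The core loop, as a minimal illustrative sketch: weights for one layer at a time are mapped from storage, used, and released. The .npy shard layout and the toy layer math below are assumptions for illustration, not MLXLayerStream's actual format.

```python
# Illustrative per-layer weight streaming (not MLXLayerStream's code).
# Assumes each layer's weights were exported to their own shard, e.g.
# layers/layer_00.npy; the tanh "layer" is a stand-in for a real forward.
import numpy as np

def stream_forward(x: np.ndarray, n_layers: int, shard_dir: str = "layers") -> np.ndarray:
    for i in range(n_layers):
        # mmap the shard: the OS pages weights in from NVMe on demand,
        # keeping peak resident memory near a single layer's working set.
        w = np.load(f"{shard_dir}/layer_{i:02d}.npy", mmap_mode="r")
        x = np.tanh(x @ w)  # stand-in for the real layer computation
        del w               # drop the mapping so its pages can be evicted
    return x
```
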
speculative-moe-research

Does Speculative Decoding Help Mixture-of-Experts?

A 306-run empirical study of Qwen3.5-35B-A3B MoE on an Apple M2 Ultra. SD delivers a 1.18–1.30× speedup despite <4% acceptance, because batch verification amortizes memory-bandwidth cost; the draft-and-verify loop is sketched after the highlights below.

  • 1.30× MoE speedup (0.8B draft, γ=16, <0.2% acceptance)
  • Speedup scales with total params, not active params
  • New mechanism: batch verification amortization
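
In outline, the mechanism looks like the toy greedy loop below: draft γ tokens with a small model, then score all of them in a single batched target pass. `draft` and `target_batch` are hypothetical callables standing in for the real models, and real SD uses probabilistic acceptance rather than this greedy comparison.

```python
# Toy greedy speculative-decoding step (illustration only; the study's
# implementation and acceptance rule are more involved).
def speculative_step(prefix, draft, target_batch, gamma=16):
    # 1) Draft gamma tokens autoregressively with the cheap model.
    ctx, proposals = list(prefix), []
    for _ in range(gamma):
        tok = draft(ctx)
        proposals.append(tok)
        ctx.append(tok)
    # 2) One batched target forward scores all gamma positions at once;
    #    for a large MoE this reuses each loaded expert across positions,
    #    amortizing the memory traffic that dominates decoding.
    verified = target_batch(prefix, proposals)
    # 3) Keep the matching prefix; on the first mismatch take the
    #    target's token and stop. At least one token lands per pass.
    accepted = []
    for prop, ver in zip(proposals, verified):
        accepted.append(ver)
        if prop != ver:
            break
    return accepted
```
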
apple-silicon-llm-inference

Efficient On-Device LLM Inference on Apple Silicon: From Quantization to Speculative Decoding

Systematic benchmarking of 7 GGUF quantization levels and speculative decoding for Qwen3.5 across three Apple Silicon machines, establishing Q6_K as the Pareto-optimal quantization and a ≥2.5× draft/target speed ratio as the rule for SD viability; a back-of-envelope speedup model follows the highlights below.

  • Q6_K Pareto-optimal: 1.68× faster, 59% smaller, 0.54% PPL loss
  • +25.7% throughput via speculative decoding (0.8B→9B, k=4)
  • GGML_RPC cross-device SD: 79% overhead — not production-viable
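
For intuition on the viability rule, the standard analytical speedup model below (an illustration, not the study's measurements) shows how the draft/target cost ratio gates any gain. `alpha`, `gamma`, and `c` are the acceptance rate, draft length, and relative per-token draft cost.

```python
# Back-of-envelope SD speedup (standard analytical form, illustration
# only): expected accepted tokens per verify pass divided by the cost of
# gamma draft steps plus one batched verify, with the verify priced as a
# single target decode step.
def sd_speedup(alpha: float, gamma: int, c: float) -> float:
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_pass = gamma * c + 1
    return expected_tokens / cost_per_pass

print(sd_speedup(alpha=0.8, gamma=4, c=0.4))  # ~1.29x: a 2.5x-faster draft pays off
print(sd_speedup(alpha=0.8, gamma=4, c=1.0))  # ~0.67x: an equal-speed draft never can
```
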
Prism

Cross-Domain Personal Data Integration on Consumer Hardware

Integrating finance, diet, mood, and reading data entirely on consumer Apple Silicon, producing emergent cross-domain insights with zero data leakage; a toy local-only join is sketched after the highlights below.

  • 1.48× cross-domain insight emergence (IIR)
  • 125.5× federation compression, zero data leakage
  • 49.9 TPS real-time inference (35B on M2 Ultra)
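
A toy illustration of the local-only join at Prism's core (the schema and data below are invented for the example, not Prism's real pipeline): records from two domains are merged entirely in-process, so nothing leaves the device.

```python
# Toy cross-domain join, entirely in-process (illustration; not Prism's
# schema). Daily spend and mood logs are merged by date and correlated.
from statistics import correlation  # Python 3.10+

spend = {"2024-05-01": 42.0, "2024-05-02": 18.5, "2024-05-03": 77.0}
mood  = {"2024-05-01": 6,    "2024-05-02": 8,    "2024-05-03": 4}

days = sorted(spend.keys() & mood.keys())
print(correlation([spend[d] for d in days], [mood[d] for d in days]))
```
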
hybrid-batch-prefill-on-ane

ANE Batch Prefill for On-Device Parallel LLM Inference

Fused matrix-vector kernels enabling concurrent ANE batch prefill + GPU decode on Apple Silicon for Qwen3.5 models; the overlap pattern is sketched after the highlights below.

  • 11.3× ANE batch prefill speedup (268 tok/s)
  • 79% power reduction for prefill component
  • <30 ms state transfer overhead
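
The overlap pattern, in a minimal sketch: while one request decodes on the GPU, the next one prefills on the ANE. `ane_prefill` and `gpu_decode` are hypothetical placeholders for the CoreML and MLX sides.

```python
# Pipelined serving sketch (illustration; the real system dispatches
# prefill to the ANE via CoreML and decode to the GPU via MLX).
from concurrent.futures import ThreadPoolExecutor

def serve(requests, ane_prefill, gpu_decode):
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = pool.submit(ane_prefill, requests[0])
        for nxt in requests[1:]:
            kv_cache = pending.result()              # hand off the prefilled state
            pending = pool.submit(ane_prefill, nxt)  # next prompt prefills on ANE...
            yield gpu_decode(kv_cache)               # ...while this one decodes on GPU
        yield gpu_decode(pending.result())
```
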
hybrid-ane-mlx-bench

Disaggregated LLM Inference on Apple Silicon

Benchmarking CoreML ANE prefill + MLX GPU decode for Qwen3.5 on Apple Silicon, comparing four inference strategies; a minimal timing harness is sketched after the highlights below.

  • ANE prefill matches GPU at ~410 tokens
  • 282× GPU power reduction during prefill
  • 4 inference pipelines benchmarked
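
A minimal harness in the spirit of the comparison: each strategy pairs a prefill engine with a decode engine, and the two phases are timed separately. The engine callables are stand-ins, not the project's API.

```python
# Time each (prefill, decode) pairing on one prompt (illustration only).
import time

def bench(strategies: dict, prompt: str, n_new: int = 128) -> dict:
    results = {}
    for name, (prefill, decode) in strategies.items():
        t0 = time.perf_counter()
        state = prefill(prompt)          # e.g. ANE or GPU prefill
        t1 = time.perf_counter()
        for _ in range(n_new):
            state, _tok = decode(state)  # e.g. MLX GPU decode
        t2 = time.perf_counter()
        results[name] = {"prefill_s": t1 - t0,
                         "decode_tps": n_new / (t2 - t1)}
    return results
```
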
swift-qwen3-tts

On-Device Text-to-Speech

Native Swift implementation of Qwen3 TTS 0.6B for real-time, on-device speech synthesis; the real-time-factor metric is illustrated after the highlights below.

  • 67% model compression (2.35 GB → 808 MB)
  • Real-time synthesis (RTF 0.68×)
  • 12 languages supported
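
The headline latency metric is the real-time factor: synthesis wall time divided by the duration of the audio produced, so values below 1 mean faster than real time.

```python
# Real-time factor (RTF): synthesis time / audio duration. RTF 0.68
# means ~5 s of speech is produced in ~3.4 s of compute.
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

print(rtf(synthesis_seconds=3.4, audio_seconds=5.0))  # 0.68
```
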
Gemma-Prune

On-Device Vision Language Model

Multi-stage compression pipeline for deploying the Gemma 3 4B VLM on consumer hardware; one generic pruning stage is sketched after the highlights below.

  • 25% model compression (2.8 GB → 2.1 GB)
  • 110 tok/s text generation
  • 3.4× image processing speedup
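
For intuition, one generic stage of such a pipeline, magnitude pruning, is sketched below; Gemma-Prune's actual multi-stage, VLM-specific recipe is not shown.

```python
# Magnitude pruning of one weight matrix (generic illustration, not
# Gemma-Prune's recipe): zero the smallest-magnitude fraction of weights.
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.25) -> np.ndarray:
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w), k, axis=None)[k]  # k-th smallest |w|
    pruned = w.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned
```
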
OptMLX

MLX Memory Optimization Research

Exploring memory optimization techniques for the MLX framework on Apple Silicon; the mmap loading idea is illustrated after the highlights below.

  • Up to 20× faster mmap loading
  • Zero-copy model loading
  • Comprehensive benchmarks
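
The core mmap idea, in a generic numpy illustration (OptMLX's own benchmarks target MLX loading paths, not numpy): mapping a weight file makes "loading" near-instant because pages are faulted in only when touched.

```python
# mmap-backed loading (generic numpy illustration of the technique):
# the file's pages are mapped, not bulk-copied, so the load returns
# immediately and memory is shared with the OS page cache.
import numpy as np

np.save("weights.npy", np.random.rand(1024, 1024).astype(np.float32))
w = np.load("weights.npy", mmap_mode="r")  # maps the file; no bulk read
print(w[0, :4])                            # touching data faults pages in
```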

About

AtomGradient is an independent research institution building the future of on-device AI. We conduct novel research in model compression, hardware-aware inference, and personal data integration — then ship those breakthroughs as free products that run entirely on your devices.

We believe intelligence belongs at the edge. Every model we build runs locally. Every product we ship is free. Every byte of your data stays on your device.

Edge AI · Privacy · Open Research