OptMLX

Exploring zero-copy mmap loading & KV cache pre-allocation for MLX on Apple Silicon

8 Models Benchmarked
3 Benchmark Suites
1 Bug Discovered & Fixed

What We Explored

Can llama.cpp's memory strategies improve MLX? We implemented and benchmarked two approaches.

🗺️

mmap Zero-Copy Loading

We implemented memory-mapped safetensors loading with Metal buffer sharing via UMA. Results are mixed: dramatic speedups for some large models, but slower than standard loading for small models. MLX's pread-based loader is already highly efficient.

📦

KV Cache Pre-Allocation

We added optional upfront KV cache allocation to flatten memory growth. It works but doesn't help throughput: generation speed is unchanged, while startup latency and memory usage increase. MLX's dynamic growth handles this well already.
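The trade-off can be illustrated with a minimal NumPy sketch (illustrative head counts and dimensions; this is not the mlx_lm cache implementation): dynamic growth concatenates, and thus reallocates, on every append, while pre-allocation writes in place into a buffer sized for the full context.

```python
import numpy as np

# Toy shapes for illustration only.
N_HEADS, HEAD_DIM = 4, 8

class DynamicCache:
    def __init__(self):
        self.keys = np.empty((N_HEADS, 0, HEAD_DIM), dtype=np.float16)

    def append(self, k):  # k: (N_HEADS, 1, HEAD_DIM)
        # Reallocates and copies the whole cache each step.
        self.keys = np.concatenate([self.keys, k], axis=1)

class PreallocCache:
    def __init__(self, max_tokens):
        # One upfront allocation covers the entire context window.
        self.keys = np.zeros((N_HEADS, max_tokens, HEAD_DIM), dtype=np.float16)
        self.offset = 0

    def append(self, k):
        # In-place write; no reallocation, but memory is paid at startup.
        self.keys[:, self.offset : self.offset + 1] = k
        self.offset += 1

dyn, pre = DynamicCache(), PreallocCache(max_tokens=16)
for _ in range(5):
    step = np.random.randn(N_HEADS, 1, HEAD_DIM).astype(np.float16)
    dyn.append(step)
    pre.append(step)
print(dyn.keys.shape, pre.offset)  # (4, 5, 8) 5
```

Both variants hold identical data after five steps; the difference is purely when and how the memory is allocated, which matches the benchmark result that throughput is unchanged.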

🐛

Quantized dtype Bug Fix

We discovered that QuantizedLinear.weight.dtype returns uint32 (the packed storage format) rather than the layer's working precision. The fix reads scales.dtype, which does reflect the compute dtype. This is arguably the most practically useful finding of this project.
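The logic of the fix can be shown with a NumPy stand-in for the layer (FakeQuantizedLinear and working_dtype are invented names for illustration, not MLX API): the packed weights report a storage dtype, while the scales carry the precision activations are actually computed in.

```python
from dataclasses import dataclass

import numpy as np

# Illustrative stand-in, not MLX code: a quantized layer stores packed
# uint32 words in `weight`, while `scales` carries the working precision.
@dataclass
class FakeQuantizedLinear:
    weight: np.ndarray   # packed 4-bit values, 8 per uint32 word
    scales: np.ndarray   # per-group scales in the compute dtype

layer = FakeQuantizedLinear(
    weight=np.zeros((16, 4), dtype=np.uint32),
    scales=np.zeros((16, 1), dtype=np.float16),
)

def working_dtype(layer):
    """Report the dtype activations are computed in, not the packed storage."""
    if hasattr(layer, "scales"):     # quantized layer: use scales.dtype
        return layer.scales.dtype
    return layer.weight.dtype        # dense layers are unaffected

print(layer.weight.dtype, working_dtype(layer))  # uint32 float16
```

Any code that branches on a model's dtype (e.g. to pick a cache or sampling precision) would silently get uint32 from the buggy attribute, which is why the fix matters beyond this project.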

Architecture

How zero-copy loading and KV pre-allocation work under the hood

mmap Zero-Copy Data Flow

[Diagram: safetensors file on disk → mmap() virtual memory (MAP_PRIVATE, zero-copy) → Metal buffer shared via UMA. A parent uint8 array spans the mmap region; Tensor A and Tensor B are offset views into it. CPU and GPU share the same physical memory (UMA).]
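The flow can be sketched in plain Python with mmap and NumPy (a toy two-tensor file mimicking the safetensors layout: an 8-byte little-endian header length, a JSON header with byte offsets, then raw data; the file name is illustrative, and this sketch stops at CPU views rather than creating Metal buffers):

```python
import json
import mmap
import struct

import numpy as np

# Write a tiny safetensors-style file with two float32 tensors.
a = np.arange(4, dtype=np.float32)
b = np.ones(2, dtype=np.float32)
header = {
    "a": {"dtype": "F32", "shape": [4], "data_offsets": [0, 16]},
    "b": {"dtype": "F32", "shape": [2], "data_offsets": [16, 24]},
}
header_bytes = json.dumps(header).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(header_bytes)))
    f.write(header_bytes)
    f.write(a.tobytes())
    f.write(b.tobytes())

# Zero-copy load: map the file once, then build numpy views at each
# tensor's offset -- no per-tensor read() and no data copy.
with open("demo.safetensors", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
hlen = struct.unpack("<Q", mm[:8])[0]
meta = json.loads(mm[8 : 8 + hlen])
data_start = 8 + hlen
tensors = {}
for name, info in meta.items():
    lo, hi = info["data_offsets"]
    tensors[name] = np.frombuffer(
        mm, dtype=np.float32, count=(hi - lo) // 4, offset=data_start + lo
    ).reshape(info["shape"])
print(tensors["a"], tensors["b"])
```

On Apple Silicon the same mapped pages can back a Metal buffer because of unified memory, which is the step the sketch omits.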

KV Cache: Dynamic vs Pre-Allocated

[Chart: KV cache memory (0–768 MB) versus tokens generated (0–4096), comparing dynamic growth against pre-allocation.]

Benchmarks

Tested on Apple M1 Max (32 GB) with eight Qwen3 quantized model variants. The data tells an honest story: MLX's defaults are hard to beat.

Model Loading Speed

Loading Time Comparison

| Model          | Size (GB) | Standard (s) | Mmap (s) | Speedup |
|----------------|-----------|--------------|----------|---------|
| Qwen3-4B-4bit  | 2.11      | 0.101        | 0.131    | 0.77x   |
| Qwen3-4B-8bit  | 3.98      | 0.184        | 0.246    | 0.75x   |
| Qwen3-8B-3bit  | 3.34      | 0.146        | 0.075    | 1.95x   |
| Qwen3-8B-4bit  | 4.29      | 0.352        | 0.328    | 1.07x   |
| Qwen3-8B-6bit  | 6.20      | 0.637        | 0.435    | 1.46x   |
| Qwen3-8B-8bit  | 8.11      | 2.572        | 0.125    | 20.65x  |
| Qwen3-14B-4bit | 7.74      | 2.323        | 0.535    | 4.34x   |
| Qwen3-14B-6bit | 11.18     | 3.701        | 5.388    | 0.69x   |

Inference Performance (mmap vs Standard)

| Model    | Load Std (s) | Load Mmap (s) | Prompt Std (t/s) | Prompt Mmap (t/s) | Gen Std (t/s) | Gen Mmap (t/s) | Peak Mem Std (GB) | Peak Mem Mmap (GB) |
|----------|--------------|---------------|------------------|-------------------|---------------|----------------|-------------------|--------------------|
| 4B-4bit  | 0.89         | 0.85          | 189.1            | 188.6             | 56.9          | 58.5           | 2.38              | 2.38               |
| 4B-8bit  | 1.31         | 0.98          | 165.8            | 159.7             | 41.0          | 42.0           | 4.37              | 4.37               |
| 8B-3bit  | 1.22         | 0.69          | 100.8            | 92.3              | 29.9          | 29.4           | 4.37              | 4.37               |
| 8B-4bit  | 1.34         | 1.03          | 89.8             | 100.7             | 30.1          | 30.1           | 4.72              | 8.82*              |
| 8B-6bit  | 2.00         | 1.29          | 86.7             | 75.7              | 20.7          | 20.7           | 8.82              | 8.82               |
| 8B-8bit  | 3.04         | 0.77          | 64.4             | 80.6              | 21.6          | 21.3           | 8.82              | 8.84               |
| 14B-4bit | 1.95         | 1.01          | 43.4             | 43.2              | 16.2          | 16.5           | 8.84              | 11.09*             |
| 14B-6bit | 3.41         | 3.07          | 39.7             | 37.0              | 12.1          | 12.0           | 12.08             | 12.16              |

* Memory doubling observed — likely due to mmap region coexisting with materialized tensor copies.

KV Cache Pre-Allocation Impact

| Model          | Mode         | First-token latency (ms) | Prompt (t/s) | Gen (t/s) | Peak Mem (GB) | +Mem (GB) |
|----------------|--------------|--------------------------|--------------|-----------|---------------|-----------|
| Qwen3-4B-4bit  | Dynamic      | 260.4                    | 200.4        | 62.0      | 2.221         |           |
|                | Pre-alloc 2k | 276.2                    | 183.5        | 63.0      | 2.471         | +0.250    |
|                | Pre-alloc 4k | 285.2                    | 171.9        | 62.3      | 2.736         | +0.515    |
| Qwen3-8B-4bit  | Dynamic      | 419.2                    | 100.5        | 31.3      | 4.395         |           |
|                | Pre-alloc 2k | 428.0                    | 99.1         | 30.3      | 4.633         | +0.238    |
|                | Pre-alloc 4k | 436.1                    | 98.0         | 29.7      | 4.908         | +0.513    |
| Qwen3-14B-4bit | Dynamic      | 859.6                    | 44.1         | 16.7      | 7.827         |           |
|                | Pre-alloc 2k | 917.4                    | 41.1         | 16.0      | 8.101         | +0.274    |
|                | Pre-alloc 4k | 901.5                    | 41.7         | 16.1      | 8.414         | +0.587    |
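The +Mem column is roughly what a back-of-envelope KV cache size estimate predicts. Assuming an approximate Qwen3-8B configuration (36 layers, 8 KV heads, head dimension 128; these figures are our assumption, not taken from the table) and fp16 cache entries:

```python
# Back-of-envelope KV cache size for a 4096-token pre-allocation.
# Assumed, approximate Qwen3-8B config -- not measured values.
layers, kv_heads, head_dim = 36, 8, 128
tokens, bytes_per_elem = 4096, 2  # fp16 entries
kv = 2                            # one K and one V tensor per layer

cache_bytes = kv * layers * kv_heads * head_dim * tokens * bytes_per_elem
print(f"{cache_bytes / 2**30:.2f} GiB")  # 0.56 GiB
```

That estimate (~0.56 GiB) lands close to the observed +0.513 GB for Qwen3-8B-4bit at 4k, which supports reading the +Mem column as the cache buffer itself rather than fragmentation or other overhead.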

Quick Start

Get started with OptMLX in seconds

CLI Usage
# Zero-copy mmap loading
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --use-mmap \
  --prompt "Explain quantum computing"

# KV cache pre-allocation (4096 tokens)
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --pre-allocate-kv 4096 \
  --prompt "Write a long essay about AI"

# Both optimizations combined
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --use-mmap --pre-allocate-kv 4096 \
  --prompt "Hello!"
Python API
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Load with mmap zero-copy
model, tokenizer = load(
    "Qwen/Qwen3-8B-MLX-4bit",
    use_mmap=True
)

# Create pre-allocated KV cache
cache = make_prompt_cache(
    model,
    max_context_length=4096
)

# Generate with the pre-allocated cache
response = generate(
    model, tokenizer,
    prompt="Hello!",
    max_tokens=200,
    prompt_cache=cache
)

Paper

Read the full technical report

Exploring Zero-Copy mmap Loading and KV Cache Pre-Allocation for MLX on Apple Silicon

AtomGradient, 2026

An exploratory study on whether llama.cpp's memory management strategies can improve MLX. Includes implementation details, benchmark data across 8 models, and analysis of why MLX's existing design is already well-suited to its use case.

Download PDF
@article{optmlx2026,
  title  = {Exploring Zero-Copy mmap Loading and KV Cache Pre-Allocation for MLX on Apple Silicon},
  author = {AtomGradient},
  year   = {2026},
  url    = {https://github.com/AtomGradient/OptMLX}
}