Exploring zero-copy mmap loading & KV cache pre-allocation for MLX on Apple Silicon
Can llama.cpp's memory strategies improve MLX? We implemented and benchmarked two approaches.
We implemented memory-mapped safetensors loading with Metal buffer sharing via UMA. Results are mixed: dramatic speedups for some large models, but slower than standard loading for small models. MLX's pread-based loader is already highly efficient.
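A minimal sketch of the underlying idea, not the actual OptMLX loader: a standard load copies the file's bytes into a fresh buffer, while an mmap load views the file's pages in place. On Apple Silicon's unified memory, those same pages can back a Metal buffer, which is what the zero-copy path exploits. File layout and sizes here are illustrative stand-ins for a safetensors payload.

```python
import mmap
import os
import tempfile

import numpy as np

# Write a small "weights" file (stand-in for a safetensors payload).
tmp = tempfile.NamedTemporaryFile(delete=False)
weights = np.arange(1024, dtype=np.float32)
tmp.write(weights.tobytes())
tmp.close()

# Standard loading: read() copies every byte into a new buffer.
with open(tmp.name, "rb") as f:
    copied = np.frombuffer(f.read(), dtype=np.float32)

# Zero-copy loading: map the file and view its pages in place.
# No bytes move until a page is actually touched.
src = open(tmp.name, "rb")
mapped = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
view = np.frombuffer(mapped, dtype=np.float32)

assert np.array_equal(view, copied)  # same values, different ownership
os.unlink(tmp.name)
```

The lazy page-in behaviour is also why mmap can lose on small models: the per-page-fault overhead during the first forward pass can exceed the cost of one efficient sequential `pread`.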
We added optional upfront KV cache allocation to flatten memory growth. It works but doesn't help throughput: generation speed is unchanged, while startup latency and memory usage increase. MLX's dynamic growth handles this well already.
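The trade-off can be sketched with two toy cache classes; the shapes, names, and growth strategy here are illustrative, not the MLX internals. Dynamic growth pays a reallocation cost as the sequence lengthens; pre-allocation pays the full context's memory at startup and then writes in place.

```python
import numpy as np

N_HEADS, HEAD_DIM = 8, 64  # illustrative model dimensions

class DynamicKVCache:
    """Grow the key buffer as tokens arrive (MLX-style dynamic growth)."""
    def __init__(self):
        self.keys = np.empty((0, N_HEADS, HEAD_DIM), dtype=np.float16)

    def append(self, k):  # k: (n_new, N_HEADS, HEAD_DIM)
        self.keys = np.concatenate([self.keys, k], axis=0)

class PreallocatedKVCache:
    """Reserve the full context up front, then write in place."""
    def __init__(self, max_len):
        self.keys = np.zeros((max_len, N_HEADS, HEAD_DIM), dtype=np.float16)
        self.offset = 0

    def append(self, k):
        n = k.shape[0]
        self.keys[self.offset:self.offset + n] = k
        self.offset += n

dyn, pre = DynamicKVCache(), PreallocatedKVCache(max_len=4096)
step = np.ones((1, N_HEADS, HEAD_DIM), dtype=np.float16)
for _ in range(16):
    dyn.append(step)
    pre.append(step)

# After 16 tokens: the dynamic cache holds exactly 16 rows, while the
# pre-allocated cache already occupies all 4096 rows' worth of memory.
assert dyn.keys.shape[0] == 16 and pre.offset == 16
```

The benchmark tables below show exactly this shape: peak memory rises with the pre-allocated context length while generation throughput stays flat.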
We discovered that QuantizedLinear.weight.dtype returns uint32 (the packed storage type) rather than the working precision. The fix reads scales.dtype instead, arguably the most practically useful finding of this project.
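The pattern, sketched with stand-in objects: the attribute names mirror mlx.nn's quantized layers, but the classes below are mocks, and `working_dtype` is a hypothetical helper, not an MLX API.

```python
from dataclasses import dataclass

# Mock layers standing in for mlx.nn modules. A quantized layer packs
# its weights into uint32 words; `scales` keeps the compute precision.
@dataclass
class MockQuantizedLinear:
    weight_dtype: str = "uint32"   # packed 4-bit/8-bit storage
    scales_dtype: str = "float16"  # actual working precision

@dataclass
class MockLinear:
    weight_dtype: str = "float16"  # unquantized: weight dtype is correct

def working_dtype(layer):
    """Report the precision the layer actually computes in.

    Naively reading weight.dtype on a quantized layer yields the
    packed-storage type; prefer scales.dtype when it exists.
    """
    if hasattr(layer, "scales_dtype"):
        return layer.scales_dtype
    return layer.weight_dtype

assert working_dtype(MockQuantizedLinear()) == "float16"  # not "uint32"
assert working_dtype(MockLinear()) == "float16"
```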
How zero-copy loading and KV pre-allocation work under the hood
Tested on Apple M1 Max (32 GB) with eight Qwen3 quantized model variants. The data tells an honest story: MLX's defaults are hard to beat.
| Model | Size (GB) | Standard (s) | Mmap (s) | Speedup |
|---|---|---|---|---|
| Qwen3-4B-4bit | 2.11 | 0.101 | 0.131 | 0.77x |
| Qwen3-4B-8bit | 3.98 | 0.184 | 0.246 | 0.75x |
| Qwen3-8B-3bit | 3.34 | 0.146 | 0.075 | 1.95x |
| Qwen3-8B-4bit | 4.29 | 0.352 | 0.328 | 1.07x |
| Qwen3-8B-6bit | 6.20 | 0.637 | 0.435 | 1.46x |
| Qwen3-8B-8bit | 8.11 | 2.572 | 0.125 | 20.65x |
| Qwen3-14B-4bit | 7.74 | 2.323 | 0.535 | 4.34x |
| Qwen3-14B-6bit | 11.18 | 3.701 | 5.388 | 0.69x |
| Model | Load Std (s) | Load Mmap (s) | Prompt Std (t/s) | Prompt Mmap (t/s) | Gen Std (t/s) | Gen Mmap (t/s) | Peak Mem Std (GB) | Peak Mem Mmap (GB) |
|---|---|---|---|---|---|---|---|---|
| 4B-4bit | 0.89 | 0.85 | 189.1 | 188.6 | 56.9 | 58.5 | 2.38 | 2.38 |
| 4B-8bit | 1.31 | 0.98 | 165.8 | 159.7 | 41.0 | 42.0 | 4.37 | 4.37 |
| 8B-3bit | 1.22 | 0.69 | 100.8 | 92.3 | 29.9 | 29.4 | 4.37 | 4.37 |
| 8B-4bit | 1.34 | 1.03 | 89.8 | 100.7 | 30.1 | 30.1 | 4.72 | 8.82* |
| 8B-6bit | 2.00 | 1.29 | 86.7 | 75.7 | 20.7 | 20.7 | 8.82 | 8.82 |
| 8B-8bit | 3.04 | 0.77 | 64.4 | 80.6 | 21.6 | 21.3 | 8.82 | 8.84 |
| 14B-4bit | 1.95 | 1.01 | 43.4 | 43.2 | 16.2 | 16.5 | 8.84 | 11.09* |
| 14B-6bit | 3.41 | 3.07 | 39.7 | 37.0 | 12.1 | 12.0 | 12.08 | 12.16 |
\* Memory doubling observed, likely because the mmap region coexists with materialized tensor copies.
| Model | Mode | First-token latency (ms) | Prompt (t/s) | Gen (t/s) | Peak Mem (GB) | +Mem (GB) |
|---|---|---|---|---|---|---|
| Qwen3-4B-4bit | Dynamic | 260.4 | 200.4 | 62.0 | 2.221 | — |
| Qwen3-4B-4bit | Pre-alloc 2k | 276.2 | 183.5 | 63.0 | 2.471 | +0.250 |
| Qwen3-4B-4bit | Pre-alloc 4k | 285.2 | 171.9 | 62.3 | 2.736 | +0.515 |
| Qwen3-8B-4bit | Dynamic | 419.2 | 100.5 | 31.3 | 4.395 | — |
| Qwen3-8B-4bit | Pre-alloc 2k | 428.0 | 99.1 | 30.3 | 4.633 | +0.238 |
| Qwen3-8B-4bit | Pre-alloc 4k | 436.1 | 98.0 | 29.7 | 4.908 | +0.513 |
| Qwen3-14B-4bit | Dynamic | 859.6 | 44.1 | 16.7 | 7.827 | — |
| Qwen3-14B-4bit | Pre-alloc 2k | 917.4 | 41.1 | 16.0 | 8.101 | +0.274 |
| Qwen3-14B-4bit | Pre-alloc 4k | 901.5 | 41.7 | 16.1 | 8.414 | +0.587 |
Get started with OptMLX in seconds
```shell
# Zero-copy mmap loading
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --use-mmap \
  --prompt "Explain quantum computing"

# KV cache pre-allocation (4096 tokens)
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --pre-allocate-kv 4096 \
  --prompt "Write a long essay about AI"

# Both optimizations combined
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --use-mmap --pre-allocate-kv 4096 \
  --prompt "Hello!"
```
```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Load with mmap zero-copy
model, tokenizer = load(
    "Qwen/Qwen3-8B-MLX-4bit",
    use_mmap=True,
)

# Create a pre-allocated KV cache
cache = make_prompt_cache(
    model,
    max_context_length=4096,
)

# Generate, reusing the pre-allocated cache
response = generate(
    model,
    tokenizer,
    prompt="Hello!",
    max_tokens=200,
    prompt_cache=cache,
)
```
Read the full technical report
AtomGradient, 2026
An exploratory study on whether llama.cpp's memory management strategies can improve MLX. Includes implementation details, benchmark data across 8 models, and analysis of why MLX's existing design is already well-suited to its use case.
Download PDF