Exploring zero-copy mmap loading & KV cache pre-allocation for MLX on Apple Silicon
Can llama.cpp's memory strategies improve MLX? We implemented and benchmarked two approaches.
We implemented memory-mapped safetensors loading with Metal buffer sharing via UMA. Results are mixed: dramatic speedups for some large models, but slower than standard loading for small models. MLX's pread-based loader is already highly efficient.
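A minimal sketch of the underlying idea, not the actual OptMLX loader: a standard load copies the file's bytes into a fresh buffer, while an mmap load views the file's pages in place. On Apple Silicon's unified memory, those same pages can back a Metal buffer, which is what the zero-copy path exploits. File layout and sizes here are illustrative stand-ins for a safetensors payload.

```python
import mmap
import os
import tempfile

import numpy as np

# Write a small "weights" file (stand-in for a safetensors payload).
tmp = tempfile.NamedTemporaryFile(delete=False)
weights = np.arange(1024, dtype=np.float32)
tmp.write(weights.tobytes())
tmp.close()

# Standard loading: read() copies every byte into a new buffer.
with open(tmp.name, "rb") as f:
    copied = np.frombuffer(f.read(), dtype=np.float32)

# Zero-copy loading: map the file and view its pages in place.
# No bytes move until a page is actually touched.
src = open(tmp.name, "rb")
mapped = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
view = np.frombuffer(mapped, dtype=np.float32)

assert np.array_equal(view, copied)  # same values, different ownership
os.unlink(tmp.name)
```

The lazy page-in behaviour is also why mmap can lose on small models: the per-page-fault overhead during the first forward pass can exceed the cost of one efficient sequential `pread`.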
We added optional upfront KV cache allocation to flatten memory growth. It works but doesn't help throughput: generation speed is unchanged, while startup latency and memory usage increase. MLX's dynamic growth handles this well already.
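The trade-off can be sketched with two toy cache classes; the shapes, names, and growth strategy here are illustrative, not the MLX internals. Dynamic growth pays a reallocation cost as the sequence lengthens; pre-allocation pays the full context's memory at startup and then writes in place.

```python
import numpy as np

N_HEADS, HEAD_DIM = 8, 64  # illustrative model dimensions

class DynamicKVCache:
    """Grow the key buffer as tokens arrive (MLX-style dynamic growth)."""
    def __init__(self):
        self.keys = np.empty((0, N_HEADS, HEAD_DIM), dtype=np.float16)

    def append(self, k):  # k: (n_new, N_HEADS, HEAD_DIM)
        self.keys = np.concatenate([self.keys, k], axis=0)

class PreallocatedKVCache:
    """Reserve the full context up front, then write in place."""
    def __init__(self, max_len):
        self.keys = np.zeros((max_len, N_HEADS, HEAD_DIM), dtype=np.float16)
        self.offset = 0

    def append(self, k):
        n = k.shape[0]
        self.keys[self.offset:self.offset + n] = k
        self.offset += n

dyn, pre = DynamicKVCache(), PreallocatedKVCache(max_len=4096)
step = np.ones((1, N_HEADS, HEAD_DIM), dtype=np.float16)
for _ in range(16):
    dyn.append(step)
    pre.append(step)

# After 16 tokens: the dynamic cache holds exactly 16 rows, while the
# pre-allocated cache already occupies all 4096 rows' worth of memory.
assert dyn.keys.shape[0] == 16 and pre.offset == 16
```

The benchmark tables below show exactly this shape: peak memory rises with the pre-allocated context length while generation throughput stays flat.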
We discovered that QuantizedLinear.weight.dtype returns uint32 (the packed storage type) rather than the working precision. The fix reads scales.dtype instead, arguably the most practically useful finding of this project.
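The pattern, sketched with stand-in objects: the attribute names mirror mlx.nn's quantized layers, but the classes below are mocks, and `working_dtype` is a hypothetical helper, not an MLX API.

```python
from dataclasses import dataclass

# Mock layers standing in for mlx.nn modules. A quantized layer packs
# its weights into uint32 words; `scales` keeps the compute precision.
@dataclass
class MockQuantizedLinear:
    weight_dtype: str = "uint32"   # packed 4-bit/8-bit storage
    scales_dtype: str = "float16"  # actual working precision

@dataclass
class MockLinear:
    weight_dtype: str = "float16"  # unquantized: weight dtype is correct

def working_dtype(layer):
    """Report the precision the layer actually computes in.

    Naively reading weight.dtype on a quantized layer yields the
    packed-storage type; prefer scales.dtype when it exists.
    """
    if hasattr(layer, "scales_dtype"):
        return layer.scales_dtype
    return layer.weight_dtype

assert working_dtype(MockQuantizedLinear()) == "float16"  # not "uint32"
assert working_dtype(MockLinear()) == "float16"
```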
How zero-copy loading and KV pre-allocation work under the hood
Tested on Apple M1 Max (32 GB) with eight Qwen3 quantized model variants. The data tells an honest story: MLX's defaults are hard to beat.
| Model | Size (GB) | Standard (s) | Mmap (s) | Speedup |
|---|---|---|---|---|
| Qwen3-4B-4bit | 2.11 | 0.101 | 0.131 | 0.77x |
| Qwen3-4B-8bit | 3.98 | 0.184 | 0.246 | 0.75x |
| Qwen3-8B-3bit | 3.34 | 0.146 | 0.075 | 1.95x |
| Qwen3-8B-4bit | 4.29 | 0.352 | 0.328 | 1.07x |
| Qwen3-8B-6bit | 6.20 | 0.637 | 0.435 | 1.46x |
| Qwen3-8B-8bit | 8.11 | 2.572 | 0.125 | 20.65x |
| Qwen3-14B-4bit | 7.74 | 2.323 | 0.535 | 4.34x |
| Qwen3-14B-6bit | 11.18 | 3.701 | 5.388 | 0.69x |
| Model | Load Std (s) | Load Mmap (s) | Prompt Std (t/s) | Prompt Mmap (t/s) | Gen Std (t/s) | Gen Mmap (t/s) | Peak Mem Std (GB) | Peak Mem Mmap (GB) |
|---|---|---|---|---|---|---|---|---|
| 4B-4bit | 0.89 | 0.85 | 189.1 | 188.6 | 56.9 | 58.5 | 2.38 | 2.38 |
| 4B-8bit | 1.31 | 0.98 | 165.8 | 159.7 | 41.0 | 42.0 | 4.37 | 4.37 |
| 8B-3bit | 1.22 | 0.69 | 100.8 | 92.3 | 29.9 | 29.4 | 4.37 | 4.37 |
| 8B-4bit | 1.34 | 1.03 | 89.8 | 100.7 | 30.1 | 30.1 | 4.72 | 8.82* |
| 8B-6bit | 2.00 | 1.29 | 86.7 | 75.7 | 20.7 | 20.7 | 8.82 | 8.82 |
| 8B-8bit | 3.04 | 0.77 | 64.4 | 80.6 | 21.6 | 21.3 | 8.82 | 8.84 |
| 14B-4bit | 1.95 | 1.01 | 43.4 | 43.2 | 16.2 | 16.5 | 8.84 | 11.09* |
| 14B-6bit | 3.41 | 3.07 | 39.7 | 37.0 | 12.1 | 12.0 | 12.08 | 12.16 |
\* Memory doubling observed, likely because the mmap region coexists with materialized tensor copies.
| Model | Mode | First-token latency (ms) | Prompt (t/s) | Gen (t/s) | Peak Mem (GB) | +Mem (GB) |
|---|---|---|---|---|---|---|
| Qwen3-4B-4bit | Dynamic | 260.4 | 200.4 | 62.0 | 2.221 | — |
| Qwen3-4B-4bit | Pre-alloc 2k | 276.2 | 183.5 | 63.0 | 2.471 | +0.250 |
| Qwen3-4B-4bit | Pre-alloc 4k | 285.2 | 171.9 | 62.3 | 2.736 | +0.515 |
| Qwen3-8B-4bit | Dynamic | 419.2 | 100.5 | 31.3 | 4.395 | — |
| Qwen3-8B-4bit | Pre-alloc 2k | 428.0 | 99.1 | 30.3 | 4.633 | +0.238 |
| Qwen3-8B-4bit | Pre-alloc 4k | 436.1 | 98.0 | 29.7 | 4.908 | +0.513 |
| Qwen3-14B-4bit | Dynamic | 859.6 | 44.1 | 16.7 | 7.827 | — |
| Qwen3-14B-4bit | Pre-alloc 2k | 917.4 | 41.1 | 16.0 | 8.101 | +0.274 |
| Qwen3-14B-4bit | Pre-alloc 4k | 901.5 | 41.7 | 16.1 | 8.414 | +0.587 |
Get started with OptMLX in seconds
```shell
# Zero-copy mmap loading
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --use-mmap \
  --prompt "Explain quantum computing"

# KV cache pre-allocation (4096 tokens)
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --pre-allocate-kv 4096 \
  --prompt "Write a long essay about AI"

# Both optimizations combined
python -m mlx_lm.generate \
  --model Qwen/Qwen3-8B-MLX-4bit \
  --use-mmap --pre-allocate-kv 4096 \
  --prompt "Hello!"
```
```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Load with mmap zero-copy
model, tokenizer = load(
    "Qwen/Qwen3-8B-MLX-4bit",
    use_mmap=True,
)

# Create a pre-allocated KV cache
cache = make_prompt_cache(
    model,
    max_context_length=4096,
)

# Generate, reusing the pre-allocated cache
response = generate(
    model,
    tokenizer,
    prompt="Hello!",
    max_tokens=200,
    prompt_cache=cache,
)
```
Read the full technical report
AtomGradient, 2026
An exploratory study on whether llama.cpp's memory management strategies can improve MLX. Includes implementation details, benchmark data across 8 models, and analysis of why MLX's existing design is already well-suited to its use case.
Download PDF