Abstract
We present Gemma-Prune, a multi-stage model compression pipeline that reduces the Gemma 3 4B IT QAT vision-language model from 2.8 GB to 2.1 GB while preserving both text generation and image understanding capabilities. Our approach combines vocabulary pruning (262K→144K tokens), vision encoder quantization with dimension padding, text layer removal, image resolution reduction (896→672 pixels), dead neuron pruning, and weight splitting for lazy loading.
Deployed on Apple Silicon via MLX Swift, the compressed model achieves 22% faster text generation (110 vs 90 tokens/s), 3.4x faster image prompt processing (184 vs 54 tokens/s), and 23% lower peak memory (2.2 GB vs 2.9 GB for text-only). We identify critical failure modes: removing vision layers destroys image understanding despite saving only 35 MB, and 448px resolution causes token repetition loops. The optimal 672px resolution reduces vision attention compute by ~3x without quality loss.
Models
Original
- 34 text layers, 262K vocab
- 896px vision, 256 image tokens
- 27-layer SigLIP encoder
- QAT 4-bit quantized
gemma-3-4b-it-qat-4bit-lite
- 31 text layers, 144K compact vocab
- 672px vision, 144 image tokens
- Vision fc2 4-bit quantized
- Single-file weights
gemma-3-4b-it-qat-4bit-mobile
- 31 layers, per-layer MLP pruning
- Layers 14-30: -25% neurons
- Split: 1.9 GB lang + 231 MB vision
- Text-only runtime: ~2.2 GB
Compression Pipeline
Seven sequential stages, each producing a validated checkpoint. Total savings: 709 MB (25%).
Vocabulary Pruning
262K → 144K tokens. Remove CJK, Arabic, and Cyrillic tokens unused in English text. Create a token_map for ID remapping.
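A minimal sketch of this stage in Python with MLX, assuming a single safetensors shard; the paths, the ASCII keep rule (see the BPE-coverage note under Key Technical Methods), and the token_map.json layout are illustrative rather than the pipeline's actual code:

```python
import json
import mlx.core as mx
from transformers import AutoTokenizer

SRC, DST = "models/original", "models/step1-vocab"    # illustrative paths
OLD_VOCAB = 262208

tok = AutoTokenizer.from_pretrained(SRC)

# Keep special tokens, byte-fallback tokens, and every piece whose text is
# plain ASCII (the ASCII vocabulary scan that restores BPE coverage).
keep = set(tok.all_special_ids)
for piece, tid in tok.get_vocab().items():
    if piece.startswith("<0x") or piece.replace("\u2581", " ").isascii():
        keep.add(tid)

old_ids = sorted(i for i in keep if i < OLD_VOCAB)
token_map = {o: n for n, o in enumerate(old_ids)}     # original ID -> compact ID

weights = mx.load(f"{SRC}/model.safetensors")
rows = mx.array(old_ids)
for name, w in weights.items():
    if w.shape[0] == OLD_VOCAB:                       # embed_tokens weight/scales/biases
        weights[name] = mx.take(w, rows, axis=0)

mx.save_safetensors(f"{DST}/model.safetensors", weights)
with open(f"{DST}/token_map.json", "w") as f:
    json.dump(token_map, f)                           # config vocab_size -> len(old_ids)
```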
Vision fc2 Quantization
bf16 → 4-bit for all 27 SigLIP layers. Zero-pad intermediate dim 4304 → 4352, since 4304 = 16 × 269 (269 prime) is not divisible by any MLX group size.
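A sketch of the padding-plus-quantization step using MLX's quantize API; the checkpoint paths and vision key pattern are assumptions:

```python
import mlx.core as mx

GROUP_SIZE, BITS, PAD_TO = 64, 4, 4352                # 4352 = 68 * 64

weights = mx.load("models/step1-vocab/model.safetensors")        # illustrative path
fc2_keys = [k for k in weights
            if "vision_tower" in k and k.endswith("mlp.fc2.weight")]

for name in fc2_keys:
    w = weights[name]                                  # bf16, shape (hidden, 4304)
    w = mx.pad(w, [(0, 0), (0, PAD_TO - w.shape[1])])  # 48 zero input columns
    wq, scales, biases = mx.quantize(w, group_size=GROUP_SIZE, bits=BITS)
    weights[name] = wq
    weights[name.replace(".weight", ".scales")] = scales
    weights[name.replace(".weight", ".biases")] = biases

mx.save_safetensors("models/step2-fc2/model.safetensors", weights)
# The loader must mark these layers as 4-bit (group_size 64) and zero-pad the
# 4304-dim fc2 input to 4352 at runtime so the extra columns only see zeros.
```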
Text Layer Pruning
Remove layers 31, 32, 33 from 34-layer transformer. Deepest layers are most redundant.
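A sketch of the layer drop, assuming HF-style layers.{i} key naming and illustrative checkpoint paths:

```python
import mlx.core as mx

DROP = ("layers.31.", "layers.32.", "layers.33.")      # deepest three text layers

weights = mx.load("models/step2-fc2/model.safetensors")          # illustrative path
weights = {k: v for k, v in weights.items()
           if not ("language_model" in k and any(d in k for d in DROP))}
mx.save_safetensors("models/step3-layers/model.safetensors", weights)
# config.json: num_hidden_layers 34 -> 31. Since the dropped layers are the
# last three, the surviving indices 0-30 stay contiguous and need no renumbering.
```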
Resolution Reduction
896 → 672px. Patches 4096 → 2304. Position embedding bilinear interpolation. ~3x less vision attention compute.
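A sketch of the position-embedding resize, assuming a (4096, dim) table stored under an HF-style key and using SciPy's bilinear zoom; the key name and paths are illustrative:

```python
import mlx.core as mx
import numpy as np
from scipy.ndimage import zoom

OLD_SIDE, NEW_SIDE = 64, 48      # 896/14 = 64 patches per side -> 672/14 = 48

weights = mx.load("models/step3-layers/model.safetensors")       # illustrative path
key = "vision_tower.vision_model.embeddings.position_embedding.weight"  # illustrative
pos = np.array(weights[key].astype(mx.float32))                  # (4096, dim)

grid = pos.reshape(OLD_SIDE, OLD_SIDE, -1)
grid = zoom(grid, (NEW_SIDE / OLD_SIDE, NEW_SIDE / OLD_SIDE, 1), order=1)  # bilinear
weights[key] = mx.array(grid.reshape(NEW_SIDE * NEW_SIDE, -1)).astype(mx.bfloat16)
mx.save_safetensors("models/step4-672px/model.safetensors", weights)
# The image preprocessor's target size drops from 896 to 672 in the same step,
# so the encoder sees 48*48 = 2304 patches pooled down to 144 image tokens.
```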
Vision Layer Removal
Attempted and reverted: removing SigLIP layers destroys image understanding completely (pizza misidentified as "skin texture").
MLP Neuron Pruning
Layers 14-30 show 60-100% dead neurons. Remove 25% of neurons per layer, aligned to group_size=64; pruned widths are recorded as per-layer intermediate sizes.
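A sketch of the weight surgery, assuming HF-style key names and a hypothetical keep_neurons.json holding the per-layer keep lists (see the profiling sketch under Dead Neuron Pruning); quantized projections are dequantized, sliced, then re-quantized:

```python
import json
import mlx.core as mx

weights = mx.load("models/step4-672px/model.safetensors")        # illustrative path
keep = json.load(open("keep_neurons.json"))          # hypothetical: layer -> kept neuron IDs
sizes = {}

for layer, ids in keep.items():
    ids = mx.array(sorted(ids))
    base = f"language_model.model.layers.{layer}.mlp."           # illustrative key prefix
    # gate/up map hidden -> intermediate (prune rows); down maps back (prune columns)
    for proj, axis in (("gate_proj", 0), ("up_proj", 0), ("down_proj", 1)):
        w = mx.dequantize(weights[base + proj + ".weight"],
                          weights[base + proj + ".scales"],
                          weights[base + proj + ".biases"],
                          group_size=64, bits=4)
        wq, s, b = mx.quantize(mx.take(w, ids, axis=axis), group_size=64, bits=4)
        weights[base + proj + ".weight"] = wq
        weights[base + proj + ".scales"] = s
        weights[base + proj + ".biases"] = b
    sizes[layer] = ids.size                           # already a multiple of 64

mx.save_safetensors("models/step6-pruned/model.safetensors", weights)
json.dump(sizes, open("per_layer_intermediate_sizes.json", "w"))  # read by the loader
```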
Weight Splitting
Separate language (1.9 GB) and vision (231 MB). Text-only chat loads only language weights.
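A sketch of the split, assuming vision weights are identified by key prefix; prefixes and file names are illustrative:

```python
import mlx.core as mx

weights = mx.load("models/step6-pruned/model.safetensors")       # illustrative path
VISION = ("vision_tower.", "multi_modal_projector.")             # illustrative key prefixes

vision   = {k: v for k, v in weights.items() if k.startswith(VISION)}
language = {k: v for k, v in weights.items() if not k.startswith(VISION)}

mx.save_safetensors("models/mobile/language.safetensors", language)   # ~1.9 GB
mx.save_safetensors("models/mobile/vision.safetensors", vision)       # ~231 MB
# A text-only session loads just language.safetensors; vision.safetensors is
# loaded lazily the first time an --image argument is passed.
```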
Key Technical Methods
Vocabulary Pruning with Token Map
The original embedding $\mathbf{E} \in \mathbb{R}^{262208 \times d}$ is compressed to $\mathbf{E}' \in \mathbb{R}^{144257 \times d}$ via a token map $\mathbf{M}$ between compact and original token IDs.
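One plausible formulation, assuming $\mathbf{M}$ sends each compact token ID $i$ to its original ID, is a simple row gather:

$$\mathbf{E}'_{i,:} = \mathbf{E}_{\mathbf{M}(i),:}, \qquad i = 0, \dots, 144256.$$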
Critical Finding: BPE Coverage
Dictionary-only pruning (80K tokens) misses BPE-merged subword tokens essential for generation. Adding an ASCII vocabulary scan increased coverage to 144K tokens and resolved all quality issues. BPE coverage, not dictionary coverage, is the binding constraint.
Vision fc2 Dimension Padding
SigLIP's intermediate dimension $4304 = 16 \times 269$ (269 is prime) is not divisible by any MLX group size. Zero-padding to 4352 is mathematically equivalent.
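To see why, write fc2 as $\mathbf{W}_2 \in \mathbb{R}^{d \times 4304}$ acting on a hidden activation $\mathbf{h} \in \mathbb{R}^{4304}$, and append 48 zero columns to $\mathbf{W}_2$ and 48 zeros to $\mathbf{h}$ so that $4352 = 68 \times 64$ splits evenly into quantization groups:

$$\begin{bmatrix}\mathbf{W}_2 & \mathbf{0}_{d \times 48}\end{bmatrix}\begin{bmatrix}\mathbf{h} \\ \mathbf{0}_{48}\end{bmatrix} = \mathbf{W}_2 \mathbf{h}.$$

Whatever values the padded columns take after quantization, they only ever multiply zero activations, so the padding contributes exactly zero to every output.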
Dead Neuron Pruning
Activation profiling over 20 forward passes reveals massive neuron death in deep layers:
Layers 14-30 have 60-100% dead neurons at threshold $\tau = 0.5$. Each layer's pruned dimension is stored in per_layer_intermediate_sizes.
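A sketch of the analysis, assuming per-neuron peak activation magnitudes from the 20 calibration passes have already been dumped to a hypothetical mlp_activation_peaks.npz:

```python
import json
import numpy as np

TAU, GROUP = 0.5, 64
stats = np.load("mlp_activation_peaks.npz")    # hypothetical: per-layer max |activation| per neuron

keep_neurons = {}
for layer in stats.files:                      # "14" ... "30"
    peaks = stats[layer]
    dead = float((peaks < TAU).mean())         # dead fraction at threshold tau = 0.5
    width = len(peaks) - len(peaks) // 4       # drop ~25% of the layer ...
    width -= width % GROUP                     # ... rounded down to a multiple of group_size
    order = np.argsort(peaks)[::-1]            # most active neurons first
    keep_neurons[layer] = sorted(int(i) for i in order[:width])
    print(f"layer {layer}: {dead:.0%} dead, intermediate {len(peaks)} -> {width}")

json.dump(keep_neurons, open("keep_neurons.json", "w"))   # consumed by the MLP pruning stage
```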
Resolution vs Attention Cost
Vision self-attention scales quadratically with patch count $n$. A 25% linear resolution reduction yields ~68% compute savings:
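$$\frac{\text{attention FLOPs}_{672}}{\text{attention FLOPs}_{896}} = \left(\frac{2304}{4096}\right)^{2} = \left(\frac{3}{4}\right)^{4} \approx 0.32,$$

i.e. roughly 68% fewer vision self-attention FLOPs, consistent with the ~3x figure quoted above.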
Benchmarks
Apple Silicon, greedy decoding (temperature=0.0).
Text Generation
| Model | Disk | Prompt (t/s) | Generation (t/s) | Peak Memory |
|---|---|---|---|---|
| Original | 2.8 GB | 109 | 90 | 2910 MB |
| Lite (Step 4) | 2.3 GB | ~120 | ~110 | ~2500 MB |
| Mobile (Step 7) | 2.1 GB | 120 | 110 | 2231 MB |
Image Understanding
| Model | Prompt (t/s) | Generation (t/s) | Peak Memory | Quality |
|---|---|---|---|---|
| Original (896px) | 54 | 27 | ~5500 MB | Excellent |
| Step 3 (896px) | 73 | 61 | 4850 MB | Good |
| Mobile (672px) | 184 | 104 | 4358 MB | Good |
Failed Experiments
448px Resolution
Token repetition loops — insufficient visual information causes degenerate cyclic generation.
Remove Vision Layers 12-15 (-35 MB)
Complete hallucination — pizza misidentified as "skin texture". SigLIP encoder has near-zero redundancy.
80K Vocabulary (v1, dictionary-only)
Generation quality collapse. Missing BPE-merged tokens cause fragmented output. Fixed in v2 with ASCII vocab scan (144K tokens).
Quick Start
1. Clone
git clone https://github.com/AtomGradient/swift-gemma-cli.git gemma-cli
git clone https://github.com/AtomGradient/mlx-swift-lm.git mlx-swift-lm
2. Download Model
pip install huggingface_hub
huggingface-cli download AtomGradient/gemma-3-4b-it-qat-4bit-mobile --local-dir models/mobile
3. Build & Run
cd gemma-cli
swift build -c release
# Text generation
swift run -c release gemma-cli models/mobile \
--prompt "Explain quantum computing." --max-tokens 200 --temperature 0.0
# Image understanding
swift run -c release gemma-cli models/mobile \
--image photo.jpg \
--prompt "Describe this image in detail." --max-tokens 200
Citation
@article{atomgradient2025gemmaprune,
title={Gemma-Prune: A Multi-Stage Compression Pipeline for Deploying
Gemma 3 4B Vision-Language Model on Mobile Devices},
author={AtomGradient},
year={2025},
url={https://github.com/AtomGradient/swift-gemma-cli}
}