Abstract
We present Gemma-Prune, a multi-stage model compression pipeline that reduces the Gemma 3 4B IT QAT vision-language model from 2.8 GB to 2.1 GB while preserving both text generation and image understanding capabilities. Our approach combines vocabulary pruning (262K→144K tokens), vision encoder quantization with dimension padding, text layer removal, image resolution reduction (896→672 pixels), dead neuron pruning, and weight splitting for lazy loading.
Deployed on Apple Silicon via MLX Swift, the compressed model achieves 22% faster text generation (110 vs 90 tokens/s), 3.4x faster image prompt processing (184 vs 54 tokens/s), and 23% lower peak memory (2.2 GB vs 2.9 GB for text-only). We identify critical failure modes: removing vision layers destroys image understanding despite saving only 35 MB, and 448px resolution causes token repetition loops. The optimal 672px resolution reduces vision attention compute by ~3x without quality loss.
Models
Original
- 34 text layers, 262K vocab
- 896px vision, 256 image tokens
- 27-layer SigLIP encoder
- QAT 4-bit quantized
gemma-3-4b-it-qat-4bit-lite
- 31 text layers, 144K compact vocab
- 672px vision, 144 image tokens
- Vision fc2 4-bit quantized
- Single-file weights
gemma-3-4b-it-qat-4bit-mobile
- 31 layers, per-layer MLP pruning
- Layers 14-30: -25% neurons
- Split: 1.9 GB lang + 231 MB vision
- Text-only runtime: ~2.2 GB
Compression Pipeline
Seven sequential stages, each producing a validated checkpoint. Total savings: 709 MB (25%).
Vocabulary Pruning
262K → 144K tokens. Remove CJK, Arabic, and Cyrillic tokens unused in English text. Create a token_map for ID remapping.
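A minimal sketch of this stage in Python with MLX, assuming a single safetensors shard; the paths, the ASCII keep rule (see the BPE-coverage note under Key Technical Methods), and the token_map.json layout are illustrative rather than the pipeline's actual code:

```python
import json
import mlx.core as mx
from transformers import AutoTokenizer

SRC, DST = "models/original", "models/step1-vocab"    # illustrative paths
OLD_VOCAB = 262208

tok = AutoTokenizer.from_pretrained(SRC)

# Keep special tokens, byte-fallback tokens, and every piece whose text is
# plain ASCII (the ASCII vocabulary scan that restores BPE coverage).
keep = set(tok.all_special_ids)
for piece, tid in tok.get_vocab().items():
    if piece.startswith("<0x") or piece.replace("\u2581", " ").isascii():
        keep.add(tid)

old_ids = sorted(i for i in keep if i < OLD_VOCAB)
token_map = {o: n for n, o in enumerate(old_ids)}     # original ID -> compact ID

weights = mx.load(f"{SRC}/model.safetensors")
rows = mx.array(old_ids)
for name, w in weights.items():
    if w.shape[0] == OLD_VOCAB:                       # embed_tokens weight/scales/biases
        weights[name] = mx.take(w, rows, axis=0)

mx.save_safetensors(f"{DST}/model.safetensors", weights)
with open(f"{DST}/token_map.json", "w") as f:
    json.dump(token_map, f)                           # config vocab_size -> len(old_ids)
```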
Vision fc2 Quantization
bf16 → 4-bit for all 27 SigLIP layers. Zero-pad intermediate dim 4304 → 4352, since 4304 = 16 × 269 (269 prime) is not divisible by any MLX group size.
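A sketch of the padding-plus-quantization step using MLX's quantize API; the checkpoint paths and vision key pattern are assumptions:

```python
import mlx.core as mx

GROUP_SIZE, BITS, PAD_TO = 64, 4, 4352                # 4352 = 68 * 64

weights = mx.load("models/step1-vocab/model.safetensors")        # illustrative path
fc2_keys = [k for k in weights
            if "vision_tower" in k and k.endswith("mlp.fc2.weight")]

for name in fc2_keys:
    w = weights[name]                                  # bf16, shape (hidden, 4304)
    w = mx.pad(w, [(0, 0), (0, PAD_TO - w.shape[1])])  # 48 zero input columns
    wq, scales, biases = mx.quantize(w, group_size=GROUP_SIZE, bits=BITS)
    weights[name] = wq
    weights[name.replace(".weight", ".scales")] = scales
    weights[name.replace(".weight", ".biases")] = biases

mx.save_safetensors("models/step2-fc2/model.safetensors", weights)
# The loader must mark these layers as 4-bit (group_size 64) and zero-pad the
# 4304-dim fc2 input to 4352 at runtime so the extra columns only see zeros.
```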
Text Layer Pruning
Remove layers 31, 32, 33 from 34-layer transformer. Deepest layers are most redundant.
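A sketch of the layer drop, assuming HF-style layers.{i} key naming and illustrative checkpoint paths:

```python
import mlx.core as mx

DROP = ("layers.31.", "layers.32.", "layers.33.")      # deepest three text layers

weights = mx.load("models/step2-fc2/model.safetensors")          # illustrative path
weights = {k: v for k, v in weights.items()
           if not ("language_model" in k and any(d in k for d in DROP))}
mx.save_safetensors("models/step3-layers/model.safetensors", weights)
# config.json: num_hidden_layers 34 -> 31. Since the dropped layers are the
# last three, the surviving indices 0-30 stay contiguous and need no renumbering.
```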
Resolution Reduction
896 → 672px. Patches 4096 → 2304. Position embedding bilinear interpolation. ~3x less vision attention compute.
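A sketch of the position-embedding resize, assuming a (4096, dim) table stored under an HF-style key and using SciPy's bilinear zoom; the key name and paths are illustrative:

```python
import mlx.core as mx
import numpy as np
from scipy.ndimage import zoom

OLD_SIDE, NEW_SIDE = 64, 48      # 896/14 = 64 patches per side -> 672/14 = 48

weights = mx.load("models/step3-layers/model.safetensors")       # illustrative path
key = "vision_tower.vision_model.embeddings.position_embedding.weight"  # illustrative
pos = np.array(weights[key].astype(mx.float32))                  # (4096, dim)

grid = pos.reshape(OLD_SIDE, OLD_SIDE, -1)
grid = zoom(grid, (NEW_SIDE / OLD_SIDE, NEW_SIDE / OLD_SIDE, 1), order=1)  # bilinear
weights[key] = mx.array(grid.reshape(NEW_SIDE * NEW_SIDE, -1)).astype(mx.bfloat16)
mx.save_safetensors("models/step4-672px/model.safetensors", weights)
# The image preprocessor's target size drops from 896 to 672 in the same step,
# so the encoder sees 48*48 = 2304 patches pooled down to 144 image tokens.
```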
Vision Layer Removal
Attempted and reverted: removing SigLIP layers destroys image understanding completely (pizza misidentified as "skin texture").
MLP Neuron Pruning
Layers 14-30 show 60-100% dead neurons. Remove 25% of neurons per layer, aligned to group_size=64; pruned widths are recorded as per-layer intermediate sizes.
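A sketch of the weight surgery, assuming HF-style key names and a hypothetical keep_neurons.json holding the per-layer keep lists (see the profiling sketch under Dead Neuron Pruning); quantized projections are dequantized, sliced, then re-quantized:

```python
import json
import mlx.core as mx

weights = mx.load("models/step4-672px/model.safetensors")        # illustrative path
keep = json.load(open("keep_neurons.json"))          # hypothetical: layer -> kept neuron IDs
sizes = {}

for layer, ids in keep.items():
    ids = mx.array(sorted(ids))
    base = f"language_model.model.layers.{layer}.mlp."           # illustrative key prefix
    # gate/up map hidden -> intermediate (prune rows); down maps back (prune columns)
    for proj, axis in (("gate_proj", 0), ("up_proj", 0), ("down_proj", 1)):
        w = mx.dequantize(weights[base + proj + ".weight"],
                          weights[base + proj + ".scales"],
                          weights[base + proj + ".biases"],
                          group_size=64, bits=4)
        wq, s, b = mx.quantize(mx.take(w, ids, axis=axis), group_size=64, bits=4)
        weights[base + proj + ".weight"] = wq
        weights[base + proj + ".scales"] = s
        weights[base + proj + ".biases"] = b
    sizes[layer] = ids.size                           # already a multiple of 64

mx.save_safetensors("models/step6-pruned/model.safetensors", weights)
json.dump(sizes, open("per_layer_intermediate_sizes.json", "w"))  # read by the loader
```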
Weight Splitting
Separate language (1.9 GB) and vision (231 MB). Text-only chat loads only language weights.
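A sketch of the split, assuming vision weights are identified by key prefix; prefixes and file names are illustrative:

```python
import mlx.core as mx

weights = mx.load("models/step6-pruned/model.safetensors")       # illustrative path
VISION = ("vision_tower.", "multi_modal_projector.")             # illustrative key prefixes

vision   = {k: v for k, v in weights.items() if k.startswith(VISION)}
language = {k: v for k, v in weights.items() if not k.startswith(VISION)}

mx.save_safetensors("models/mobile/language.safetensors", language)   # ~1.9 GB
mx.save_safetensors("models/mobile/vision.safetensors", vision)       # ~231 MB
# A text-only session loads just language.safetensors; vision.safetensors is
# loaded lazily the first time an --image argument is passed.
```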
Key Technical Methods
Vocabulary Pruning with Token Map
The original embedding $\mathbf{E} \in \mathbb{R}^{262208 \times d}$ is compressed to $\mathbf{E}' \in \mathbb{R}^{144257 \times d}$ via a token map $\mathbf{M}$ between compact and original token IDs.
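One plausible formulation, assuming $\mathbf{M}$ sends each compact token ID $i$ to its original ID, is a simple row gather:

$$\mathbf{E}'_{i,:} = \mathbf{E}_{\mathbf{M}(i),:}, \qquad i = 0, \dots, 144256.$$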
Critical Finding: BPE Coverage
Dictionary-only pruning (80K tokens) misses BPE-merged subword tokens essential for generation. Adding an ASCII vocabulary scan increased coverage to 144K tokens and resolved all quality issues. BPE coverage, not dictionary coverage, is the binding constraint.
Vision fc2 Dimension Padding
SigLIP's intermediate dimension $4304 = 16 \times 269$ (269 is prime) is not divisible by any MLX group size. Zero-padding to 4352 is mathematically equivalent.
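To see why, write fc2 as $\mathbf{W}_2 \in \mathbb{R}^{d \times 4304}$ acting on a hidden activation $\mathbf{h} \in \mathbb{R}^{4304}$, and append 48 zero columns to $\mathbf{W}_2$ and 48 zeros to $\mathbf{h}$ so that $4352 = 68 \times 64$ splits evenly into quantization groups:

$$\begin{bmatrix}\mathbf{W}_2 & \mathbf{0}_{d \times 48}\end{bmatrix}\begin{bmatrix}\mathbf{h} \\ \mathbf{0}_{48}\end{bmatrix} = \mathbf{W}_2 \mathbf{h}.$$

Whatever values the padded columns take after quantization, they only ever multiply zero activations, so the padding contributes exactly zero to every output.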
Dead Neuron Pruning
Activation profiling over 20 forward passes reveals massive neuron death in deep layers:
Layers 14-30 have 60-100% dead neurons at threshold $\tau = 0.5$. Each layer's pruned dimension is stored in per_layer_intermediate_sizes.
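A sketch of the analysis, assuming per-neuron peak activation magnitudes from the 20 calibration passes have already been dumped to a hypothetical mlp_activation_peaks.npz:

```python
import json
import numpy as np

TAU, GROUP = 0.5, 64
stats = np.load("mlp_activation_peaks.npz")    # hypothetical: per-layer max |activation| per neuron

keep_neurons = {}
for layer in stats.files:                      # "14" ... "30"
    peaks = stats[layer]
    dead = float((peaks < TAU).mean())         # dead fraction at threshold tau = 0.5
    width = len(peaks) - len(peaks) // 4       # drop ~25% of the layer ...
    width -= width % GROUP                     # ... rounded down to a multiple of group_size
    order = np.argsort(peaks)[::-1]            # most active neurons first
    keep_neurons[layer] = sorted(int(i) for i in order[:width])
    print(f"layer {layer}: {dead:.0%} dead, intermediate {len(peaks)} -> {width}")

json.dump(keep_neurons, open("keep_neurons.json", "w"))   # consumed by the MLP pruning stage
```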
Resolution vs Attention Cost
Vision self-attention scales quadratically with patch count $n$. A 25% linear resolution reduction yields ~68% compute savings:
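$$\frac{\text{attention FLOPs}_{672}}{\text{attention FLOPs}_{896}} = \left(\frac{2304}{4096}\right)^{2} = \left(\frac{3}{4}\right)^{4} \approx 0.32,$$

i.e. roughly 68% fewer vision self-attention FLOPs, consistent with the ~3x figure quoted above.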
Benchmarks
Apple Silicon, greedy decoding (temperature=0.0).
Text Generation
| Model | Disk | Prompt (t/s) | Generation (t/s) | Peak Memory |
|---|---|---|---|---|
| Original | 2.8 GB | 109 | 90 | 2910 MB |
| Lite (Step 4) | 2.3 GB | ~120 | ~110 | ~2500 MB |
| Mobile (Step 7) | 2.1 GB | 120 | 110 | 2231 MB |
Image Understanding
| Model | Prompt (t/s) | Generation (t/s) | Peak Memory | Quality |
|---|---|---|---|---|
| Original (896px) | 54 | 27 | ~5500 MB | Excellent |
| Step 3 (896px) | 73 | 61 | 4850 MB | Good |
| Mobile (672px) | 184 | 104 | 4358 MB | Good |
Failed Experiments
448px Resolution
Token repetition loops — insufficient visual information causes degenerate cyclic generation.
Remove Vision Layers 12-15 (-35 MB)
Complete hallucination — pizza misidentified as "skin texture". SigLIP encoder has near-zero redundancy.
80K Vocabulary (v1, dictionary-only)
Generation quality collapse. Missing BPE-merged tokens cause fragmented output. Fixed in v2 with ASCII vocab scan (144K tokens).
Quick Start
1. Clone
git clone https://github.com/AtomGradient/swift-gemma-cli.git gemma-cli
git clone https://github.com/AtomGradient/mlx-swift-lm.git mlx-swift-lm
2. Download Model
pip install huggingface_hub
huggingface-cli download AtomGradient/gemma-3-4b-it-qat-4bit-mobile --local-dir models/mobile
3. Build & Run
cd gemma-cli
swift build -c release
# Text generation
swift run -c release gemma-cli models/mobile \
--prompt "Explain quantum computing." --max-tokens 200 --temperature 0.0
# Image understanding
swift run -c release gemma-cli models/mobile \
--image photo.jpg \
--prompt "Describe this image in detail." --max-tokens 200
Citation
@article{atomgradient2025gemmaprune,
title={Gemma-Prune: A Multi-Stage Compression Pipeline for Deploying
Gemma 3 4B Vision-Language Model on Mobile Devices},
author={AtomGradient},
year={2025},
url={https://github.com/AtomGradient/swift-gemma-cli}
}