Gemma-Prune

A multi-stage compression pipeline for deploying Gemma 3 4B Vision-Language Model on Apple Silicon mobile devices via MLX Swift.

  • Model size: 2.8 → 2.1 GB (25% smaller)
  • Image prompt processing: 3.4x faster
  • Peak memory: 23% lower
  • Text generation: 22% faster

Abstract

We present Gemma-Prune, a multi-stage model compression pipeline that reduces the Gemma 3 4B IT QAT vision-language model from 2.8 GB to 2.1 GB while preserving both text generation and image understanding capabilities. Our approach combines vocabulary pruning (262K→144K tokens), vision encoder quantization with dimension padding, text layer removal, image resolution reduction (896→672 pixels), dead neuron pruning, and weight splitting for lazy loading.

Deployed on Apple Silicon via MLX Swift, the compressed model achieves 22% faster text generation (110 vs 90 tokens/s), 3.4x faster image prompt processing (184 vs 54 tokens/s), and 23% lower peak memory (2.2 GB vs 2.9 GB for text-only). We identify critical failure modes: removing vision layers destroys image understanding despite saving only 35 MB, and 448px resolution causes token repetition loops. The optimal 672px resolution reduces vision attention compute by ~3x without quality loss.

Models

BASELINE: Original, 2.8 GB (HuggingFace)
  • 34 text layers, 262K vocab
  • 896px vision, 256 image tokens
  • 27-layer SigLIP encoder
  • QAT 4-bit quantized

LITE: gemma-3-4b-it-qat-4bit-lite, 2.3 GB (HuggingFace)
  • 31 text layers, 144K compact vocab
  • 672px vision, 144 image tokens
  • Vision fc2 4-bit quantized
  • Single-file weights

MOBILE: gemma-3-4b-it-qat-4bit-mobile, 2.1 GB (HuggingFace)
  • 31 layers, per-layer MLP pruning
  • Layers 14-30: -25% neurons
  • Split: 1.9 GB lang + 231 MB vision
  • Text-only runtime: ~2.2 GB

Compression Pipeline

Seven sequential stages, each producing a validated checkpoint. Total savings: 709 MB (25%).

1. Vocabulary Pruning (-170 MB)
   262K → 144K tokens. Remove CJK/Arabic/Cyrillic tokens unused for English. Create token_map for ID remapping.

2. Vision fc2 Quantization (-191 MB)
   bf16 → 4-bit for 27 SigLIP layers. Zero-pad the intermediate dim 4304 → 4352 (4304 = 16 × 269 with 269 prime, so it is not divisible by any supported group_size).

3. Text Layer Pruning (-159 MB)
   Remove layers 31, 32, 33 from the 34-layer transformer. The deepest layers are the most redundant.

4. Resolution Reduction (runtime)
   896 → 672px. Patches 4096 → 2304. Position embeddings are bilinearly interpolated. ~3x less vision attention compute.

5. Vision Layer Removal (FAILED)
   Abandoned: removing vision layers destroys image understanding completely; a pizza was misidentified as "skin texture".

6. MLP Neuron Pruning (-188 MB)
   Layers 14-30 show 60-100% dead neurons. Remove 25% of neurons per layer, aligned to group_size=64, with per-layer intermediate sizes.

7. Weight Splitting (runtime)
   Separate language (1.9 GB) and vision (231 MB) weight files. Text-only chat loads only the language weights.
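A minimal sketch of the Step 7 split in MLX Python, assuming MLX-VLM-style key prefixes ("language_model." / "vision_tower.") and illustrative file names; the released checkpoints may use different prefixes and names:

from pathlib import Path
import mlx.core as mx

weights = mx.load("models/lite/model.safetensors")  # single-file Lite checkpoint

# Everything that is not part of the vision tower goes into the language file.
lang   = {k: v for k, v in weights.items() if not k.startswith("vision_tower.")}
vision = {k: v for k, v in weights.items() if k.startswith("vision_tower.")}

out = Path("models/mobile")
mx.save_safetensors(str(out / "language.safetensors"), lang)    # ~1.9 GB
mx.save_safetensors(str(out / "vision.safetensors"), vision)    # ~231 MB
# A text-only session then loads only the language file.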

Key Technical Methods

Vocabulary Pruning with Token Map

The original embedding $\mathbf{E} \in \mathbb{R}^{262208 \times d}$ is compressed to $\mathbf{E}' \in \mathbb{R}^{144257 \times d}$ via a token map $\mathbf{M}$:

$$\mathbf{h} = \mathbf{E}'[\mathbf{M}[t]], \quad \mathbf{M}[t] = \begin{cases} k & \text{if token } t \text{ is retained (}k\text{-th)} \\ 0 & \text{if token } t \text{ is pruned (zero vector)} \end{cases}$$
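A minimal NumPy sketch of this remapping, assuming the retained token IDs (keep_ids) have already been collected by the Step 1 scans; function and variable names are illustrative, not the repo's exact ones:

import numpy as np

def prune_embeddings(embed: np.ndarray, keep_ids: list[int]):
    """embed: (262208, d) table -> (len(keep_ids), d) compact table plus token_map."""
    keep_ids = sorted(keep_ids)
    token_map = np.zeros(embed.shape[0], dtype=np.int32)   # pruned tokens map to row 0
    for new_id, old_id in enumerate(keep_ids):
        token_map[old_id] = new_id
    compact = embed[np.array(keep_ids)]                     # E' = retained rows of E
    return compact, token_map

# At inference a token id t is embedded as compact[token_map[t]], i.e. h = E'[M[t]].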

Critical Finding: BPE Coverage

Dictionary-only pruning (80K tokens) misses BPE-merged subword tokens essential for generation. Adding an ASCII vocabulary scan increased coverage to 144K tokens and resolved all quality issues. BPE coverage, not dictionary coverage, is the binding constraint.

Vision fc2 Dimension Padding

SigLIP's intermediate dimension $4304 = 16 \times 269$ (269 is prime) can't be divided by any MLX group size. Zero-padding to 4352 is mathematically equivalent:

$$\mathbf{W}'_{\text{fc2}} \cdot \begin{bmatrix} \mathbf{a} \\ \mathbf{0}_{48} \end{bmatrix} = [\mathbf{W}_{\text{fc2}} \mid \mathbf{0}] \cdot \begin{bmatrix} \mathbf{a} \\ \mathbf{0}_{48} \end{bmatrix} = \mathbf{W}_{\text{fc2}} \cdot \mathbf{a}$$
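A minimal sketch of the padding step in MLX Python, assuming group_size=64 and the usual (out_features, in_features) weight layout for fc2; because the appended columns are zero and the fc1 activation is padded with 48 trailing zeros at runtime, the product is unchanged, as the identity above shows:

import mlx.core as mx

def quantize_fc2(w_fc2: mx.array, group_size: int = 64, bits: int = 4):
    """w_fc2: (1152, 4304) bf16 weight. Pad the input dim to 4352, then quantize."""
    pad = (-w_fc2.shape[1]) % group_size                     # 4352 - 4304 = 48
    w_padded = mx.pad(w_fc2, [(0, 0), (0, pad)])             # appended columns are zeros
    return mx.quantize(w_padded, group_size=group_size, bits=bits)  # (w_q, scales, biases)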

Dead Neuron Pruning

Activation profiling over 20 forward passes reveals massive neuron death in deep layers:

$$\text{Keep neuron } i \text{ in layer } \ell \iff \bar{a}_i^{(\ell)} \geq \tau, \quad d'_\ell = \text{align}_{64}\left(\max\left(|\{i : \bar{a}_i^{(\ell)} \geq \tau\}|,\ 0.75 \cdot d_\ell\right)\right)$$

Layers 14-30 have 60-100% dead neurons at threshold $\tau = 0.5$, so removal is capped at 25% of each layer's intermediate dimension (the $0.75 \cdot d_\ell$ floor above). Each layer's pruned dimension is stored in per_layer_intermediate_sizes.
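A minimal sketch of the per-layer rule, assuming mean_act holds the profiled statistic $\bar{a}^{(\ell)}$ for each intermediate neuron and the standard gated-MLP weight layout (gate/up: (d_int, d_model), down: (d_model, d_int)); names are illustrative:

import numpy as np

def prune_mlp_layer(gate, up, down, mean_act, tau=0.5, group_size=64, max_prune=0.25):
    """Drop the least-active intermediate neurons, capped at 25% of the layer."""
    d_int = mean_act.shape[0]
    alive = int((mean_act >= tau).sum())
    keep = max(alive, int((1 - max_prune) * d_int))              # remove at most 25%
    keep = (keep + group_size - 1) // group_size * group_size    # align_64
    idx = np.sort(np.argsort(-mean_act)[:keep])                  # most-active neurons, original order
    # keep becomes this layer's entry in per_layer_intermediate_sizes
    return gate[idx], up[idx], down[:, idx], keep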

Resolution vs Attention Cost

Vision self-attention scales quadratically with patch count $n$. A 25% linear resolution reduction yields ~68% compute savings:

$$\text{896px:} \ n = 64^2 = 4096, \quad \mathcal{O}(4096^2 \cdot d) \approx 16.8\text{M} \cdot d$$ $$\text{672px:} \ n = 48^2 = 2304, \quad \mathcal{O}(2304^2 \cdot d) \approx 5.3\text{M} \cdot d \quad (\sim\!3.2\times \text{ reduction})$$
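Dropping to 672px also requires resampling the learned position embeddings from the 64×64 patch grid to 48×48 (Step 4). A minimal sketch, using scipy's order-1 zoom as a stand-in for bilinear interpolation and assuming a (4096, d) embedding table:

import numpy as np
from scipy.ndimage import zoom

def resize_pos_embed(pos: np.ndarray, old_side: int = 64, new_side: int = 48):
    d = pos.shape[-1]
    grid = pos.reshape(old_side, old_side, d)                    # (64, 64, d)
    scale = new_side / old_side
    resized = zoom(grid, (scale, scale, 1), order=1)             # bilinear resize
    return resized.reshape(new_side * new_side, d)               # (2304, d)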

Benchmarks

Apple Silicon, greedy decoding (temperature=0.0).

Text Generation

Model            Disk     Prompt (t/s)   Generation (t/s)   Peak Memory
Original         2.8 GB   109            90                 2910 MB
Lite (Step 4)    2.3 GB   ~120           ~110               ~2500 MB
Mobile (Step 7)  2.1 GB   120            110                2231 MB

Image Understanding

Model             Prompt (t/s)   Generation (t/s)   Peak Memory   Quality
Original (896px)  54             27                 ~5500 MB      Excellent
Step 3 (896px)    73             61                 4850 MB       Good
Mobile (672px)    184            104                4358 MB       Good

Failed Experiments

448px Resolution

Token repetition loops — insufficient visual information causes degenerate cyclic generation.

Remove Vision Layers 12-15 (-35 MB)

Complete hallucination — pizza misidentified as "skin texture". SigLIP encoder has near-zero redundancy.

80K Vocabulary (v1, dictionary-only)

Generation quality collapse. Missing BPE-merged tokens cause fragmented output. Fixed in v2 with ASCII vocab scan (144K tokens).

Quick Start

1. Clone

git clone https://github.com/AtomGradient/swift-gemma-cli.git gemma-cli
git clone https://github.com/AtomGradient/mlx-swift-lm.git mlx-swift-lm

2. Download Model

pip install huggingface_hub
huggingface-cli download AtomGradient/gemma-3-4b-it-qat-4bit-mobile --local-dir models/mobile

3. Build & Run

cd gemma-cli
swift build -c release

# Text generation
swift run -c release gemma-cli models/mobile \
  --prompt "Explain quantum computing." --max-tokens 200 --temperature 0.0

# Image understanding
swift run -c release gemma-cli models/mobile \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200

Citation

@article{atomgradient2025gemmaprune,
  title={Gemma-Prune: A Multi-Stage Compression Pipeline for Deploying
         Gemma 3 4B Vision-Language Model on Mobile Devices},
  author={AtomGradient},
  year={2025},
  url={https://github.com/AtomGradient/swift-gemma-cli}
}