Efficient On-Device Text-to-Speech: A Post-Training Compression Pipeline for Qwen3 TTS on Apple Silicon

Five orthogonal compression techniques reduce Qwen3 TTS from 2.35 GB to 808 MB (67% reduction) while preserving audio quality, enabling real-time speech synthesis on edge devices.

AtomGradient

67% size reduction · 59% memory reduction · 0.68x real-time factor · 808 MB final model size

Abstract

We present a comprehensive post-training compression pipeline for deploying the Qwen3 TTS 0.6B speech synthesis model on edge devices with Apple Silicon. Our approach combines five orthogonal, stackable techniques—vocabulary pruning, speech tokenizer pruning, 4-bit weight quantization, MLP neuron pruning, and transformer layer pruning—to reduce total model size from 2.35 GB to 808 MB (67% reduction) while preserving perceptually equivalent audio quality.

Central to our approach is a novel token map indirection scheme that reduces the text embedding matrix from 622 MB to 194 MB without retraining the tokenizer or modifying the model architecture. We implement the full inference pipeline natively in Swift using Apple's MLX framework, achieving faster-than-real-time synthesis (RTF ≈ 0.68 for the fully compressed model) with peak memory of about 2.1 GB.

Compression Pipeline

Five orthogonal techniques, each targeting a distinct source of redundancy; they compose without interference and can be applied in any order. The four steps below account for the cumulative size reduction shown in the chart; MLP neuron pruning and layer pruning are assessed separately under Quality Assessment.

1. Vocabulary Pruning (-428 MB): 151K → 47K tokens via token map indirection. Lossless.
2. Speech Tokenizer Encoder Stripping (-225 MB): removes the encoder, which is needed only for voice cloning. Lossless.
3. FP32 → FP16 (-228 MB): speech tokenizer decoder; max|w| < 36, safely within fp16 range (see the sketch after this list).
4. 4-bit Quantization (-805 MB): 249 linear layers; embeddings kept in bf16 (a group-quantization sketch follows the chart below).

Cumulative Size Reduction

[Chart: cumulative size, split into main model vs. speech tokenizer: Original (bf16) 2,494 MB → + Vocab Pruning 2,066 MB → + ST Pruning 1,613 MB → + 4-bit Quant 808 MB.]
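Step 4 applies group-wise 4-bit quantization to the 249 linear layers via MLX. A self-contained sketch of the underlying idea: affine quantization with one (scale, min) pair per group of weights. Group size and storage layout here are illustrative assumptions, not MLX's exact on-disk format:

// Affine 4-bit group quantization: each group of `groupSize` weights
// shares one (scale, min) pair; codes are integers in 0...15.
func quantize4bit(_ w: [Float], groupSize: Int = 32)
    -> (codes: [UInt8], scales: [Float], mins: [Float])
{
    var codes: [UInt8] = [], scales: [Float] = [], mins: [Float] = []
    for start in stride(from: 0, to: w.count, by: groupSize) {
        let group = w[start..<min(start + groupSize, w.count)]
        let lo = group.min()!, hi = group.max()!
        let scale = (hi - lo) / 15            // 16 representable levels
        scales.append(scale)
        mins.append(lo)
        for v in group {                      // dequant: v ≈ lo + scale * code
            let q = scale > 0 ? ((v - lo) / scale).rounded() : 0
            codes.append(UInt8(min(15, max(0, q))))
        }
    }
    return (codes, scales, mins)
}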

Token Map Indirection

The text embedding matrix [151,936 × 2,048] inherits Qwen3's full multilingual vocabulary, but TTS only uses ~47K tokens. Instead of retraining the tokenizer, we use a simple integer mapping array:

embed(t) = E′[m[t]],   where   m ∈ ℤ^151,936   and   E′ ∈ ℝ^(47,427 × 2,048)

This is mathematically lossless—every preserved embedding row is an exact copy from the original matrix.
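The map can be built offline in one pass over the kept token IDs. A pure-Swift sketch (the actual pipeline operates on MLX tensors; all names are illustrative, and the zero fallback row for unmapped IDs is an assumption matching the zero-vector behavior described below):

// Build the indirection map m and pruned embedding E′ from the original
// table and the set of token IDs the TTS frontend can emit.
// Unmapped IDs point at a zero fallback row; kept rows are copied verbatim.
func buildTokenMap(embedding: [[Float]], usedIDs: Set<Int>)
    -> (map: [Int32], pruned: [[Float]])
{
    let dim = embedding.first?.count ?? 0
    var pruned: [[Float]] = [[Float](repeating: 0, count: dim)] // row 0: zero fallback
    var map = [Int32](repeating: 0, count: embedding.count)     // unused IDs -> row 0
    for (i, oldID) in usedIDs.sorted().enumerated() {
        map[oldID] = Int32(i + 1)
        pruned.append(embedding[oldID])  // exact row copy: lossless
    }
    return (map, pruned)
}

At bf16 the pruned table costs 47,427 × 2,048 × 2 bytes ≈ 194 MB, matching the figure in the abstract, while the map itself adds only 151,936 int32 entries (~0.6 MB).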

BPE Space-Prefix Insight

A critical finding: BPE tokenizers produce different tokens for the same word depending on context. Omitting space-prefixed variants causes mid-sentence words to map to zero vectors, triggering premature EOS.

encode("my")  = [2408]   # sentence-initial
encode(" my") = [847]    # mid-sentence (different token!)

Including both variants: 20K → 47K tokens (still only 31% of original 152K vocabulary).
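A hedged sketch of the collection step this implies: enumerate each surface form in both sentence-initial and mid-sentence position. The tokenize closure stands in for the real BPE encoder; the released pipeline may build the set differently:

// Collect every token ID the TTS frontend can emit, including the
// space-prefixed BPE variants that appear mid-sentence.
func collectUsedTokenIDs(vocabularyWords: [String],
                         tokenize: (String) -> [Int]) -> Set<Int> {
    var used = Set<Int>()
    for word in vocabularyWords {
        used.formUnion(tokenize(word))         // sentence-initial form
        used.formUnion(tokenize(" " + word))   // mid-sentence form (different IDs)
    }
    return used
}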

Results

Model Size Comparison

| Configuration | Main Model | Speech Tok. | Total | Reduction |
|---|---|---|---|---|
| Original (bf16) | 1,812 MB | 682 MB | 2,494 MB | baseline |
| + Vocab pruning | 1,384 MB | 682 MB | 2,066 MB | 17.2% |
| + ST pruning | 1,384 MB | 229 MB | 1,613 MB | 35.3% |
| + 4-bit quantization | 579 MB | 229 MB | 808 MB | 67.6% |

Inference Performance (Apple Silicon)

| Configuration | Disk (MB) | Peak Mem (GB) | Load (s) | RTF |
|---|---|---|---|---|
| Original bf16 | 2,494 | 5.14 | 2.74 | 0.70 |
| Original 4-bit | 1,611 | 4.66 | 2.73 | 0.74 |
| Pruned bf16 | 1,613 | 2.81 | 2.58 | 0.66 |
| Pruned 4-bit | 808 | 2.13 | 2.50 | 0.68 |
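RTF here is wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. A trivial helper makes the definition concrete (sample rate per the tokenizer's 24 kHz output):

// Real-time factor: synthesis wall-clock time / generated audio duration.
// RTF < 1.0 means the model produces audio faster than it plays back.
func realTimeFactor(synthesisSeconds: Double,
                    audioSampleCount: Int,
                    sampleRate: Double = 24_000) -> Double {
    synthesisSeconds / (Double(audioSampleCount) / sampleRate)
}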

Quality Assessment

| Technique | Lossless? | Quality Impact |
|---|---|---|
| Vocabulary pruning | Lossless | Identical to original |
| ST pruning (fp16 + encoder strip) | Quasi-lossless | Imperceptible (~10⁻⁴ rounding error) |
| 4-bit quantization | Lossy | Near-identical; ~1 s longer audio on average |
| MLP neuron pruning | Lossy | Near-identical (inactive neurons only) |
| Layer pruning (-3 layers) | Lossy | Minor prosody degradation |
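The note that MLP neuron pruning touches inactive neurons only suggests an activation-based selection. A hedged sketch of one such criterion (the statistic and threshold are assumptions, not the paper's exact procedure):

// Keep MLP neurons whose peak absolute activation over a calibration set
// exceeds a small threshold; the rest are treated as inactive and pruned.
func activeNeuronIndices(peakActivations: [Float],
                         threshold: Float = 1e-6) -> [Int] {
    peakActivations.enumerated()
        .filter { abs($0.element) > threshold }
        .map(\.offset)
}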

Model Architecture

Qwen3 TTS 0.6B follows a codec-based speech synthesis paradigm:

| Component | Architecture | Key Parameters |
|---|---|---|
| Talker | 28-layer Transformer | hidden=1024, heads=16 (GQA, 8 KV heads), M-RoPE [24, 20, 20], SwiGLU MLP |
| CodePredictor | 5-layer Transformer | 16 codebook heads, QK-Norm with RMSNorm |
| SpeechTokenizer | Conv decoder + Split-RVQ | 1 semantic + 15 acoustic codebooks, 12.5 Hz frame rate, 24 kHz output |
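For orientation, the Talker's shape can be captured in a config struct (field names are illustrative, not the release's API; values are taken from the table above):

// Talker hyperparameters from the table above (illustrative field names).
struct TalkerConfig {
    var layers = 28
    var hiddenSize = 1024
    var attentionHeads = 16
    var kvHeads = 8                    // grouped-query attention
    var mropeSections = [24, 20, 20]   // M-RoPE dimension split
}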

Storage Breakdown (bf16)

| Component | Size | % of Total |
|---|---|---|
| Text embedding [151,936 × 2,048] | 622 MB | 34.4% |
| MLP layers (×28) | 623 MB | 34.4% |
| Attention layers (×28) | 415 MB | 22.9% |
| Codec embedding + CodePredictor | 132 MB | 7.3% |
| Other (projections, norms, head) | 19 MB | 1.0% |

Swift Inference Engine

The complete Qwen3 TTS pipeline is implemented natively in Swift using Apple's MLX framework, with no Python dependencies.

Token Map Support

func embedText(_ ids: MLXArray) -> MLXArray {
    if let tokenMap = model.textTokenMap {
        return model.textEmbedding(tokenMap[ids])  // mapped lookup
    }
    return model.textEmbedding(ids)                // direct lookup
}

Generation Length Control

To prevent runaway generation under stochastic sampling (temperature = 0.9):

T_max = min(T_config, max(75, 6 · |tokens(x)|))
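In code the cap is a one-liner (a sketch; configLimit is a placeholder for the model's configured maximum T_config):

// Cap generation at 6 steps per input text token, with a floor of 75 steps,
// never exceeding the configured maximum.
func maxGenerationSteps(textTokenCount: Int, configLimit: Int) -> Int {
    min(configLimit, max(75, 6 * textTokenCount))
}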

Quick Start

git clone https://github.com/AtomGradient/swift-qwen3-tts.git
cd swift-qwen3-tts

swift run Qwen3TTSDemo \
  --model path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Aiden \
  --text "Hello, this is on-device TTS!" \
  --output output.wav

Pre-built Models

We release two edge-optimized model variants, ready for on-device deployment:

| Model | Size | Compression | Quality |
|---|---|---|---|
| bf16-pruned-vocab-lite | 1.5 GB | Vocab pruning + ST lite | Lossless |
| 4bit-pruned-vocab-lite | 808 MB | + 4-bit quantization | Near-identical |

Both models support 9 speakers (Aiden, Serena, Vivian, Ryan, Uncle Fu, Ono Anna, Sohee, Eric, Dylan) across 12 languages with emotion control.

Citation

@article{atomgradient2026efficient,
  title={Efficient On-Device Text-to-Speech: A Post-Training Compression
         Pipeline for Qwen3 TTS on Apple Silicon},
  author={AtomGradient},
  year={2026},
  url={https://github.com/AtomGradient/swift-qwen3-tts}
}