Five orthogonal compression techniques reduce Qwen3 TTS from 2.44 GB to 808 MB (a 67% reduction) while preserving audio quality, enabling real-time speech synthesis on edge devices.
We present a comprehensive post-training compression pipeline for deploying the Qwen3 TTS 0.6B speech synthesis model on edge devices with Apple Silicon. Our approach combines five orthogonal, stackable techniques (vocabulary pruning, speech tokenizer pruning, 4-bit weight quantization, MLP neuron pruning, and transformer layer pruning) to reduce total model size from 2,494 MB to 808 MB (67% reduction) while preserving perceptually equivalent audio quality.
Central to our approach is a novel token map indirection scheme that reduces the text embedding matrix from 622 MB to 194 MB without retraining the tokenizer or modifying the model architecture. We implement the full inference pipeline natively in Swift using Apple's MLX framework, achieving faster-than-real-time synthesis (RTF ≈ 0.7) with peak memory under 2.2 GB.
We apply five orthogonal techniques, each targeting a distinct source of redundancy. They compose without interference and can be applied in any order.
[Figure: compression pipeline, applied to the Main Model and the Speech Tokenizer]
The text embedding matrix [151,936 × 2,048] inherits Qwen3's full multilingual vocabulary, but TTS uses only ~47K of those tokens. Instead of retraining the tokenizer, we remap token ids through a simple integer indirection array and keep only the embedding rows that TTS actually uses.
This is mathematically lossless—every preserved embedding row is an exact copy from the original matrix.
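The indirection can be sketched in a few lines (illustrative Python with toy sizes; the released implementation is in Swift/MLX, and the real matrix is [151,936 × 2,048] pruned to ~47K rows):

```python
def build_token_map(used_ids, vocab_size):
    """token_map[orig_id] -> row in the pruned embedding (0 for unused ids)."""
    kept_ids = sorted(set(used_ids))
    token_map = [0] * vocab_size
    for new_row, orig_id in enumerate(kept_ids):
        token_map[orig_id] = new_row
    return token_map, kept_ids

def prune_embedding(embedding, kept_ids):
    # Exact row copies from the original matrix -> mathematically lossless.
    return [embedding[i] for i in kept_ids]

# Toy example: vocabulary of 10, hidden size 2, TTS uses tokens {2, 5, 7}.
emb = [[float(i), float(i) + 0.5] for i in range(10)]
tmap, kept = build_token_map([5, 2, 7], vocab_size=10)
pruned = prune_embedding(emb, kept)
assert pruned[tmap[5]] == emb[5]   # mapped lookup returns the original row
```

Inference then costs one extra integer gather per token, which is negligible next to the embedding lookup itself.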
A critical finding: BPE tokenizers produce different tokens for the same word depending on context. Omitting space-prefixed variants causes mid-sentence words to map to zero vectors, triggering premature EOS.
```python
encode("my")  = [2408]  # sentence-initial
encode(" my") = [847]   # mid-sentence (different token!)
```
Including both variants grows the kept set from 20K to 47K tokens (still only 31% of the original 152K vocabulary).
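When harvesting the kept-token set, each word therefore has to be encoded in both positions. A toy sketch (the `encode` function and vocabulary here are stand-ins for the real BPE tokenizer):

```python
# Toy stand-in for the BPE tokenizer: the same word gets different ids
# depending on whether it is sentence-initial or space-prefixed.
toy_vocab = {"my": 2408, " my": 847}

def encode(text):
    return [toy_vocab[text]]

used = set()
for word in ["my"]:
    used.update(encode(word))        # sentence-initial variant
    used.update(encode(" " + word))  # mid-sentence (space-prefixed) variant

assert used == {2408, 847}  # both variants must survive pruning
```

Dropping either variant would send that token to a zero row at inference time, which is exactly the premature-EOS failure described above.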
| Configuration | Main Model | Speech Tok. | Total | Reduction |
|---|---|---|---|---|
| Original (bf16) | 1,812 MB | 682 MB | 2,494 MB | — |
| + Vocab pruning | 1,384 MB | 682 MB | 2,066 MB | 17.2% |
| + ST pruning | 1,384 MB | 229 MB | 1,613 MB | 35.3% |
| + 4-bit quantization | 579 MB | 229 MB | 808 MB | 67.6% |
| Configuration | Disk (MB) | Peak Mem (GB) | Load (s) | RTF |
|---|---|---|---|---|
| Original bf16 | 2,494 | 5.14 | 2.74 | 0.70 |
| Original 4-bit | 1,611 | 4.66 | 2.73 | 0.74 |
| Pruned bf16 | 1,613 | 2.81 | 2.58 | 0.66 |
| Pruned 4-bit | 808 | 2.13 | 2.50 | 0.68 |
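RTF (real-time factor) in the table is wall-clock synthesis time divided by the duration of the generated audio; values below 1.0 mean synthesis outpaces playback:

```python
def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: < 1.0 means faster than real time."""
    return synthesis_seconds / audio_seconds

# e.g. 6.8 s of compute for 10 s of audio reproduces the pruned 4-bit
# model's reported RTF of 0.68 (the 6.8/10 split itself is illustrative):
assert round(rtf(6.8, 10.0), 2) == 0.68
```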
| Technique | Lossless? | Quality Impact |
|---|---|---|
| Vocabulary pruning | Lossless | Identical to original |
| ST pruning (fp16 + encoder strip) | Quasi-lossless | Imperceptible (~10⁻⁴ rounding error) |
| 4-bit quantization | Lossy | Near-identical; ~1s avg. longer audio |
| MLP neuron pruning | Lossy | Near-identical (inactive neurons only) |
| Layer pruning (-3 layers) | Lossy | Minor prosody degradation |
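The 4-bit row refers to group-wise affine weight quantization. A toy round-trip on a single group shows why the quality impact is small (simplified sketch; MLX's actual quantizer stores a per-group scale and bias, and the group size is a tunable parameter):

```python
# Toy affine 4-bit quantization of one weight group.
def quantize_group(w, levels=16):
    lo, hi = min(w), max(w)
    scale = (hi - lo) / (levels - 1) or 1.0   # avoid div-by-zero for flat groups
    q = [round((x - lo) / scale) for x in w]  # integers in [0, 15]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return [v * scale + lo for v in q]

w = [0.1, -0.3, 0.25, 0.0]
q, scale, bias = quantize_group(w)
w_hat = dequantize_group(q, scale, bias)
# Reconstruction error is bounded by half a quantization step:
assert max(abs(a - b) for a, b in zip(w, w_hat)) <= scale / 2 + 1e-9
```

Because each group has its own scale, the error stays proportional to the local weight range rather than the global one.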
Qwen3 TTS 0.6B follows a codec-based speech synthesis paradigm:
| Component | Architecture | Key Parameters |
|---|---|---|
| Talker | 28-layer Transformer | hidden=1024, heads=16 (GQA 8 KV), M-RoPE [24,20,20], SwiGLU MLP |
| CodePredictor | 5-layer Transformer | 16 codebook heads, QK-Norm with RMSNorm |
| SpeechTokenizer | Conv Decoder + Split-RVQ | 1 semantic + 15 acoustic codebooks, 12.5 Hz, 24kHz output |
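The SpeechTokenizer row implies the codec's frame arithmetic: at a 12.5 Hz token rate and 24 kHz output, each frame of 16 codes (1 semantic + 15 acoustic) covers 1,920 audio samples:

```python
sample_rate = 24_000       # Hz, decoder output
frame_rate = 12.5          # Hz, codec token rate
samples_per_frame = int(sample_rate / frame_rate)
assert samples_per_frame == 1920

codes_per_frame = 1 + 15   # 1 semantic + 15 acoustic codebooks
assert codes_per_frame == 16
```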
| Component | Size | % of Total |
|---|---|---|
| Text Embedding [151,936 × 2,048] | 622 MB | 34.4% |
| MLP Layers (×28) | 623 MB | 34.4% |
| Attention Layers (×28) | 415 MB | 22.9% |
| Codec Embedding + CodePredictor | 132 MB | 7.3% |
| Other (projections, norms, head) | 19 MB | 1.0% |
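The embedding figure is easy to verify from the shape: 151,936 rows × 2,048 dims at 2 bytes per weight (bf16), and pruning to ~47K rows lands near the reported 194 MB:

```python
vocab, hidden, bytes_per_weight = 151_936, 2_048, 2   # bf16
size_mb = vocab * hidden * bytes_per_weight / 1e6
assert round(size_mb) == 622                          # matches the table

pruned_mb = 47_000 * hidden * bytes_per_weight / 1e6
assert 190 < pruned_mb < 195                          # ~192.5 MB, reported as 194 MB
```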
The complete Qwen3 TTS pipeline is implemented natively in Swift using Apple's MLX framework, with no Python dependencies.
```swift
func embedText(_ ids: MLXArray) -> MLXArray {
    if let tokenMap = model.textTokenMap {
        // Pruned model: remap original token ids to pruned embedding rows.
        return model.textEmbedding(tokenMap[ids])
    }
    return model.textEmbedding(ids)  // unpruned model: direct lookup
}
```
To prevent runaway generation under stochastic sampling (temperature = 0.9), the decoding loop is bounded: if the EOS token is never sampled, generation stops at a hard frame limit.
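A minimal sketch of such a guard (the cap heuristic, constants, and names here are assumptions for illustration, not the released implementation):

```python
# Bound decoding by a frame budget proportional to input length, so a
# never-sampled EOS cannot loop forever (frames_per_token=8 is a made-up cap).
def max_frames(num_text_tokens, frames_per_token=8, floor=50):
    return max(floor, num_text_tokens * frames_per_token)

def decode(num_text_tokens, sample_step):
    frames = []
    for _ in range(max_frames(num_text_tokens)):
        token = sample_step()
        if token is None:          # stand-in for the EOS codec token
            break
        frames.append(token)
    return frames

frames = decode(num_text_tokens=12, sample_step=lambda: 0)  # EOS never sampled
assert len(frames) == max_frames(12)   # hard cap kicks in at 96 frames
```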
```bash
git clone https://github.com/AtomGradient/swift-qwen3-tts.git
cd swift-qwen3-tts
swift run Qwen3TTSDemo \
  --model path/to/Qwen3-TTS-0.6B-CustomVoice-4bit-pruned-vocab-lite \
  --speaker Aiden \
  --text "Hello, this is on-device TTS!" \
  --output output.wav
```
We release two edge-optimized model variants, ready for on-device deployment:
| Model | Size | Compression | Quality |
|---|---|---|---|
| bf16-pruned-vocab-lite | 1.5 GB | Vocab pruning + ST lite | Lossless |
| 4bit-pruned-vocab-lite | 808 MB | + 4-bit quantization | Near-identical |
Both models support 9 speakers (Aiden, Serena, Vivian, Ryan, Uncle Fu, Ono Anna, Sohee, Eric, Dylan) across 12 languages with emotion control.
```bibtex
@article{atomgradient2026efficient,
  title={Efficient On-Device Text-to-Speech: A Post-Training Compression
         Pipeline for Qwen3 TTS on Apple Silicon},
  author={AtomGradient},
  year={2026},
  url={https://github.com/AtomGradient/swift-qwen3-tts}
}
```