Performance tuning

Measure performance with the metrics Edge Kit records after generation. Edge Kit's inference path uses DSR Attention to keep throughput stable across long multi-turn conversations.

Read inference metrics

LLMEngine and VLMEngine expose lastMetrics after a generation completes.

// Stream the response; metrics are recorded once generation completes.
for try await chunk in engine.generate(messages: [.user("Summarize this.")]) {
    print(chunk.text, terminator: "")
}

// lastMetrics is optional, so unwrap it before reading the fields.
if let metrics = engine.lastMetrics {
    print("TTFT (ms):", metrics.ttftMs)
    print("Decode TPS:", metrics.decodeTPS)
    print("Prompt tokens:", metrics.promptTokenCount)
    print("Generated tokens:", metrics.generationTokenCount)
    print("Memory delta (MB):", metrics.memoryDeltaMB)
}

Benchmark in Release

Debug builds can be much slower than Release builds. Always collect benchmark numbers under these conditions:

  • A Release configuration.
  • The real target device.
  • A cooled device with stable thermal state.
  • The same prompt and model across runs.
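
Because single runs are noisy, it can help to repeat the same prompt several times and compare medians. The following is a minimal sketch, assuming you copy the ttftMs and decodeTPS values from lastMetrics into your own sample type after each run. RunSample, median, and summarize are hypothetical names for this example, not part of Edge Kit, and the numbers are placeholders.

```swift
import Foundation

/// One benchmark run. Hypothetical type for this sketch; the fields mirror
/// the ttftMs and decodeTPS values read from lastMetrics.
struct RunSample {
    let ttftMs: Double
    let decodeTPS: Double
}

/// Median of a list of values; returns nil for an empty list.
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    return sorted.count % 2 == 0
        ? (sorted[mid - 1] + sorted[mid]) / 2
        : sorted[mid]
}

/// Summarizes repeated runs of the same prompt and model.
func summarize(_ runs: [RunSample]) -> (medianTTFTMs: Double, medianDecodeTPS: Double)? {
    guard let ttft = median(runs.map(\.ttftMs)),
          let tps = median(runs.map(\.decodeTPS)) else { return nil }
    return (ttft, tps)
}

// Placeholder numbers for illustration only, not measurements.
let runs = [
    RunSample(ttftMs: 430, decodeTPS: 19.0),
    RunSample(ttftMs: 410, decodeTPS: 20.0),
    RunSample(ttftMs: 450, decodeTPS: 18.0),
]
if let summary = summarize(runs) {
    print("Median TTFT (ms):", summary.medianTTFTMs)
    print("Median decode TPS:", summary.medianDecodeTPS)
}
```

The median is used here rather than the mean so that one thermally throttled run does not skew the result.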

Choose the model size first

Larger models can improve quality, but they also increase load time, memory pressure, and per-token cost.

Goal | Recommendation
---- | --------------
Lowest latency | Start with a 0.8B or small 4-bit model.
Balanced chat quality | Start with a 4B 4-bit model.
Highest local quality | Use a larger model only on validated high-memory devices.
Edge Studio optimization or artifact generation | Keep higher-precision source models on a Mac.
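
To get a feel for why model size drives memory pressure, a back-of-the-envelope estimate of weight storage is parameters times bits per weight, divided by 8. The sketch below is only that arithmetic. It ignores quantization scales, the KV cache, activations, and runtime overhead, so treat the result as a lower bound rather than a measured footprint. The function name is hypothetical.

```swift
import Foundation

/// Rough lower bound on weight storage in bytes:
/// parameters × bits per weight / 8.
/// Ignores quantization scales, KV cache, activations, and runtime overhead.
func estimatedWeightBytes(parameterCount: Double, bitsPerWeight: Double) -> Double {
    parameterCount * bitsPerWeight / 8
}

let bytesPerGB = 1_000_000_000.0
for (name, params) in [("0.8B", 0.8e9), ("4B", 4e9), ("9B", 9e9)] {
    let bytes = estimatedWeightBytes(parameterCount: params, bitsPerWeight: 4)
    print(name, "at 4-bit:", String(format: "%.1f GB", bytes / bytesPerGB))
}
```

By this estimate, a 4B model at 4 bits needs at least about 2 GB for weights alone, before any conversation context is cached.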

Use prompt cache for conversations

Multi-turn LLM and VLM conversations reuse conversation context automatically. Keep the same engine instance for one conversation, then clear the cache when the user starts a new one.

engine.clearPromptCache()
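
One way to tie the cache lifetime to the conversation lifetime is to wrap the engine in a small session object. This is a sketch under stated assumptions: PromptCacheClearing, ConversationSession, and MockEngine are hypothetical names invented for the example, and the protocol captures only the clearPromptCache() call shown above. Whether the real engine types can conform to such a protocol is an assumption you would need to check.

```swift
/// Hypothetical protocol for this sketch; it captures only the
/// clearPromptCache() call, not the full engine API.
protocol PromptCacheClearing: AnyObject {
    func clearPromptCache()
}

/// Holds one engine for the lifetime of a conversation and clears
/// the prompt cache whenever a new conversation begins.
final class ConversationSession<Engine: PromptCacheClearing> {
    let engine: Engine
    private(set) var turnCount = 0

    init(engine: Engine) { self.engine = engine }

    /// Call once per user turn, alongside the generation request.
    func recordTurn() { turnCount += 1 }

    /// Call when the user starts a new conversation.
    func startNewConversation() {
        engine.clearPromptCache()
        turnCount = 0
    }
}

/// Stand-in engine so the sketch runs without Edge Kit.
final class MockEngine: PromptCacheClearing {
    private(set) var clearCount = 0
    func clearPromptCache() { clearCount += 1 }
}

let session = ConversationSession(engine: MockEngine())
session.recordTurn()
session.recordTurn()
session.startNewConversation()
print("Turns:", session.turnCount, "Cache clears:", session.engine.clearCount)
```

Keeping the engine as a stored property makes it harder to accidentally create a fresh engine mid-conversation, which would discard the reused context.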

Monitor process footprint

Use the process physical footprint when debugging memory pressure. Available-memory APIs can be misleading on iOS because the limit the system enforces is lower than the device's physical RAM.
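
The physical footprint can be read with the Mach task_info call and the TASK_VM_INFO flavor. The sketch below uses only Darwin system APIs and does not depend on Edge Kit. The function name is hypothetical, and the non-Darwin branch exists only so the example compiles on other platforms.

```swift
import Foundation

/// Returns the current process physical footprint in bytes, or nil if the
/// query fails or the platform is not Darwin.
func physicalFootprintBytes() -> UInt64? {
    #if canImport(Darwin)
    var info = task_vm_info_data_t()
    // task_info expects the buffer size in units of integer_t.
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<integer_t>.size
    )
    let result = withUnsafeMutablePointer(to: &info) { infoPtr in
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { intPtr in
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), intPtr, &count)
        }
    }
    guard result == KERN_SUCCESS else { return nil }
    return UInt64(clamping: info.phys_footprint)
    #else
    return nil
    #endif
}

if let bytes = physicalFootprintBytes() {
    print(String(format: "Footprint: %.1f MB", Double(bytes) / 1_048_576))
} else {
    print("Footprint unavailable on this platform")
}
```

Logging this value before model load, after load, and after a long generation gives a view of memory growth that can be compared across runs.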

Reference benchmarks

Real-device measurements with Qwen3.5 models, 20-turn conversation stress tests:

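Device | Model | First turn | Median | T20 | TTFT
------ | ----- | ---------- | ------ | --- | ----
iPhone 17 (A19, 11GB) | 9B-4bit | 12.6 TPS | 11.6 TPS | 10.8 TPS | 566ms
iPhone Air (A19, 11GB) | 9B-4bit | 9.5 TPS | 7.8 TPS | 7.5 TPS | 918ms
iPhone 17 (A19, 11GB) | 4B-4bit | 21.8 TPS | 19.6 TPS | 17.3 TPS | 420ms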

Custom engine prefill vs generic framework (M2 Ultra 192GB):

Workload | Edge Engine | Generic | Speedup
-------- | ----------- | ------- | -------
Text prefill (4B) | 1,305 TPS | 187 TPS | 7.0×
Text prefill (9B) | 843 TPS | 122 TPS | 6.9×
VLM image prefill (4B) | 1,803 TPS | 851 TPS | 2.1×
VLM image prefill (9B) | 1,234 TPS | 511 TPS | 2.4×

Use these numbers as orientation. Your results will vary based on model, device thermal state, and conversation length.
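
When comparing your own runs against the reference numbers, two derived ratios are often easier to read than raw TPS: how much first-turn throughput remains at a later turn, and the speedup of one prefill path over another. The helpers below are a hypothetical sketch of that arithmetic, using rows from the reference tables as inputs.

```swift
import Foundation

/// Fraction of first-turn decode throughput still available at a later turn.
func retention(firstTurnTPS: Double, laterTurnTPS: Double) -> Double {
    laterTurnTPS / firstTurnTPS
}

/// Ratio of one throughput to another.
func speedup(_ fast: Double, over slow: Double) -> Double {
    fast / slow
}

// iPhone 17, 9B-4bit: first turn 12.6 TPS, T20 10.8 TPS.
let kept = retention(firstTurnTPS: 12.6, laterTurnTPS: 10.8)
print(String(format: "T20 retention: %.0f%%", kept * 100))

// Text prefill (4B): 1,305 TPS vs 187 TPS.
let prefillSpeedup = speedup(1305, over: 187)
print(String(format: "Text prefill speedup: %.1fx", prefillSpeedup))
```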

Practical checklist

  • Use quantized models for interactive inference.
  • Prefer local model directories during development for repeatable tests.
  • Keep one active generation per engine instance.
  • Unload unused engines before switching model categories.
  • Re-run benchmarks after changing model size, build configuration, or target device.