Memory management

Edge Kit manages model and generation memory automatically, but app-level choices still matter.

Why iOS memory is different

iOS can terminate an app before physical RAM is exhausted. Treat the process footprint and real-device behavior as the source of truth.
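
One way to read the process footprint from inside the app is the Mach `task_info` call with the `TASK_VM_INFO` flavor. The sketch below is not part of Edge Kit: `physicalFootprintBytes()` is a hypothetical helper name, and it returns nil on platforms where Darwin is unavailable.

```swift
import Foundation

/// Hypothetical helper (not an Edge Kit API): returns the current process
/// physical footprint in bytes, or nil if it cannot be read.
func physicalFootprintBytes() -> UInt64? {
    #if canImport(Darwin)
    var info = task_vm_info_data_t()
    // task_info expects the buffer size in units of integer_t.
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<integer_t>.size
    )
    let result = withUnsafeMutablePointer(to: &info) { pointer in
        pointer.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
        }
    }
    return result == KERN_SUCCESS ? UInt64(info.phys_footprint) : nil
    #else
    return nil
    #endif
}

// Log the footprint, for example before and after a model load.
if let bytes = physicalFootprintBytes() {
    print("phys_footprint: \(bytes / 1_048_576) MiB")
}
```

Logging this value around model load and long generations on a real device gives a number you can compare across builds.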

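For larger models, enable the Increased Memory Limit entitlement (`com.apple.developer.kernel.increased-memory-limit`), which asks iOS for a higher per-app memory limit on supported devices.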

What Edge Kit manages

Area	Behavior
Model load	Checks the model against the current device before loading.
KV cache	Applies automatic KV cache management for text generation.
Prompt cache	Reuses conversation context across turns.
Memory pressure	Responds to system memory warnings.
Single-shot tasks	Releases temporary buffers after single-shot workloads such as STT and TTS.

Choose a memory intent

For conversational agents, declare a product-level intent and let Edge Kit plan the low-level cache policy:

Intent	Use when
`.balanced`	Default for most chat sessions.
`.longSession`	You want to preserve more resident context when the device budget allows it.
`.exactRecall`	The session often asks about amounts, dates, counts, or audit-style facts. Use this with app-owned tool or fact-store recall for exact data.
`.batteryFriendly`	You want lower resident-state pressure for thermal or battery-sensitive flows.

// Declare the intent at load time; Edge Kit plans the cache policy from it.
let options = NativeRuntimeLoadOptions(memoryIntent: .longSession)
try await engine.loadLocal(directory: modelURL, options: options)
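
If the app has several chat flows, the choice of intent can live in one small function. This is a minimal sketch under stated assumptions: `MemoryIntentSketch` is a local stand-in that mirrors the intent names from the table, `SessionProfile` and `chooseIntent(for:)` are hypothetical, and the priority order is an illustrative choice rather than Edge Kit guidance.

```swift
// Local stand-in mirroring the intent names in the table above.
// In an app you would pass Edge Kit's own intent value to
// NativeRuntimeLoadOptions(memoryIntent:) instead.
enum MemoryIntentSketch: String {
    case balanced, longSession, exactRecall, batteryFriendly
}

// Hypothetical, app-owned description of a chat flow.
struct SessionProfile {
    var needsExactFacts = false
    var isThermallySensitive = false
    var expectsManyTurns = false
}

// Assumed priority: battery and thermal limits first, then exact recall,
// then long sessions, otherwise the default.
func chooseIntent(for profile: SessionProfile) -> MemoryIntentSketch {
    if profile.isThermallySensitive { return .batteryFriendly }
    if profile.needsExactFacts { return .exactRecall }
    if profile.expectsManyTurns { return .longSession }
    return .balanced
}

let support = SessionProfile(needsExactFacts: true, expectsManyTurns: true)
print(chooseIntent(for: support).rawValue)
```

Keeping the decision in one place makes it easy to revisit after testing on low-memory devices.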

Do not treat DSR windows or memory environment variables as product API; those knobs are diagnostic and experimental. For exact facts, keep a tool or fact store as the source of truth rather than relying on conversation memory alone.
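
A fact store can be as simple as an app-owned dictionary that a tool handler reads from. The sketch below is hypothetical throughout: `FactStore`, its methods, and the sample keys and values are illustrative and not part of Edge Kit.

```swift
// Hypothetical app-owned store for values that must be recalled exactly.
struct FactStore {
    private var facts: [String: String] = [:]

    // Keys are normalized so lookups are case-insensitive.
    mutating func record(_ value: String, for key: String) {
        facts[key.lowercased()] = value
    }

    func lookup(_ key: String) -> String? {
        facts[key.lowercased()]
    }
}

var store = FactStore()
store.record("1,250.00 EUR", for: "invoice-42.total")
store.record("2024-03-15", for: "invoice-42.dueDate")

// A tool handler would answer from the store, not from conversation memory.
if let total = store.lookup("Invoice-42.Total") {
    print("Total: \(total)")
}
```

The model can still phrase the answer, but the amount or date itself comes from data the app controls.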

Conversation cache

Within one conversation, keep the prompt cache and pass the running history on each turn so context can be reused:

for try await chunk in engine.generate(messages: history) {
    print(chunk.text, terminator: "")
}

Clear it when a new conversation starts:

engine.clearPromptCache()
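
One way to make sure the cache is always cleared together with the history is to route both through a single app-side type. This is a sketch, not Edge Kit API: `PromptCacheClearing`, `ChatMessage`, `ConversationSession`, and `CountingEngine` are hypothetical names, and the protocol assumes `clearPromptCache()` is synchronous as in the snippet above.

```swift
// Hypothetical protocol covering only the call shown above.
protocol PromptCacheClearing {
    func clearPromptCache()
}

struct ChatMessage {
    let role: String
    let text: String
}

final class ConversationSession<Engine: PromptCacheClearing> {
    private let engine: Engine
    private(set) var history: [ChatMessage] = []

    init(engine: Engine) {
        self.engine = engine
    }

    func append(role: String, text: String) {
        history.append(ChatMessage(role: role, text: text))
    }

    /// Drops the old history and the prompt cache that was built from it.
    func startNewConversation() {
        history.removeAll()
        engine.clearPromptCache()
    }
}

// Stand-in engine that only counts how often the cache was cleared.
final class CountingEngine: PromptCacheClearing {
    private(set) var clearCount = 0
    func clearPromptCache() { clearCount += 1 }
}
```

With this shape, UI code calls `startNewConversation()` and cannot forget one of the two steps.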

Unload unused engines

Unload an engine you no longer need so its memory can be reclaimed:

engine.unload()

For TTS, use `unloadAsync()` if a streaming generation may still be running:

await ttsEngine.unloadAsync()
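
When an app switches between engines, unloading the old one before loading the next keeps the peak footprint lower. The sketch below illustrates that ordering with hypothetical types: `UnloadableEngine`, `EngineSlot`, and `FakeEngine` are not Edge Kit names, and the protocol models only the synchronous `unload()` call.

```swift
// Hypothetical protocol covering only the synchronous unload() call.
protocol UnloadableEngine: AnyObject {
    func unload()
}

/// Holds at most one engine. The previous engine is unloaded
/// before the loader for the next one runs.
final class EngineSlot {
    private(set) var current: (any UnloadableEngine)?

    func replace(with load: () throws -> any UnloadableEngine) rethrows {
        current?.unload()
        current = nil
        current = try load()
    }

    func clear() {
        current?.unload()
        current = nil
    }
}

// Stand-in engine used to show the ordering.
final class FakeEngine: UnloadableEngine {
    let name: String
    private(set) var isLoaded = true

    init(name: String) { self.name = name }

    func unload() { isLoaded = false }
}

let slot = EngineSlot()
let chat = FakeEngine(name: "chat")
slot.replace { chat }

let vision = FakeEngine(name: "vision")
slot.replace { vision }   // chat.unload() runs before this closure
print(chat.isLoaded, vision.isLoaded)
```

A real loader closure would perform the model load, so the old engine's memory is released before the new allocation begins.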

App best practices

Test on the lowest-memory device you support.
Use Release builds for memory validation.
Avoid loading multiple large engines at the same time.
Downscale images before sending them to a VLM when full resolution is not needed.
Keep long-running generations cancellable.
Use the process physical footprint, not an estimate of free system RAM, when debugging memory.
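
For the image downscaling advice, the target size can be computed before any pixels are touched. The helper below is a hypothetical sketch: `downscaledSize` is not an Edge Kit function, and the 1024-pixel limit in the usage line is an arbitrary example, not a recommended value.

```swift
/// Returns a size whose longer side is at most `maxDimension`,
/// preserving aspect ratio. Images that already fit are not upscaled.
func downscaledSize(width: Int, height: Int, maxDimension: Int) -> (width: Int, height: Int) {
    precondition(width > 0 && height > 0 && maxDimension > 0)
    let longerSide = max(width, height)
    guard longerSide > maxDimension else { return (width, height) }

    let scale = Double(maxDimension) / Double(longerSide)
    let newWidth = max(1, Int((Double(width) * scale).rounded()))
    let newHeight = max(1, Int((Double(height) * scale).rounded()))
    return (newWidth, newHeight)
}

// A 4032 x 3024 photo limited to 1024 on its longer side.
let target = downscaledSize(width: 4032, height: 3024, maxDimension: 1024)
print(target.width, target.height)
```

The resulting size can then be passed to whatever image API the app already uses for resizing.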

Common symptoms

Symptom	What to check
App exits during load	The model is too large for the target device, or the Increased Memory Limit entitlement is missing.
First turn is slow	Cold model load or long prompt prefill.
Later turns slow down	Conversation history is growing; summarize or clear when appropriate.
VLM fails on images	Try smaller images and validate device memory.
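
For the growing-history symptom, one simple alternative to summarizing is to drop the oldest turns once the history exceeds a budget. The sketch below uses character count as a crude stand-in for token count. `Message` and `trimmedHistory` are hypothetical names, and how a trimmed history interacts with Edge Kit's prompt cache is not covered here.

```swift
struct Message {
    let role: String
    let text: String
}

/// Keeps a leading system message, if any, plus the most recent messages
/// that fit in `characterBudget`. Older messages are dropped.
func trimmedHistory(_ history: [Message], characterBudget: Int) -> [Message] {
    var result: [Message] = []
    var candidates = history
    var remaining = characterBudget

    if let first = history.first, first.role == "system" {
        result.append(first)
        remaining -= first.text.count
        candidates.removeFirst()
    }

    // Walk from newest to oldest and stop at the first message that does not fit.
    var recent: [Message] = []
    for message in candidates.reversed() {
        guard message.text.count <= remaining else { break }
        remaining -= message.text.count
        recent.append(message)
    }

    result.append(contentsOf: recent.reversed())
    return result
}

let fullHistory = [
    Message(role: "system", text: "rules"),
    Message(role: "user", text: "aaaaaaaaaa"),
    Message(role: "assistant", text: "bbbbb"),
    Message(role: "user", text: "ccc"),
]
let trimmed = trimmedHistory(fullHistory, characterBudget: 15)
print(trimmed.map { $0.text })
```

A production version would likely budget in tokens and might replace the dropped turns with a summary message.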
