Use it from Swift
Add the package
Package.swift:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
Platforms: iOS 18+ / macOS 15+.
Download + chat (one call)
import CoreMLLLM
// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E4B-coreml")
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
Multi-turn: keep a [CoreMLLLM.Message] array, append each
user/assistant turn, and pass the whole history to
generate(_:maxTokens:) again. Call llm.reset() to start a new
conversation (it clears the KV cache).
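A minimal multi-turn sketch built on the API above. The `history` array and `ask` helper are illustrative names, and the `.assistant` role is an assumption (only `.user` appears in the snippet above):

```swift
import CoreMLLLM

// Illustrative multi-turn loop; `.assistant` role is assumed.
var history: [CoreMLLLM.Message] = []

func ask(_ llm: CoreMLLLM, _ text: String) async throws -> String {
    history.append(CoreMLLLM.Message(role: .user, content: text))

    var reply = ""
    let stream = try await llm.generate(history, maxTokens: 256)
    for await chunk in stream {
        reply += chunk
        print(chunk, terminator: "")
    }

    // Keep the model's turn so the next call sees the full conversation.
    history.append(CoreMLLLM.Message(role: .assistant, content: reply))
    return reply
}

// To start over: llm.reset() clears the KV cache; also drop the local history.
// llm.reset(); history.removeAll()
```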
Gemma 4 E4B → Core ML (INT4, Apple Neural Engine)
Core ML port of google/gemma-4-E4B-it (the 4B-effective Gemma 4 decoder), chunked into 4 sliding-window-attention pieces for the Apple Neural Engine. Produced by john-rocky/CoreML-LLM via:
python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048
Files
chunk1.mlmodelc/                        # L0-11: INT4 palettized, owns its own KV
chunk2.mlmodelc/                        # L12-23: emits kv13_*/kv14_* aliases for producer L22/L23
chunk3.mlmodelc/                        # L24-32: KV-shared
chunk4.mlmodelc/                        # L33-41 + lm_head: multi-function (decode_q1 + verify_qK)
embed_tokens_q8.bin                     640 MB  - INT8 token embeddings (262144 × 2560)
embed_tokens_scales.bin                 512 KB
embed_tokens_per_layer_q8.bin           2.6 GB  - INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin       512 KB
per_layer_projection.bin                53 MB   - fp16 PLE projection
per_layer_norm_weight.bin               512 B   - fp16 PLE norm
cos_full.npy / cos_sliding.npy          4 MB / 2 MB - precomputed RoPE cos
sin_full.npy / sin_sliding.npy          4 MB / 2 MB - precomputed RoPE sin
model_config.json                       711 B   - runtime config (used by the Swift app's loader)
hf_model/
├── tokenizer.json
├── tokenizer_config.json
├── config.json
└── generation_config.json
The producer-layer KV outputs always surface under the fixed names kv13_* / kv14_*, regardless of the actual layer index, so the Swift runtime needs no model-specific wiring on the iOS side.
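For illustration only (this is not the ChunkedEngine code), wiring by prefix could look like the sketch below; the helper name is hypothetical, and only the kv13_*/kv14_* prefixes come from the bundle:

```swift
import CoreML

// Hypothetical helper: collect the producer-layer KV outputs of one chunk by their
// fixed alias prefixes so they can be fed to the next chunk without knowing which
// layers actually produced them.
func sharedKV(from output: MLFeatureProvider) -> [String: MLFeatureValue] {
    var kv: [String: MLFeatureValue] = [:]
    for name in output.featureNames
    where name.hasPrefix("kv13_") || name.hasPrefix("kv14_") {
        if let value = output.featureValue(for: name) {
            kv[name] = value
        }
    }
    return kv
}
```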
Why so many sidecars (vs a single model.mlpackage)?
Gemma 3n / 4 E-series uses a per-layer embedding (PLE) bank that's much larger than the token embedding (2.6 GB vs 640 MB here). Loading PLE through Core ML would dequantize the entire bank into the CPU heap and balloon phys_footprint. We mmap the raw INT8 + scale .bin files instead, dequantize the few rows touched per token in pure Swift, and feed the result to the chunks. The chunks themselves are pure transformer bodies and stay ANE-resident.
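A rough sketch of that row-level dequantization, assuming one fp16 scale per row (the 512 KB scales file matches 262144 rows × 2 bytes) and signed INT8 codes; the authoritative layout is whatever ChunkedEngine.swift implements, so treat this as illustrative only:

```swift
import Foundation

// Illustrative only: mmap the embedding sidecars and dequantize one row per token.
// Assumes one fp16 scale per row and signed INT8 codes.
struct MappedEmbeddings {
    let cols: Int          // 2560 for E4B
    let weights: Data      // embed_tokens_q8.bin, memory-mapped
    let scales: Data       // embed_tokens_scales.bin, memory-mapped

    init(dir: URL, cols: Int = 2560) throws {
        self.cols = cols
        weights = try Data(contentsOf: dir.appendingPathComponent("embed_tokens_q8.bin"),
                           options: .alwaysMapped)
        scales = try Data(contentsOf: dir.appendingPathComponent("embed_tokens_scales.bin"),
                          options: .alwaysMapped)
    }

    /// Dequantize the embedding row for a single token id into Float32.
    func row(for tokenID: Int) -> [Float] {
        let scale: Float = scales.withUnsafeBytes {
            Float($0.bindMemory(to: Float16.self)[tokenID])
        }
        let start = tokenID * cols
        return weights.withUnsafeBytes { raw -> [Float] in
            let q = raw.bindMemory(to: Int8.self)
            return (0..<cols).map { Float(q[start + $0]) * scale }
        }
    }
}
```

Only the rows actually touched per token are materialized, which is what keeps phys_footprint well below the 3+ GB the full embedding banks would otherwise occupy.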
The .npy RoPE tables are pre-baked at conversion-time so Swift doesn't need to ship a cos/sin builder.
Tokenizer
Already included in hf_model/. If you prefer the upstream copy:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-E4B-it")
Standalone usage (Python / Mac)
from huggingface_hub import snapshot_download
import coremltools as ct, numpy as np, json
local = snapshot_download("mlboydaisuke/gemma-4-E4B-coreml")
cfg = json.load(open(f"{local}/model_config.json"))
# .mlmodelc is already compiled, so load it with CompiledMLModel
chunks = [ct.models.CompiledMLModel(f"{local}/chunk{i}.mlmodelc")
          for i in range(1, 5)]
The .mlmodelc directories carry compiled Core ML programs, so there is no compile step on macOS / iPhone; a Mac Studio with MLModelConfiguration.computeUnits = .cpuAndNeuralEngine executes them on the ANE directly.
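For reference, a minimal Swift equivalent of loading one compiled chunk with that compute-unit setting; the path is a placeholder for wherever the snapshot was downloaded:

```swift
import CoreML

// Sketch: load one compiled chunk and pin it to CPU + Neural Engine.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let url = URL(fileURLWithPath: "/path/to/gemma-4-E4B-coreml/chunk1.mlmodelc")
let chunk1 = try MLModel(contentsOf: url, configuration: config)
print(chunk1.modelDescription.inputDescriptionsByName.keys.sorted())
```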
For the full PLE-aware decode loop, see Sources/CoreMLLLM/ChunkedEngine.swift (the canonical implementation); mirror it in Python by:
- mmap'ing embed_tokens_q8.bin (uint8) + embed_tokens_scales.bin (fp16) and dequantizing the row for the current token,
- mmap'ing embed_tokens_per_layer_q8.bin + embed_tokens_per_layer_scales.bin (per-layer rows, dequantized on demand),
- running chunk1..chunk4, threading the kv* outputs from chunk2 as inputs to chunks 3-4 (KV alias names follow the producer-layer convention).
iOS / Mac app
Pick Gemma 4 E4B in the CoreMLLLMChat model picker; it auto-downloads this repo and runs it via ChunkedEngine.
Architecture (vs E2B)
| | E2B | E4B |
|---|---|---|
| `num_hidden_layers` | 35 | 42 |
| `hidden_size` | 1536 | 2560 |
| `num_key_value_heads` | 1 | 2 |
| `intermediate_size` | 6144 | 10240 |
| `num_kv_shared_layers` | 20 | 18 |
| KV producers (sliding/full) | L13 / L14 | L22 / L23 |
| Chunk boundaries | L0-7, L8-14, L15-24, L25-34 | L0-11, L12-23, L24-32, L33-41 |
Benchmarks
iPhone 17 Pro, INT4 palettized, ctx=2048, no speculative decoding:
| Metric | Value |
|---|---|
| Decode tok/s | ~14 tok/s |
| Per-step latency | ~71 ms |
| `phys_footprint` | ~4.5 GB |
| ANE placement | 100% |
Context length
The shipping bundle uses ctx=2048. To extend it, rebuild with --ctx 4096 (or higher) on a sufficiently large Mac; the ANE rejects chunks whose declared context length differs from model_config.json.
License
Inherits the Gemma terms of use from the base model.