Use it from Swift
Add the package
Package.swift:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
Platforms: iOS 18+ / macOS 15+.
Download + chat (one call)
import CoreMLLLM
// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E4B-coreml")
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
Multi-turn: keep a [CoreMLLLM.Message] array, append each
user/assistant turn, and pass the whole history to
generate(_:maxTokens:) again. Call llm.reset() to start a new
conversation (it clears the KV cache).
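A minimal multi-turn sketch built on the API above. The `history` array and `ask` helper are illustrative names, and the `.assistant` role is an assumption (only `.user` appears in the snippet above):

```swift
import CoreMLLLM

// Illustrative multi-turn loop; `.assistant` role is assumed.
var history: [CoreMLLLM.Message] = []

func ask(_ llm: CoreMLLLM, _ text: String) async throws -> String {
    history.append(CoreMLLLM.Message(role: .user, content: text))

    var reply = ""
    let stream = try await llm.generate(history, maxTokens: 256)
    for await chunk in stream {
        reply += chunk
        print(chunk, terminator: "")
    }

    // Keep the model's turn so the next call sees the full conversation.
    history.append(CoreMLLLM.Message(role: .assistant, content: reply))
    return reply
}

// To start over: llm.reset() clears the KV cache; also drop the local history.
// llm.reset(); history.removeAll()
```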
Gemma 4 E4B → Core ML (INT4, Apple Neural Engine)
Core ML port of google/gemma-4-E4B-it (the 4B-effective Gemma 4 decoder), chunked into 4 sliding-window-attention pieces for the Apple Neural Engine. Produced by john-rocky/CoreML-LLM via:
python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048
Files
chunk1.mlmodelc/                        # L0-11: INT4 palettized, owns its own KV
chunk2.mlmodelc/                        # L12-23: emits kv13_*/kv14_* aliases for producer L22/L23
chunk3.mlmodelc/                        # L24-32: KV-shared
chunk4.mlmodelc/                        # L33-41 + lm_head: multi-function (decode_q1 + verify_qK)
embed_tokens_q8.bin                     640 MB  - INT8 token embeddings (262144 × 2560)
embed_tokens_scales.bin                 512 KB
embed_tokens_per_layer_q8.bin           2.6 GB  - INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin       512 KB
per_layer_projection.bin                53 MB   - fp16 PLE projection
per_layer_norm_weight.bin               512 B   - fp16 PLE norm
cos_full.npy / cos_sliding.npy          4 MB / 2 MB - precomputed RoPE cos
sin_full.npy / sin_sliding.npy          4 MB / 2 MB - precomputed RoPE sin
model_config.json                       711 B   - runtime config (used by the Swift app's loader)
hf_model/
├── tokenizer.json
├── tokenizer_config.json
├── config.json
└── generation_config.json
The producer-layer KV outputs always surface under the fixed names kv13_* / kv14_*, regardless of the actual layer index, so the Swift runtime needs no model-specific wiring on the iOS side.
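For illustration only (this is not the ChunkedEngine code), wiring by prefix could look like the sketch below; the helper name is hypothetical, and only the kv13_*/kv14_* prefixes come from the bundle:

```swift
import CoreML

// Hypothetical helper: collect the producer-layer KV outputs of one chunk by their
// fixed alias prefixes so they can be fed to the next chunk without knowing which
// layers actually produced them.
func sharedKV(from output: MLFeatureProvider) -> [String: MLFeatureValue] {
    var kv: [String: MLFeatureValue] = [:]
    for name in output.featureNames
    where name.hasPrefix("kv13_") || name.hasPrefix("kv14_") {
        if let value = output.featureValue(for: name) {
            kv[name] = value
        }
    }
    return kv
}
```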
Why so many sidecars (vs a single model.mlpackage)?
Gemma 3n / 4 E-series uses a per-layer embedding (PLE) bank that's much larger than the token embedding (2.6 GB vs 640 MB here). Loading PLE through Core ML would dequantize the entire bank into the CPU heap and balloon phys_footprint. We mmap the raw INT8 + scale .bin files instead, dequantize the few rows touched per token in pure Swift, and feed the result to the chunks. The chunks themselves are pure transformer bodies and stay ANE-resident.
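A rough sketch of that row-level dequantization, assuming one fp16 scale per row (the 512 KB scales file matches 262144 rows × 2 bytes) and signed INT8 codes; the authoritative layout is whatever ChunkedEngine.swift implements, so treat this as illustrative only:

```swift
import Foundation

// Illustrative only: mmap the embedding sidecars and dequantize one row per token.
// Assumes one fp16 scale per row and signed INT8 codes.
struct MappedEmbeddings {
    let cols: Int          // 2560 for E4B
    let weights: Data      // embed_tokens_q8.bin, memory-mapped
    let scales: Data       // embed_tokens_scales.bin, memory-mapped

    init(dir: URL, cols: Int = 2560) throws {
        self.cols = cols
        weights = try Data(contentsOf: dir.appendingPathComponent("embed_tokens_q8.bin"),
                           options: .alwaysMapped)
        scales = try Data(contentsOf: dir.appendingPathComponent("embed_tokens_scales.bin"),
                          options: .alwaysMapped)
    }

    /// Dequantize the embedding row for a single token id into Float32.
    func row(for tokenID: Int) -> [Float] {
        let scale: Float = scales.withUnsafeBytes {
            Float($0.bindMemory(to: Float16.self)[tokenID])
        }
        let start = tokenID * cols
        return weights.withUnsafeBytes { raw -> [Float] in
            let q = raw.bindMemory(to: Int8.self)
            return (0..<cols).map { Float(q[start + $0]) * scale }
        }
    }
}
```

Only the rows actually touched per token are materialized, which is what keeps phys_footprint well below the 3+ GB the full embedding banks would otherwise occupy.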
The .npy RoPE tables are pre-baked at conversion-time so Swift doesn't need to ship a cos/sin builder.
Tokenizer
Already included in hf_model/. If you prefer the upstream copy:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-E4B-it")
Standalone usage (Python / Mac)
from huggingface_hub import snapshot_download
import coremltools as ct, numpy as np, json
local = snapshot_download("mlboydaisuke/gemma-4-E4B-coreml")
cfg = json.load(open(f"{local}/model_config.json"))
# .mlmodelc is already compiled, so load it with CompiledMLModel
chunks = [ct.models.CompiledMLModel(f"{local}/chunk{i}.mlmodelc")
          for i in range(1, 5)]
The .mlmodelc directories carry compiled Core ML programs, so there is no compile step on macOS / iPhone; a Mac Studio with MLModelConfiguration.computeUnits = .cpuAndNeuralEngine executes them on the ANE directly.
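For reference, a minimal Swift equivalent of loading one compiled chunk with that compute-unit setting; the path is a placeholder for wherever the snapshot was downloaded:

```swift
import CoreML

// Sketch: load one compiled chunk and pin it to CPU + Neural Engine.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let url = URL(fileURLWithPath: "/path/to/gemma-4-E4B-coreml/chunk1.mlmodelc")
let chunk1 = try MLModel(contentsOf: url, configuration: config)
print(chunk1.modelDescription.inputDescriptionsByName.keys.sorted())
```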
For the full PLE-aware decode loop, see Sources/CoreMLLLM/ChunkedEngine.swift (the canonical implementation); mirror it in Python by:
- mmap'ing embed_tokens_q8.bin (uint8) + embed_tokens_scales.bin (fp16) and dequantizing the row for the current token,
- mmap'ing embed_tokens_per_layer_q8.bin + embed_tokens_per_layer_scales.bin (per-layer rows, dequantized on demand),
- running chunk1..chunk4, threading the kv* outputs from chunk2 as inputs to chunks 3-4 (KV alias names follow the producer-layer convention).
iOS / Mac app
Pick Gemma 4 E4B in the CoreMLLLMChat model picker; it auto-downloads this repo and runs it via ChunkedEngine.
Architecture (vs E2B)
| | E2B | E4B |
|---|---|---|
| `num_hidden_layers` | 35 | 42 |
| `hidden_size` | 1536 | 2560 |
| `num_key_value_heads` | 1 | 2 |
| `intermediate_size` | 6144 | 10240 |
| `num_kv_shared_layers` | 20 | 18 |
| KV producers (sliding/full) | L13 / L14 | L22 / L23 |
| Chunk boundaries | L0-7, L8-14, L15-24, L25-34 | L0-11, L12-23, L24-32, L33-41 |
Benchmarks
iPhone 17 Pro, INT4 palettized, ctx=2048, no speculative decoding:
| Metric | Value |
|---|---|
| Decode tok/s | ~14 tok/s |
| Per-step latency | ~71 ms |
| `phys_footprint` | ~4.5 GB |
| ANE placement | 100% |
Context length
The shipping bundle uses ctx=2048. To extend it, rebuild with --ctx 4096 (or higher) on a sufficiently large Mac; the ANE rejects chunks whose declared context length differs from model_config.json.
License
Inherits the Gemma terms of use from the base model.