Use it from Swift

Add the package

Package.swift:

.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.
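
Put together, a minimal Package.swift might look like this (a sketch; the package/target name MyChatApp and the tools version are placeholders, not part of this repo):

// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyChatApp",                        // placeholder name
    platforms: [.iOS(.v18), .macOS(.v15)],
    dependencies: [
        .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyChatApp",
            dependencies: [
                .product(name: "CoreMLLLM", package: "CoreML-LLM"),
            ]
        ),
    ]
)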

Download + chat (one call)

import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E4B-coreml")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}

Multi-turn: keep a [CoreMLLLM.Message] array, append the user and assistant turns, and pass the whole history to generate(_:) again. Call llm.reset() to start a new conversation (this clears the KV cache).
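
A sketch of a two-turn exchange built on the calls above (assumes an .assistant role case alongside .user):

var history: [CoreMLLLM.Message] = [
    .init(role: .user, content: "Hello!")
]

// First turn: stream the reply and keep the full text for the history.
var reply = ""
let first = try await llm.generate(history, maxTokens: 256)
for await chunk in first { reply += chunk }
history.append(.init(role: .assistant, content: reply))   // assumed role case

// Second turn: append the new user message and resend the whole history.
history.append(.init(role: .user, content: "Can you say that more formally?"))
reply = ""
let second = try await llm.generate(history, maxTokens: 256)
for await chunk in second { reply += chunk }

// Start a fresh conversation (clears the KV cache).
llm.reset()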

Gemma 4 E4B - Core ML (INT4, Apple Neural Engine)

Core ML port of google/gemma-4-E4B-it (the 4B-effective Gemma 4 decoder), chunked into 4 sliding-window-attention pieces for Apple Neural Engine. Produced by john-rocky/CoreML-LLM via:

python conversion/build_gemma4_bundle.py --model gemma4-e4b --ctx 2048

Files

chunk1.mlmodelc/   # L0-11   - INT4 palettized, owns its own KV
chunk2.mlmodelc/   # L12-23  - emits kv13_*/kv14_* aliases for producer L22/L23
chunk3.mlmodelc/   # L24-32  - KV-shared
chunk4.mlmodelc/   # L33-41 + lm_head - multi-function (decode_q1 + verify_qK)

embed_tokens_q8.bin                640 MB       - INT8 token embeddings (262144 × 2560)
embed_tokens_scales.bin            512 KB
embed_tokens_per_layer_q8.bin      2.6 GB       - INT8 per-layer embeddings (PLE)
embed_tokens_per_layer_scales.bin  512 KB
per_layer_projection.bin           53 MB        - fp16 PLE projection
per_layer_norm_weight.bin          512 B        - fp16 PLE norm
cos_full.npy / cos_sliding.npy     4 MB / 2 MB  - precomputed RoPE cos
sin_full.npy / sin_sliding.npy     4 MB / 2 MB  - precomputed RoPE sin

model_config.json    711 B  - runtime config (used by the Swift app's loader)
hf_model/
  ├── tokenizer.json
  ├── tokenizer_config.json
  ├── config.json
  └── generation_config.json

The Swift runtime renames the producer-layer KV outputs to kv13_* / kv14_* regardless of the actual layer index, so the iOS side needs no model-specific wiring.

Why so many sidecars (vs a single model.mlpackage)?

Gemma 3n / 4 E-series uses a per-layer embedding (PLE) bank that's much larger than the token embedding (2.6 GB vs 640 MB here). Loading PLE through Core ML would dequantize the entire bank into the CPU heap and balloon phys_footprint. We mmap the raw INT8 + scale .bin files instead, dequantize the few rows touched per token in pure Swift, and feed the result to the chunks. The chunks themselves are pure transformer bodies and stay ANE-resident.
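
As an illustration, the per-row dequantization boils down to something like the following (a minimal sketch with hypothetical names; the real logic lives in ChunkedEngine.swift):

import Foundation

// Hypothetical helper: memory-maps the INT8 embedding bank plus fp16 per-row scales
// and dequantizes a single row on demand. Not the actual ChunkedEngine code.
struct QuantizedEmbeddingTable {
    let q8: Data        // embed_tokens_q8.bin, memory-mapped
    let scales: Data    // embed_tokens_scales.bin, memory-mapped
    let hidden: Int     // 2560 for E4B

    init(q8URL: URL, scalesURL: URL, hidden: Int = 2560) throws {
        // .alwaysMapped keeps the 640 MB bank out of phys_footprint until rows are touched.
        q8 = try Data(contentsOf: q8URL, options: .alwaysMapped)
        scales = try Data(contentsOf: scalesURL, options: .alwaysMapped)
        self.hidden = hidden
    }

    func row(for token: Int) -> [Float] {
        let scale = scales.withUnsafeBytes {
            Float($0.bindMemory(to: Float16.self)[token])   // assumes one fp16 scale per row
        }
        let start = token * hidden
        return q8.withUnsafeBytes { raw in
            let int8 = raw.bindMemory(to: Int8.self)
            return (0..<hidden).map { Float(int8[start + $0]) * scale }
        }
    }
}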

The .npy RoPE tables are pre-baked at conversion time so Swift doesn't need to ship a cos/sin builder.

Tokenizer

Already included in hf_model/. If you prefer the upstream copy:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-E4B-it")

Standalone usage (Python / Mac)

from huggingface_hub import snapshot_download
import coremltools as ct, numpy as np, json

local = snapshot_download("mlboydaisuke/gemma-4-E4B-coreml")
cfg   = json.load(open(f"{local}/model_config.json"))
chunks = [ct.models.CompiledMLModel(f"{local}/chunk{i}.mlmodelc")
          for i in range(1, 5)]

The .mlmodelc directories are already-compiled Core ML programs, so there is no compile step on macOS or iPhone; a Mac Studio with MLModelConfiguration.computeUnits = .cpuAndNeuralEngine will execute them on the ANE directly.
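
The Swift-side equivalent is a direct MLModel load with that configuration (a sketch; bundleDir is assumed to point at the downloaded snapshot):

import CoreML

let bundleDir = URL(fileURLWithPath: "/path/to/gemma-4-E4B-coreml")   // assumption

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// .mlmodelc loads directly; no compile step.
let chunks = try (1...4).map { i in
    try MLModel(contentsOf: bundleDir.appendingPathComponent("chunk\(i).mlmodelc"),
                configuration: config)
}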

For the full PLE-aware decode loop, see Sources/CoreMLLLM/ChunkedEngine.swift (the canonical implementation); mirror it in Python by:

  1. mmap'ing embed_tokens_q8.bin (uint8) + embed_tokens_scales.bin (fp16) and dequantizing the row for the current token,
  2. mmap'ing embed_tokens_per_layer_q8.bin + embed_tokens_per_layer_scales.bin (per-layer rows, dequant on demand),
  3. running chunk1..chunk4, threading kv* outputs from chunk2 as inputs to chunks 3-4 (KV alias names follow the producer-layer convention); a Swift sketch of this threading follows below.
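
Since the canonical loop is Swift, the kv threading in step 3 conceptually reduces to something like this in Core ML's Swift API (an illustrative sketch; runChunks is a hypothetical helper, not the actual ChunkedEngine code):

import CoreML

// Run the chunks in order, carrying every output (hidden state, kv13_*/kv14_* aliases)
// forward so later chunks can consume it as an input.
func runChunks(_ chunks: [MLModel], inputs: [String: MLFeatureValue]) throws -> MLFeatureProvider {
    var carried = inputs
    var last: MLFeatureProvider = try MLDictionaryFeatureProvider(dictionary: [:])
    for chunk in chunks {
        // Keep only the features this chunk actually declares as inputs.
        let wanted = chunk.modelDescription.inputDescriptionsByName.keys
        let provider = try MLDictionaryFeatureProvider(
            dictionary: carried.filter { wanted.contains($0.key) }
        )
        last = try chunk.prediction(from: provider)
        for name in last.featureNames {
            carried[name] = last.featureValue(for: name)
        }
    }
    return last
}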

iOS / Mac app

Pick Gemma 4 E4B in the CoreMLLLMChat model picker; it auto-downloads this repo and runs it via ChunkedEngine.

Architecture (vs E2B)

                              E2B                            E4B
num_hidden_layers             35                             42
hidden_size                   1536                           2560
num_key_value_heads           1                              2
intermediate_size             6144                           10240
num_kv_shared_layers          20                             18
KV producers (sliding/full)   L13 / L14                      L22 / L23
Chunk boundaries              L0-7, L8-14, L15-24, L25-34    L0-11, L12-23, L24-32, L33-41

Benchmarks

iPhone 17 Pro, INT4 palettized, ctx=2048, no speculative decoding:

Metric             Value
Decode tok/s       ~14 tok/s
Per-step latency   ~71 ms
phys_footprint     ~4.5 GB
ANE placement      100%

Context length

Shipping bundle is ctx=2048. Rebuild with --ctx 4096 (or higher) on a sufficiently large Mac to extend; the ANE rejects chunks whose declared context differs from model_config.json.

License

Inherits the Gemma terms of use from the base model.
