Gemma 4 E4B-it GGUF: Quantized by BatiAI
Optimized GGUF quantizations of google/gemma-4-E4B-it, the larger Edge variant of Gemma 4 with full multimodal support (text + image + audio). Built directly from official Google BF16 weights by BatiAI for BatiFlow.
E4B doubles the parameters of E2B while keeping audio support. It is the best choice for a small Mac when you want voice + image + text but need more reasoning headroom than E2B.
Quick Start
# Recommended
ollama pull batiai/gemma4-e4b:q4
# Higher quality
ollama pull batiai/gemma4-e4b:q6
Available Quantizations
| Tag | Quant | Size | Recommended For |
|---|---|---|---|
| `:q4` | Q4_K_M | ~2.8 GB | balanced (recommended default) |
| `:q6` | Q6_K | ~3.6 GB | higher quality, near-lossless |
Two modes: text-only by default, multimodal opt-in
Upstream Gemma 4 E4B-it is fully multimodal, handling text, image, and audio in one model. In the GGUF ecosystem this is delivered as two files: a main model.gguf (text tower) and a separate mmproj.gguf that holds both vision and audio encoders together (a single 1411-tensor projector covering image and speech input).
| | Text-only (default) | Multimodal (opt-in) |
|---|---|---|
| Files needed | main GGUF only | main GGUF + mmproj-BF16.gguf |
| Capabilities | Q&A, coding, tool calling, agents | + image (OCR, captioning, visual reasoning) + audio (speech understanding) |
| `ollama pull` | Yes, single command | No (Ollama mmproj integration is still rough; use llama.cpp directly) |
| Disk / RAM | smaller (no projector weights) | larger (+ ~946 MB) |
| Recommended for | most users (chat, code, agents) | OCR, image / speech understanding |
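As a rough illustration of the disk cost of each mode, the sketch below adds the approximate sizes quoted in this README. The figures are the README's rounded values in decimal units, and the function name is made up for this example, so treat the result as an estimate rather than an exact file size.

```python
# Approximate sizes copied from the tables in this README (rounded, decimal GB).
MAIN_GGUF_GB = {"Q4_K_M": 2.8, "Q6_K": 3.6}
MMPROJ_BF16_GB = 0.946  # ~946 MB projector, only needed for multimodal use


def download_size_gb(quant: str, multimodal: bool = False) -> float:
    """Return the approximate total download size in GB for one setup."""
    size = MAIN_GGUF_GB[quant]
    if multimodal:
        size += MMPROJ_BF16_GB
    return round(size, 3)


for quant in MAIN_GGUF_GB:
    text_only = download_size_gb(quant)
    with_projector = download_size_gb(quant, multimodal=True)
    print(f"{quant}: text-only ~{text_only} GB, multimodal ~{with_projector} GB")
```

RAM use at runtime will be higher than these numbers because of the context cache and runtime buffers, which this sketch does not model.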
Multimodal usage (llama.cpp)
# Pick a main model (text tower)
wget https://huggingface.co/batiai/Gemma-4-E4B-it-GGUF/resolve/main/google-gemma-4-E4B-it-Q4_K_M.gguf
# Get the multimodal projector (vision + audio in one file)
wget https://huggingface.co/batiai/Gemma-4-E4B-it-GGUF/resolve/main/mmproj-BF16.gguf
Server mode (image input via OpenAI-compatible Vision API):
llama-server \
-m google-gemma-4-E4B-it-Q4_K_M.gguf \
--mmproj mmproj-BF16.gguf \
-c 32768 --host 127.0.0.1 --port 8080
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "text", "text": "What does this show?"}
      ]
    }]
  }'
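The request above elides the base64 image data. A minimal Python sketch of how that payload could be assembled from a local file is shown below, using only the standard library. The helper names `build_vision_payload` and `ask_about_image` are invented for this example, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import base64
import json
import urllib.request


def build_vision_payload(image_bytes: bytes, prompt: str,
                         mime: str = "image/jpeg") -> dict:
    """Build a chat-completions body with one inline image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }]
    }


def ask_about_image(path: str, prompt: str,
                    base_url: str = "http://127.0.0.1:8080") -> str:
    """POST the payload to a running llama-server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_vision_payload(f.read(), prompt)
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `ask_about_image("photo.jpg", "What does this show?")` would send the same request as the curl command, where `photo.jpg` is a placeholder path.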
One-shot CLI with image / audio:
# Image
llama-mtmd-cli -m google-gemma-4-E4B-it-Q4_K_M.gguf --mmproj mmproj-BF16.gguf \
--image ~/Desktop/photo.jpg -p "describe this image"
# Audio
llama-mtmd-cli -m google-gemma-4-E4B-it-Q4_K_M.gguf --mmproj mmproj-BF16.gguf \
--audio ~/Downloads/voice.wav -p "transcribe and summarize"
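If you want to drive the same one-shot commands from a script, one option is to build the argument list in Python and hand it to `subprocess`. This is only a sketch: `mtmd_command` is a hypothetical helper, it uses just the flags shown above, and it accepts exactly one media file per call to mirror the two examples.

```python
import os
import shlex
from typing import List, Optional


def mtmd_command(model: str, mmproj: str, prompt: str,
                 image: Optional[str] = None,
                 audio: Optional[str] = None) -> List[str]:
    """Build the argument list for one llama-mtmd-cli run."""
    if (image is None) == (audio is None):
        raise ValueError("pass exactly one of image or audio")
    cmd = ["llama-mtmd-cli", "-m", model, "--mmproj", mmproj]
    if image is not None:
        cmd += ["--image", image]
    else:
        cmd += ["--audio", audio]
    return cmd + ["-p", prompt]


cmd = mtmd_command(
    "google-gemma-4-E4B-it-Q4_K_M.gguf",
    "mmproj-BF16.gguf",
    "describe this image",
    # No shell is involved, so expand "~" explicitly.
    image=os.path.expanduser("~/Desktop/photo.jpg"),
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems when the prompt or file path contains spaces.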
mmproj quantization
| File | Quant | Size | Why BF16 only? |
|---|---|---|---|
| `mmproj-BF16.gguf` | BF16 | ~946 MB | The combined vision + audio projector tensors don't satisfy K-quant block alignment, so Q6_K aborts on this projector. BF16 is the only safe choice today, and this applies to every quantizer of this model. The main text GGUF is unaffected. |
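To make the alignment constraint concrete: llama.cpp's K-quants pack weights in super-blocks of 256 values, so a tensor's row length has to be a multiple of 256 to be K-quantized. The sketch below checks that condition. The tensor names and shapes are hypothetical and chosen only for illustration; they are not read from the actual projector file.

```python
QK_K = 256  # super-block size used by llama.cpp K-quants such as Q4_K and Q6_K


def k_quant_compatible(row_length: int) -> bool:
    """True if a row of this length splits evenly into K-quant super-blocks."""
    return row_length % QK_K == 0


def misaligned_tensors(shapes: dict) -> list:
    """Return the names of tensors whose row length is not a multiple of QK_K.

    Each shape's first entry is taken as the row length (ne[0] in ggml terms).
    """
    return sorted(name for name, shape in shapes.items()
                  if not k_quant_compatible(shape[0]))


# Hypothetical names and shapes, for illustration only.
example_shapes = {
    "v.blk.0.attn_q.weight": (1152, 1152),       # 1152 / 256 = 4.5, misaligned
    "a.blk.0.ffn_up.weight": (1536, 6144),       # 1536 / 256 = 6, aligned
    "mm.input_projection.weight": (2560, 1152),  # 2560 / 256 = 10, aligned
}

print(misaligned_tensors(example_shapes))
```

A single misaligned tensor is enough to block a pure K-quant of the whole file, which is consistent with shipping the projector in BF16.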
Why E series (E2B / E4B) vs 26B / 31B?
| | E2B / E4B | 26B-A4B / 31B |
|---|---|---|
| Audio support | Yes | No (vision only) |
| Min RAM | 8 GB+ | 24 GB+ |
| Speed | very fast (small) | slower (larger) |
| Reasoning depth | lower | higher |
| Use case | edge / mobile-class Mac / voice + image | desktop chat + agents |
If you need voice + image + text in one model on a small Mac, the E series is the only Gemma 4 option. Choose E4B if you want a bit more capability than E2B.
Why BatiAI?
| | BatiAI | Third-party (unsloth, etc.) |
|---|---|---|
| Source | Quantized directly from official Google weights | Often re-quantized from other GGUFs |
| Compatibility | Verified on Ollama 0.20+ | Known issues with Ollama 0.20+ |
| Tested on | Real Mac mini M4 (16 GB) + MacBook Pro M4 Max | Often untested |
| Tool calling | Verified with BatiFlow's 57 tool functions | Often broken |
| Korean | Validated | Not tested |
| Multimodal | Vision + audio mmproj available | Often missing |
| Signing | `general.author: BatiAI` for provenance | Varies |
About BatiFlow
BatiFlow is a macOS-native AI desktop automation app, just 5 MB and built with Swift.
- Free & Unlimited: On-device AI via Ollama, no API costs
- 100% Private: All data stays on your Mac
- Ultra Lightweight: Native macOS app, only 5 MB
- 57 built-in tools: calendar, notes, reminders, files, email, browser, messaging, and more
Related models in the BatiAI Gemma 4 lineup
| Model | Modalities | Min RAM | Repo |
|---|---|---|---|
| Gemma 4 E2B-it | text + image + audio | 8 GB | batiai/Gemma-4-E2B-it-GGUF |
| Gemma 4 E4B-it | text + image + audio | 8 GB | this repo |
| Gemma 4 26B-A4B-it | text + image / video (no audio) | 24 GB | batiai/Gemma-4-26B-A4B-it-GGUF |
| Gemma 4 31B-it | text + image / video (no audio) | 24 GB | batiai/Gemma-4-31B-it-GGUF |
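The lineup table can be read as a simple filter on available RAM and whether you need audio input. The sketch below encodes the table's values directly; the `candidates` helper is hypothetical and only restates the table, it does not measure anything.

```python
# Rows copied from the lineup table above.
LINEUP = [
    {"model": "Gemma 4 E2B-it", "audio": True, "min_ram_gb": 8},
    {"model": "Gemma 4 E4B-it", "audio": True, "min_ram_gb": 8},
    {"model": "Gemma 4 26B-A4B-it", "audio": False, "min_ram_gb": 24},
    {"model": "Gemma 4 31B-it", "audio": False, "min_ram_gb": 24},
]


def candidates(ram_gb: int, need_audio: bool = False) -> list:
    """List models whose minimum RAM fits and that cover the audio requirement."""
    return [
        row["model"]
        for row in LINEUP
        if row["min_ram_gb"] <= ram_gb and (row["audio"] or not need_audio)
    ]


print(candidates(16, need_audio=True))
print(candidates(32))
```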
Technical Details
- Original Model: google/gemma-4-E4B-it
- Architecture: Gemma 4 (Edge variant)
- Modalities: Text (primary) + Image + Audio (via opt-in mmproj)
- License: Gemma (commercial use permitted under terms)
- Quantized with: llama.cpp
- Quantized by: BatiAI
- GGUF metadata: `general.author: BatiAI`, `general.url: https://flow.bati.ai`
License
Mirrors the upstream Gemma license. Commercial use permitted per Google's Gemma terms.
BatiAI's quantization pipeline is provided under MIT.