Instructions to use munyew/MERaLiON-2-3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use munyew/MERaLiON-2-3B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="munyew/MERaLiON-2-3B-GGUF",
	filename="meralion-2-3b-decoder-q4_k_m.gguf",
)

llm.create_chat_completion(
	messages = "\"sample1.flac\""
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use munyew/MERaLiON-2-3B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/munyew/MERaLiON-2-3B-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use munyew/MERaLiON-2-3B-GGUF with Ollama:
```
ollama run hf.co/munyew/MERaLiON-2-3B-GGUF:Q4_K_M
```

Unsloth Studio new

How to use munyew/MERaLiON-2-3B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for munyew/MERaLiON-2-3B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for munyew/MERaLiON-2-3B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for munyew/MERaLiON-2-3B-GGUF to start chatting

Docker Model Runner
How to use munyew/MERaLiON-2-3B-GGUF with Docker Model Runner:
```
docker model run hf.co/munyew/MERaLiON-2-3B-GGUF:Q4_K_M
```

Lemonade

How to use munyew/MERaLiON-2-3B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull munyew/MERaLiON-2-3B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.MERaLiON-2-3B-GGUF-Q4_K_M

List all available models

lemonade list

MERaLiON-2-3B GGUF — Quantized mmproj + First Validated Android Deployment

This repository contains GGUF-format files for MERaLiON-2-3B, Singapore's national multimodal speech-language model developed by A*STAR I²R under the National Multimodal Language Programme (NMLP).

Relationship to official GGUF release: The official A*STAR team published GGUF files at MERaLiON/MERaLiON-2-3B-GGUF on 18 April 2026, including decoder files and an fp16 mmproj. This repository extends that work with:

A quantized mmproj (Q4_K_M, 533MB) — reducing the audio encoder footprint by 68% vs the official fp16 (1.7GB)

Validated on-device Android deployment — confirmed working on Honor Magic 8 Pro Air via Termux + llama.cpp Vulkan

Detailed Android deployment guide — bridge architecture, Phantom Process Killer workarounds, thermal management

Converted and validated by Mun Yew Loh, 22 April 2026.

What's New vs Official Release

File	Official MERaLiON/MERaLiON-2-3B-GGUF	This repo
Decoder Q4_K_M	✅ 1.6GB	✅ 1.6GB (identical)
Decoder Q8_0	✅ 2.6GB	❌ not included
mmproj F16	✅ 1.7GB	✅ 1.7GB (identical)
mmproj Q4_K_M	❌ not available	✅ 533MB (NEW)
Android deployment guide	❌ not documented	✅ fully documented
Mobile inference validation	❌ not validated	✅ Honor Magic 8 Pro Air

Files

File	Size	Description
`meralion-2-3b-decoder-q4_k_m.gguf`	1.6 GB	Text decoder (Gemma-2-2B), Q4_K_M — identical to official release
`meralion-2-3b-mmproj-q4_k_m.gguf`	533 MB	Audio encoder + adapter, Q4_K_M — first quantized mmproj
`meralion-2-3b-mmproj-f16.gguf`	1.7 GB	Audio encoder + adapter, original F16 precision — identical to official release

Total footprint (q4 decoder + q4 mmproj): 2.1 GB — vs 3.3 GB with fp16 mmproj.

Model Architecture

MERaLiON-2-3B combines:

Speech encoder: Localized Whisper-large-v3 (497 tensors, ~1.28B params)
Text decoder: Gemma-2-2B-IT (288 tensors, 2.61B params)
Audio adapter: Custom MERaLiON projector (meralion type) bridging encoder → decoder
Projector stack factor: 15 (audio compression ratio)
Audio sample rate: 16,000 Hz, mono
Mel bins: 128

Supports: ASR, Spoken QA, Paralinguistic Analysis (emotion/tone/stress), Code-switching (Singlish, Mandarin-English, Malay-English), multilingual SEA speech understanding.

mmproj Quantization Notes

The official release provides mmproj in fp16 only (1.7GB). This repository adds a Q4_K_M quantized mmproj (533MB). Key findings from the quantization process:

Q8_0 fails — conv1d tensors in the Whisper encoder are not divisible by 32, causing llama_model_quantize: failed to quantize: no tensor type fallback is defined for type q8_0
Q4_K_M succeeds with 2 fallback tensors (minor precision impact on those 2 tensors)
68% size reduction: 1638.37 MiB → 532.90 MiB
Inference quality: validated equivalent transcription accuracy on Singlish test phrases

Quantization command:

llama-quantize meralion-2-3b-mmproj-f16.gguf \
               meralion-2-3b-mmproj-q4_k_m.gguf Q4_K_M
# Output: quant size = 532.90 MiB (5.22 BPW), 2 fallback tensors

Android Deployment

MERaLiON-2-3B has been validated running fully on-device on an Honor Magic 8 Pro Air via Termux + llama.cpp with Vulkan GPU acceleration. This is the first publicly documented Android deployment of MERaLiON-2-3B.

Hardware

Component	Spec
Device	Honor Magic 8 Pro Air (荣耀Magic8 Pro Air)
SoC	Snapdragon 8 Elite Gen 5
RAM	16GB
GPU	Mali-G1-Ultra MC12 (Vulkan 1.4.349)
GPU VRAM	15,405 MiB
OS	HarmonyOS (Android-compatible)
Storage required	~2.1GB (q4 decoder + q4 mmproj)

Software Stack

Runtime: llama.cpp build b8762+ (MERaLiON support merged via PR #21756)
GPU backend: Vulkan (-ngl 99, 27/27 layers offloaded to Mali GPU)
Inference wrapper: Flask bridge on port 8081 (Python 3, Termux)
Frontend: Native Android APK + PWA fallback (Chrome)

Build Instructions (Android/Termux)

# Install dependencies
pkg install cmake clang python git vulkan-tools shaderc spirv-headers

# Clone and build with Vulkan
git clone https://github.com/ggml-org/llama.cpp --depth=1
cd llama.cpp
cmake -B build-vulkan -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j4

# Verify GPU is visible
./build-vulkan/bin/llama-mtmd-cli --list-devices
# Expected: Vulkan0: Mali-G1-Ultra MC12 (15405 MiB, 15405 MiB free)

Inference Command

llama-mtmd-cli \
  -m meralion-2-3b-decoder-q4_k_m.gguf \
  --mmproj meralion-2-3b-mmproj-q4_k_m.gguf \
  --audio input.wav \
  -p "Follow the text instruction based on the following audio: <__media__>\nTranscribe exactly what was said." \
  -n 80 -e -ngl 99

Performance (Honor Magic 8 Pro Air, cool device)

Metric	Value
GPU layers offloaded	27/27 (full offload)
Audio encoding time	~18s
Prompt eval time	~25s (5.1 tok/s)
Token generation	~1s (20 tok/s)
Total ASR response	~34–36s
Thermal throttle onset	~3 consecutive requests

Known Android Limitations

Thermal throttling — sustained inference causes slowdown after ~2-3 requests. Allow 10-15 min cooldown between heavy sessions.
Phantom Process Killer — Android 12+ kills child processes after ~60s. Fix: run llama-mtmd-cli via nohup shell script with start_new_session=True

   sh_file = tempfile.mktemp(suffix='.sh')
   with open(sh_file, 'w') as f:
       f.write(f'#!/bin/sh\n{cmd} > {out_file} 2>&1\n')
   os.chmod(sh_file, 0o755)
   proc = subprocess.Popen(['nohup', sh_file], start_new_session=True,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

HarmonyOS adb incompatibility — Honor devices use HDB protocol; standard adb pair wireless pairing fails. Use USB file transfer for APK installation.
Reply generation via silence.wav — second llama-mtmd-cli call for response generation returns empty output in current builds. Use llama-server separately for text generation, or implement keyword-based fallback.

Validated Singlish Transcription Samples

Input phrase	Transcript output
"eh MINA good morning lah"	"Hey, Mina, good morning, lah."
"walao I very stressed cannot take it anymore"	"wow I very stressed. Can't take it anymore. lah."
"I feel very lonely nobody wants to talk to me sia"	"I feel very lonely. Nobody wants to talk to me, sia."
"my dog just passed away I very sad"	"err, mina, my dog just passed away. I'm very sad."
"good afternoon mina wah today very hot sia"	"good afternoon, mina wah today very hot sia."

Paralinguistic Capabilities

MERaLiON-2-3B detects paralinguistic features from voice alongside transcription:

Emotion: happy, calm, stressed, anxious, sad, angry, fearful
Speech signals: fast speech, hesitations, voice tremors
Distress indicators: risk level (none/low/medium/high)

Particularly suitable for:

Mental health helpline operator support (MOH-T National Mental Helpline 1771)
Scam call detection and voice risk scoring
Healthcare consultation transcription with emotional context
Elderly companionship and digital inclusion applications

Relationship to Official Release

This repository is an independent community contribution extending the official MERaLiON/MERaLiON-2-3B-GGUF release. All model weights are derived from the original MERaLiON/MERaLiON-2-3B safetensors published by A*STAR I²R. The decoder GGUF and fp16 mmproj files in this repository are functionally identical to those in the official release — only the Q4_K_M mmproj is new.

Users should refer to the official repository for the most up-to-date releases and A*STAR-supported files.

Acknowledgements

MERaLiON model: A*STAR Institute for Infocomm Research (I²R)
Official GGUF release: MERaLiON/MERaLiON-2-3B-GGUF
llama.cpp MERaLiON support: PR #21756 (merged build b8762)
Contributor: Mun Yew Loh, April 2026

This repository is subject to the MERaLiON Public License v3. All derivative works must comply with the original license terms.

Citation

If you use the quantized mmproj or Android deployment guide from this repository:

@misc{loh2026meralion-android,
  title={MERaLiON-2-3B: Quantized mmproj and Android Deployment},
  author={Loh, Mun Yew},
  year={2026},
  url={https://huggingface.co/munyew/MERaLiON-2-3B-GGUF},
  note={Quantized mmproj and Android deployment guide for MERaLiON/MERaLiON-2-3B-GGUF}
}

Please also cite the original MERaLiON papers listed at MERaLiON/MERaLiON-2-3B-GGUF.

Android Deployment with Watchdog

The Phantom Process Killer in Android 12+ kills llama-server and bridge.py aggressively, requiring constant manual restarts.

Use start_mina.sh (available in munyew/roar-companion) for an automatic restart watchdog that keeps both processes alive all day:

chmod +x ~/start_mina.sh
~/start_mina.sh

The watchdog script:

Kills any stale llama-server / bridge.py processes on launch
Acquires a termux-wake-lock to prevent CPU sleep
Starts llama-server with full Vulkan offload (-ngl 99)
Waits 20 s for model load, then starts bridge.py via nohup
Enters an infinite loop — checks every 10 s and auto-restarts either process if Android kills it
Logs every restart with a timestamp

Tested on Honor Magic 8 Pro Air (Snapdragon 8 Elite, 16 GB RAM).
The MINA Android companion app detects bridge loss automatically and shows Reconnecting... → ● Ready without any user action.