Instructions to use munyew/MERaLiON-2-3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use munyew/MERaLiON-2-3B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="munyew/MERaLiON-2-3B-GGUF", filename="meralion-2-3b-decoder-q4_k_m.gguf", )
llm.create_chat_completion( messages = "\"sample1.flac\"" )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use munyew/MERaLiON-2-3B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf munyew/MERaLiON-2-3B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/munyew/MERaLiON-2-3B-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use munyew/MERaLiON-2-3B-GGUF with Ollama:
ollama run hf.co/munyew/MERaLiON-2-3B-GGUF:Q4_K_M
- Unsloth Studio new
How to use munyew/MERaLiON-2-3B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for munyew/MERaLiON-2-3B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for munyew/MERaLiON-2-3B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for munyew/MERaLiON-2-3B-GGUF to start chatting
- Docker Model Runner
How to use munyew/MERaLiON-2-3B-GGUF with Docker Model Runner:
docker model run hf.co/munyew/MERaLiON-2-3B-GGUF:Q4_K_M
- Lemonade
How to use munyew/MERaLiON-2-3B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull munyew/MERaLiON-2-3B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.MERaLiON-2-3B-GGUF-Q4_K_M
List all available models
lemonade list
MERaLiON-2-3B GGUF โ Quantized mmproj + First Validated Android Deployment
This repository contains GGUF-format files for MERaLiON-2-3B, Singapore's national multimodal speech-language model developed by A*STAR IยฒR under the National Multimodal Language Programme (NMLP).
Relationship to official GGUF release: The official A*STAR team published GGUF files at MERaLiON/MERaLiON-2-3B-GGUF on 18 April 2026, including decoder files and an fp16 mmproj. This repository extends that work with:
- A quantized mmproj (Q4_K_M, 533MB) โ reducing the audio encoder footprint by 68% vs the official fp16 (1.7GB)
- Validated on-device Android deployment โ confirmed working on Honor Magic 8 Pro Air via Termux + llama.cpp Vulkan
- Detailed Android deployment guide โ bridge architecture, Phantom Process Killer workarounds, thermal management
Converted and validated by Mun Yew Loh, 22 April 2026.
What's New vs Official Release
| File | Official MERaLiON/MERaLiON-2-3B-GGUF | This repo |
|---|---|---|
| Decoder Q4_K_M | โ 1.6GB | โ 1.6GB (identical) |
| Decoder Q8_0 | โ 2.6GB | โ not included |
| mmproj F16 | โ 1.7GB | โ 1.7GB (identical) |
| mmproj Q4_K_M | โ not available | โ 533MB (NEW) |
| Android deployment guide | โ not documented | โ fully documented |
| Mobile inference validation | โ not validated | โ Honor Magic 8 Pro Air |
Files
| File | Size | Description |
|---|---|---|
meralion-2-3b-decoder-q4_k_m.gguf |
1.6 GB | Text decoder (Gemma-2-2B), Q4_K_M โ identical to official release |
meralion-2-3b-mmproj-q4_k_m.gguf |
533 MB | Audio encoder + adapter, Q4_K_M โ first quantized mmproj |
meralion-2-3b-mmproj-f16.gguf |
1.7 GB | Audio encoder + adapter, original F16 precision โ identical to official release |
Total footprint (q4 decoder + q4 mmproj): 2.1 GB โ vs 3.3 GB with fp16 mmproj.
Model Architecture
MERaLiON-2-3B combines:
- Speech encoder: Localized Whisper-large-v3 (497 tensors, ~1.28B params)
- Text decoder: Gemma-2-2B-IT (288 tensors, 2.61B params)
- Audio adapter: Custom MERaLiON projector (
meraliontype) bridging encoder โ decoder - Projector stack factor: 15 (audio compression ratio)
- Audio sample rate: 16,000 Hz, mono
- Mel bins: 128
Supports: ASR, Spoken QA, Paralinguistic Analysis (emotion/tone/stress), Code-switching (Singlish, Mandarin-English, Malay-English), multilingual SEA speech understanding.
mmproj Quantization Notes
The official release provides mmproj in fp16 only (1.7GB). This repository adds a Q4_K_M quantized mmproj (533MB). Key findings from the quantization process:
- Q8_0 fails โ conv1d tensors in the Whisper encoder are not divisible by 32, causing
llama_model_quantize: failed to quantize: no tensor type fallback is defined for type q8_0 - Q4_K_M succeeds with 2 fallback tensors (minor precision impact on those 2 tensors)
- 68% size reduction: 1638.37 MiB โ 532.90 MiB
- Inference quality: validated equivalent transcription accuracy on Singlish test phrases
Quantization command:
llama-quantize meralion-2-3b-mmproj-f16.gguf \
meralion-2-3b-mmproj-q4_k_m.gguf Q4_K_M
# Output: quant size = 532.90 MiB (5.22 BPW), 2 fallback tensors
Android Deployment
MERaLiON-2-3B has been validated running fully on-device on an Honor Magic 8 Pro Air via Termux + llama.cpp with Vulkan GPU acceleration. This is the first publicly documented Android deployment of MERaLiON-2-3B.
Hardware
| Component | Spec |
|---|---|
| Device | Honor Magic 8 Pro Air (่ฃ่Magic8 Pro Air) |
| SoC | Snapdragon 8 Elite Gen 5 |
| RAM | 16GB |
| GPU | Mali-G1-Ultra MC12 (Vulkan 1.4.349) |
| GPU VRAM | 15,405 MiB |
| OS | HarmonyOS (Android-compatible) |
| Storage required | ~2.1GB (q4 decoder + q4 mmproj) |
Software Stack
- Runtime: llama.cpp build b8762+ (MERaLiON support merged via PR #21756)
- GPU backend: Vulkan (
-ngl 99, 27/27 layers offloaded to Mali GPU) - Inference wrapper: Flask bridge on port 8081 (Python 3, Termux)
- Frontend: Native Android APK + PWA fallback (Chrome)
Build Instructions (Android/Termux)
# Install dependencies
pkg install cmake clang python git vulkan-tools shaderc spirv-headers
# Clone and build with Vulkan
git clone https://github.com/ggml-org/llama.cpp --depth=1
cd llama.cpp
cmake -B build-vulkan -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j4
# Verify GPU is visible
./build-vulkan/bin/llama-mtmd-cli --list-devices
# Expected: Vulkan0: Mali-G1-Ultra MC12 (15405 MiB, 15405 MiB free)
Inference Command
llama-mtmd-cli \
-m meralion-2-3b-decoder-q4_k_m.gguf \
--mmproj meralion-2-3b-mmproj-q4_k_m.gguf \
--audio input.wav \
-p "Follow the text instruction based on the following audio: <__media__>\nTranscribe exactly what was said." \
-n 80 -e -ngl 99
Performance (Honor Magic 8 Pro Air, cool device)
| Metric | Value |
|---|---|
| GPU layers offloaded | 27/27 (full offload) |
| Audio encoding time | ~18s |
| Prompt eval time | ~25s (5.1 tok/s) |
| Token generation | ~1s (20 tok/s) |
| Total ASR response | ~34โ36s |
| Thermal throttle onset | ~3 consecutive requests |
Known Android Limitations
- Thermal throttling โ sustained inference causes slowdown after ~2-3 requests. Allow 10-15 min cooldown between heavy sessions.
- Phantom Process Killer โ Android 12+ kills child processes after ~60s.
Fix: run llama-mtmd-cli via
nohupshell script withstart_new_session=True
sh_file = tempfile.mktemp(suffix='.sh')
with open(sh_file, 'w') as f:
f.write(f'#!/bin/sh\n{cmd} > {out_file} 2>&1\n')
os.chmod(sh_file, 0o755)
proc = subprocess.Popen(['nohup', sh_file], start_new_session=True,
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
- HarmonyOS adb incompatibility โ Honor devices use HDB protocol;
standard
adb pairwireless pairing fails. Use USB file transfer for APK installation. - Reply generation via silence.wav โ second llama-mtmd-cli call for response generation returns empty output in current builds. Use llama-server separately for text generation, or implement keyword-based fallback.
Validated Singlish Transcription Samples
| Input phrase | Transcript output |
|---|---|
| "eh MINA good morning lah" | "Hey, Mina, good morning, lah." |
| "walao I very stressed cannot take it anymore" | "wow I very stressed. Can't take it anymore. lah." |
| "I feel very lonely nobody wants to talk to me sia" | "I feel very lonely. Nobody wants to talk to me, sia." |
| "my dog just passed away I very sad" | "err, mina, my dog just passed away. I'm very sad." |
| "good afternoon mina wah today very hot sia" | "good afternoon, mina wah today very hot sia." |
Paralinguistic Capabilities
MERaLiON-2-3B detects paralinguistic features from voice alongside transcription:
- Emotion: happy, calm, stressed, anxious, sad, angry, fearful
- Speech signals: fast speech, hesitations, voice tremors
- Distress indicators: risk level (none/low/medium/high)
Particularly suitable for:
- Mental health helpline operator support (MOH-T National Mental Helpline 1771)
- Scam call detection and voice risk scoring
- Healthcare consultation transcription with emotional context
- Elderly companionship and digital inclusion applications
Relationship to Official Release
This repository is an independent community contribution extending the official MERaLiON/MERaLiON-2-3B-GGUF release. All model weights are derived from the original MERaLiON/MERaLiON-2-3B safetensors published by A*STAR IยฒR. The decoder GGUF and fp16 mmproj files in this repository are functionally identical to those in the official release โ only the Q4_K_M mmproj is new.
Users should refer to the official repository for the most up-to-date releases and A*STAR-supported files.
Acknowledgements
- MERaLiON model: A*STAR Institute for Infocomm Research (IยฒR)
- Official GGUF release: MERaLiON/MERaLiON-2-3B-GGUF
- llama.cpp MERaLiON support: PR #21756 (merged build b8762)
- Contributor: Mun Yew Loh, April 2026
This repository is subject to the MERaLiON Public License v3. All derivative works must comply with the original license terms.
Citation
If you use the quantized mmproj or Android deployment guide from this repository:
@misc{loh2026meralion-android,
title={MERaLiON-2-3B: Quantized mmproj and Android Deployment},
author={Loh, Mun Yew},
year={2026},
url={https://huggingface.co/munyew/MERaLiON-2-3B-GGUF},
note={Quantized mmproj and Android deployment guide for MERaLiON/MERaLiON-2-3B-GGUF}
}
Please also cite the original MERaLiON papers listed at MERaLiON/MERaLiON-2-3B-GGUF.
Android Deployment with Watchdog
The Phantom Process Killer in Android 12+ kills llama-server and bridge.py
aggressively, requiring constant manual restarts.
Use start_mina.sh (available in munyew/roar-companion)
for an automatic restart watchdog that keeps both processes alive all day:
chmod +x ~/start_mina.sh
~/start_mina.sh
The watchdog script:
- Kills any stale
llama-server/bridge.pyprocesses on launch - Acquires a
termux-wake-lockto prevent CPU sleep - Starts
llama-serverwith full Vulkan offload (-ngl 99) - Waits 20 s for model load, then starts
bridge.pyvianohup - Enters an infinite loop โ checks every 10 s and auto-restarts either process if Android kills it
- Logs every restart with a timestamp
Tested on Honor Magic 8 Pro Air (Snapdragon 8 Elite, 16 GB RAM).
The MINA Android companion app detects bridge loss automatically and shows
Reconnecting... โ โ Ready without any user action.
- Downloads last month
- 414
4-bit
8-bit