MERaLiON-2-3B GGUF: Quantized mmproj + First Validated Android Deployment

This repository contains GGUF-format files for MERaLiON-2-3B, Singapore's national multimodal speech-language model developed by A*STAR I²R under the National Multimodal Language Programme (NMLP).

Relationship to official GGUF release: The official A*STAR team published GGUF files at MERaLiON/MERaLiON-2-3B-GGUF on 18 April 2026, including decoder files and an fp16 mmproj. This repository extends that work with:

  1. A quantized mmproj (Q4_K_M, 533MB): reduces the audio encoder footprint by 68% vs the official fp16 (1.7GB)
  2. Validated on-device Android deployment: confirmed working on Honor Magic 8 Pro Air via Termux + llama.cpp Vulkan
  3. Detailed Android deployment guide: bridge architecture, Phantom Process Killer workarounds, thermal management

Converted and validated by Mun Yew Loh, 22 April 2026.


What's New vs Official Release

| File | Official MERaLiON/MERaLiON-2-3B-GGUF | This repo |
|---|---|---|
| Decoder Q4_K_M | ✅ 1.6GB | ✅ 1.6GB (identical) |
| Decoder Q8_0 | ✅ 2.6GB | ❌ not included |
| mmproj F16 | ✅ 1.7GB | ✅ 1.7GB (identical) |
| mmproj Q4_K_M | ❌ not available | ✅ 533MB (NEW) |
| Android deployment guide | ❌ not documented | ✅ fully documented |
| Mobile inference validation | ❌ not validated | ✅ Honor Magic 8 Pro Air |

Files

| File | Size | Description |
|---|---|---|
| meralion-2-3b-decoder-q4_k_m.gguf | 1.6 GB | Text decoder (Gemma-2-2B), Q4_K_M; identical to official release |
| meralion-2-3b-mmproj-q4_k_m.gguf | 533 MB | Audio encoder + adapter, Q4_K_M; first quantized mmproj |
| meralion-2-3b-mmproj-f16.gguf | 1.7 GB | Audio encoder + adapter, original F16 precision; identical to official release |

Total footprint (q4 decoder + q4 mmproj): 2.1 GB, vs 3.3 GB with the fp16 mmproj.


Model Architecture

MERaLiON-2-3B combines:

  • Speech encoder: Localized Whisper-large-v3 (497 tensors, ~1.28B params)
  • Text decoder: Gemma-2-2B-IT (288 tensors, 2.61B params)
  • Audio adapter: Custom MERaLiON projector (meralion type) bridging encoder → decoder
  • Projector stack factor: 15 (audio compression ratio)
  • Audio sample rate: 16,000 Hz, mono
  • Mel bins: 128
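
Since the model expects 16,000 Hz mono audio, it can help to check an input file before passing it to inference. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and the constants are illustrative and not part of this repository.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate stated in the model card
EXPECTED_CHANNELS = 1      # mono


def check_wav(path):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS} (mono)")
    return problems
```

Calling `check_wav("input.wav")` before inference gives a readable list of mismatches; resampling itself is left to an external tool.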

Supports: ASR, Spoken QA, Paralinguistic Analysis (emotion/tone/stress), Code-switching (Singlish, Mandarin-English, Malay-English), multilingual SEA speech understanding.


mmproj Quantization Notes

The official release provides mmproj in fp16 only (1.7GB). This repository adds a Q4_K_M quantized mmproj (533MB). Key findings from the quantization process:

  • Q8_0 fails: conv1d tensors in the Whisper encoder have dimensions that are not divisible by 32 (the Q8_0 block size), causing the error llama_model_quantize: failed to quantize: no tensor type fallback is defined for type q8_0
  • Q4_K_M succeeds with 2 fallback tensors (minor precision impact on those 2 tensors)
  • Size reduction of roughly 68%: 1638.37 MiB → 532.90 MiB
  • Inference quality: transcription accuracy on Singlish test phrases was validated as equivalent to the fp16 mmproj

Quantization command:

llama-quantize meralion-2-3b-mmproj-f16.gguf \
               meralion-2-3b-mmproj-q4_k_m.gguf Q4_K_M
# Output: quant size = 532.90 MiB (5.22 BPW), 2 fallback tensors
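
The size figures above can be cross-checked with a little arithmetic. This sketch computes the reduction from the two reported file sizes, then again from the reported bits per weight (5.22 BPW against 16 BPW for F16). The helper names are illustrative, and the BPW comparison ignores file metadata, so the two results should agree only approximately (both come out at roughly 67%).

```python
def size_reduction_pct(before_mib, after_mib):
    """Percentage by which the file shrank, from the two sizes in MiB."""
    return 100.0 * (before_mib - after_mib) / before_mib


def reduction_from_bpw(before_bpw, after_bpw):
    """Same percentage, estimated from average bits per weight."""
    return 100.0 * (1.0 - after_bpw / before_bpw)


# Figures taken from the llama-quantize output quoted above.
by_size = size_reduction_pct(1638.37, 532.90)
by_bpw = reduction_from_bpw(16.0, 5.22)
print(f"reduction from file sizes: {by_size:.1f}%")
print(f"reduction from BPW:        {by_bpw:.1f}%")
```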

Android Deployment

MERaLiON-2-3B has been validated running fully on-device on an Honor Magic 8 Pro Air via Termux + llama.cpp with Vulkan GPU acceleration. This is the first publicly documented Android deployment of MERaLiON-2-3B.

Hardware

Component Spec
Device Honor Magic 8 Pro Air (่ฃ่€€Magic8 Pro Air)
SoC Snapdragon 8 Elite Gen 5
RAM 16GB
GPU Mali-G1-Ultra MC12 (Vulkan 1.4.349)
GPU VRAM 15,405 MiB
OS HarmonyOS (Android-compatible)
Storage required ~2.1GB (q4 decoder + q4 mmproj)

Software Stack

  • Runtime: llama.cpp build b8762+ (MERaLiON support merged via PR #21756)
  • GPU backend: Vulkan (-ngl 99, 27/27 layers offloaded to Mali GPU)
  • Inference wrapper: Flask bridge on port 8081 (Python 3, Termux)
  • Frontend: Native Android APK + PWA fallback (Chrome)

Build Instructions (Android/Termux)

# Install dependencies
pkg install cmake clang python git vulkan-tools shaderc spirv-headers

# Clone and build with Vulkan
git clone https://github.com/ggml-org/llama.cpp --depth=1
cd llama.cpp
cmake -B build-vulkan -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j4

# Verify GPU is visible
./build-vulkan/bin/llama-mtmd-cli --list-devices
# Expected: Vulkan0: Mali-G1-Ultra MC12 (15405 MiB, 15405 MiB free)
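
A wrapper script may want to confirm that a Vulkan device is visible before loading the model. The sketch below parses a device line of the shape shown in the expected output above. The exact `--list-devices` output format is assumed from that single example and may differ between llama.cpp builds, and `parse_device_line` is a hypothetical helper.

```python
import re

# Pattern assumed from the example line "Vulkan0: Mali-G1-Ultra MC12 (15405 MiB, 15405 MiB free)".
DEVICE_LINE = re.compile(
    r"^\s*(?P<id>\w+):\s*(?P<name>.+?)\s*"
    r"\((?P<total>\d+) MiB, (?P<free>\d+) MiB free\)\s*$"
)


def parse_device_line(line):
    """Return device info as a dict, or None if the line does not match."""
    match = DEVICE_LINE.match(line)
    if match is None:
        return None
    return {
        "id": match["id"],
        "name": match["name"],
        "total_mib": int(match["total"]),
        "free_mib": int(match["free"]),
    }
```

Lines that do not match (headers, warnings) return `None`, so the caller can scan the command's output line by line and keep only the devices.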

Inference Command

llama-mtmd-cli \
  -m meralion-2-3b-decoder-q4_k_m.gguf \
  --mmproj meralion-2-3b-mmproj-q4_k_m.gguf \
  --audio input.wav \
  -p "Follow the text instruction based on the following audio: <__media__>\nTranscribe exactly what was said." \
  -n 80 -e -ngl 99
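
When the command is launched from Python (for example from the bridge), building it as an argument list avoids shell-quoting problems with the prompt. This is a sketch that uses only the flags shown above; `build_asr_command` and its parameter names are illustrative, not the bridge's actual code.

```python
import shlex

PROMPT_PREFIX = "Follow the text instruction based on the following audio: <__media__>"


def build_asr_command(decoder, mmproj, audio,
                      instruction="Transcribe exactly what was said.",
                      max_tokens=80, gpu_layers=99, binary="llama-mtmd-cli"):
    """Return the argv list for one llama-mtmd-cli transcription call."""
    # A literal backslash-n is passed through; the -e flag asks llama.cpp to expand it.
    prompt = PROMPT_PREFIX + r"\n" + instruction
    return [
        binary,
        "-m", decoder,
        "--mmproj", mmproj,
        "--audio", audio,
        "-p", prompt,
        "-n", str(max_tokens),
        "-e",
        "-ngl", str(gpu_layers),
    ]


cmd = build_asr_command(
    "meralion-2-3b-decoder-q4_k_m.gguf",
    "meralion-2-3b-mmproj-q4_k_m.gguf",
    "input.wav",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The list can be passed directly to `subprocess.run(cmd)`, with no shell involved.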

Performance (Honor Magic 8 Pro Air, cool device)

| Metric | Value |
|---|---|
| GPU layers offloaded | 27/27 (full offload) |
| Audio encoding time | ~18s |
| Prompt eval time | ~25s (5.1 tok/s) |
| Token generation | ~1s (20 tok/s) |
| Total ASR response | ~34–36s |
| Thermal throttle onset | ~3 consecutive requests |

Known Android Limitations

  1. Thermal throttling: sustained inference causes slowdown after ~2-3 requests. Allow a 10-15 min cooldown between heavy sessions.
  2. Phantom Process Killer: Android 12+ kills child processes after ~60s. Fix: run llama-mtmd-cli via a nohup shell script launched with start_new_session=True:
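   import os, subprocess, tempfile
   # cmd: the llama-mtmd-cli command line; out_file: path that receives its output
   fd, sh_file = tempfile.mkstemp(suffix='.sh')  # mkstemp avoids the mktemp race
   with os.fdopen(fd, 'w') as f:
       f.write(f'#!/bin/sh\n{cmd} > {out_file} 2>&1\n')
   os.chmod(sh_file, 0o755)
   # A new session detaches the child from the bridge's process group
   proc = subprocess.Popen(['nohup', sh_file], start_new_session=True,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)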
  3. HarmonyOS adb incompatibility: Honor devices use the HDB protocol, so standard adb pair wireless pairing fails. Use USB file transfer for APK installation.
  4. Reply generation via silence.wav: a second llama-mtmd-cli call for response generation returns empty output in current builds. Use llama-server separately for text generation, or implement a keyword-based fallback.
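
One way to act on the thermal limitation is to have the bridge refuse new requests once a burst has been served, until the device has rested. The sketch below is a hypothetical gate, not part of the published bridge. The defaults of 3 requests and 600 s come from the figures above, while `gap_s` (how close together requests must be to count as consecutive) is an assumed value.

```python
import time


class CooldownGate:
    """Allow up to max_burst consecutive requests, then require a cooldown."""

    def __init__(self, max_burst=3, gap_s=120.0, cooldown_s=600.0, clock=time.monotonic):
        self.max_burst = max_burst
        self.gap_s = gap_s            # assumed: longer idle gaps reset the burst count
        self.cooldown_s = cooldown_s  # 10 min, the low end of the suggested cooldown
        self.clock = clock
        self.burst = 0
        self.last = None              # time of the last accepted request

    def try_acquire(self):
        """Return True if a request may run now, False if the device should rest."""
        now = self.clock()
        idle = None if self.last is None else now - self.last
        if self.burst >= self.max_burst:
            if idle < self.cooldown_s:
                return False          # still cooling down
            self.burst = 0
        elif idle is not None and idle > self.gap_s:
            self.burst = 0            # long pause: not a consecutive request
        self.burst += 1
        self.last = now
        return True
```

The clock is injectable so the behaviour can be tested without waiting; a bridge would call `try_acquire()` before each inference and return a "cooling down" response when it yields `False`.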

Validated Singlish Transcription Samples

| Input phrase | Transcript output |
|---|---|
| "eh MINA good morning lah" | "Hey, Mina, good morning, lah." |
| "walao I very stressed cannot take it anymore" | "wow I very stressed. Can't take it anymore. lah." |
| "I feel very lonely nobody wants to talk to me sia" | "I feel very lonely. Nobody wants to talk to me, sia." |
| "my dog just passed away I very sad" | "err, mina, my dog just passed away. I'm very sad." |
| "good afternoon mina wah today very hot sia" | "good afternoon, mina wah today very hot sia." |

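Pairs like those in the samples table can be scored with word error rate (WER), a common ASR metric. The model card does not say which metric was used for validation, so this is an illustrative sketch only; the normalisation (lowercasing, dropping punctuation) is an assumption, and `word_error_rate` is a hypothetical helper.

```python
import re


def words(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("eh MINA good morning lah", "Hey, Mina, good morning, lah."))
```

With this normalisation the first sample has one substitution ("eh" to "hey") over five reference words, while the third sample matches exactly.
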
Paralinguistic Capabilities

MERaLiON-2-3B detects paralinguistic features from voice alongside transcription:

  • Emotion: happy, calm, stressed, anxious, sad, angry, fearful
  • Speech signals: fast speech, hesitations, voice tremors
  • Distress indicators: risk level (none/low/medium/high)

Particularly suitable for:

  • Mental health helpline operator support (MOH-T National Mental Helpline 1771)
  • Scam call detection and voice risk scoring
  • Healthcare consultation transcription with emotional context
  • Elderly companionship and digital inclusion applications

Relationship to Official Release

This repository is an independent community contribution extending the official MERaLiON/MERaLiON-2-3B-GGUF release. All model weights are derived from the original MERaLiON/MERaLiON-2-3B safetensors published by A*STAR I²R. The decoder GGUF and fp16 mmproj files in this repository are functionally identical to those in the official release; only the Q4_K_M mmproj is new.

Users should refer to the official repository for the most up-to-date releases and A*STAR-supported files.


Acknowledgements

  • MERaLiON model: A*STAR Institute for Infocomm Research (I²R)
  • Official GGUF release: MERaLiON/MERaLiON-2-3B-GGUF
  • llama.cpp MERaLiON support: PR #21756 (merged in build b8762)
  • Contributor: Mun Yew Loh, April 2026

This repository is subject to the MERaLiON Public License v3. All derivative works must comply with the original license terms.


Citation

If you use the quantized mmproj or Android deployment guide from this repository:

@misc{loh2026meralion-android,
  title={MERaLiON-2-3B: Quantized mmproj and Android Deployment},
  author={Loh, Mun Yew},
  year={2026},
  url={https://huggingface.co/munyew/MERaLiON-2-3B-GGUF},
  note={Quantized mmproj and Android deployment guide for MERaLiON/MERaLiON-2-3B-GGUF}
}

Please also cite the original MERaLiON papers listed at MERaLiON/MERaLiON-2-3B-GGUF.


Android Deployment with Watchdog

The Phantom Process Killer in Android 12+ kills llama-server and bridge.py aggressively, requiring constant manual restarts.

Use start_mina.sh (available in munyew/roar-companion) for an automatic restart watchdog that keeps both processes alive all day:

chmod +x ~/start_mina.sh
~/start_mina.sh

The watchdog script:

  • Kills any stale llama-server / bridge.py processes on launch
  • Acquires a termux-wake-lock to prevent CPU sleep
  • Starts llama-server with full Vulkan offload (-ngl 99)
  • Waits 20 s for model load, then starts bridge.py via nohup
  • Enters an infinite loop: checks every 10 s and auto-restarts either process if Android kills it
  • Logs every restart with a timestamp
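
The check-and-restart loop described above can be sketched in Python as follows. This is not the contents of start_mina.sh (a shell script that is not reproduced here); it covers only the polling loop and leaves out the wake lock, stale-process cleanup, and the 20 s model-load wait. `supervise` and its parameters are hypothetical names.

```python
import subprocess
import time
from datetime import datetime


def supervise(commands, check_interval_s=10.0, max_checks=None, log=print):
    """Keep each command running; return how often each one was restarted.

    commands maps a label to an argv list. max_checks=None loops forever.
    """
    procs = {name: subprocess.Popen(argv) for name, argv in commands.items()}
    restarts = {name: 0 for name in commands}
    checks = 0
    while max_checks is None or checks < max_checks:
        time.sleep(check_interval_s)
        for name, argv in commands.items():
            if procs[name].poll() is not None:  # the process has exited
                restarts[name] += 1
                stamp = datetime.now().isoformat(timespec="seconds")
                log(f"{stamp} restarting {name}")
                procs[name] = subprocess.Popen(argv)
        checks += 1
    # Only reached when max_checks is set: stop whatever is still running.
    for proc in procs.values():
        if proc.poll() is None:
            proc.terminate()
        proc.wait()
    return restarts
```

A call would pass the real launch commands, for example `supervise({"llama-server": [...], "bridge": ["python", "bridge.py"]})`, with the argument lists filled in by the user.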

Tested on Honor Magic 8 Pro Air (Snapdragon 8 Elite, 16 GB RAM).
The MINA Android companion app detects bridge loss automatically and shows Reconnecting... → ● Ready without any user action.
