Mistral-Small-3.1-DRAFT-0.5B-MLX-4bit

A 4-bit quantized MLX conversion of alamios/Mistral-Small-3.1-DRAFT-0.5B, optimized for speculative decoding on Apple Silicon.

Purpose

This is a draft model (0.5B parameters) designed for speculative decoding with larger Mistral-family target models:

Mistral Small 3.1 24B
Devstral Small 2 24B (mistralai/Devstral-Small-2-24B-Instruct-2512)
Any Mistral model sharing the Tekken tokenizer (131K vocab)

Speculative decoding uses this small, fast model to propose candidate tokens, which the larger target model verifies in a single forward pass. The output distribution is mathematically identical to standard decoding — you get the same quality, just faster.
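The propose-and-verify loop described above can be illustrated with a toy sketch of the greedy case. This is not mlx-lm's implementation: the "models" are hypothetical functions that map a token context to a next token, and the target is called once per position, where a real implementation would score all proposed positions in one batched forward pass. With sampling instead of greedy decoding, implementations generally need an acceptance rule that preserves the target distribution, which this sketch does not cover.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_tokens: int) -> List[int]:
    """Standard decoding: one model call per generated token."""
    context = list(prompt)
    for _ in range(max_tokens):
        context.append(model(context))
    return context[len(prompt):]


def speculative_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_tokens: int,
    num_draft_tokens: int = 3,
) -> List[int]:
    """Greedy speculative decoding; returns the same tokens as greedy_decode(target, ...)."""
    context = list(prompt)
    while len(context) - len(prompt) < max_tokens:
        # 1. The draft model proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(context + proposal))

        # 2. The target checks each proposed position and keeps the matching prefix.
        accepted: List[int] = []
        for token in proposal:
            if target(context + accepted) != token:
                break
            accepted.append(token)

        # 3. The target always contributes one token of its own: a correction at
        #    the first mismatch, or a bonus token if every proposal was accepted.
        accepted.append(target(context + accepted))
        context.extend(accepted)

    # The last round may overshoot max_tokens, so trim.
    return context[len(prompt):len(prompt) + max_tokens]


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (3 * context[-1] + 1) % 7


def toy_draft(context: List[int]) -> int:
    """Hypothetical draft: agrees with the target unless the last token is divisible by 3."""
    return toy_target(context) if context[-1] % 3 else 0


reference = greedy_decode(toy_target, [2], 20)
speculative = speculative_greedy_decode(toy_target, toy_draft, [2], 20)
print(reference == speculative)
```

Because every emitted token is either confirmed or produced by the target, a weak draft model only costs speed, never output quality.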

Key Specs

Property	Value
Parameters	0.5B
Architecture	Qwen2ForCausalLM (backbone)
Tokenizer	Mistral Tekken (131,072 tokens)
Quantization	4-bit (affine, group_size=64)
Disk Size	~335 MB
Memory Usage	~300 MB
Context Length	32,768 tokens

Usage with mlx-lm

Speculative Decoding (Python API)

import mlx.core as mx
from mlx_lm import load
from mlx_lm.generate import speculative_generate_step

# Load target model (24B)
model, tokenizer = load("mlx-community/mistralai_Devstral-Small-2-24B-Instruct-2512-MLX-4Bit")

# Load draft model (0.5B)
draft_model, _ = load("badmadrad/Mistral-Small-3.1-DRAFT-0.5B-MLX-4bit")

# Tokenize prompt
prompt = mx.array(tokenizer.encode("Write a Python hello world"))

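# Generate with speculative decoding.
# from_draft is True when the token was proposed by the draft model and accepted.
for token, logprobs, from_draft in speculative_generate_step(
    prompt, model, draft_model,
    num_draft_tokens=3,
    max_tokens=256,
):
    # Depending on the mlx-lm version, token may be a Python int or an mx.array scalar.
    token_id = token if isinstance(token, int) else token.item()
    print(tokenizer.decode([token_id]), end="", flush=True)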
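The from_draft flag yielded by the loop above can be tallied to see how much of the output came from the draft model. The sketch below uses a hypothetical helper, draft_token_fraction, which is not part of mlx-lm, and runs on stand-in tuples so it works without loading any model.

```python
from typing import Any, Iterable, Tuple


def draft_token_fraction(steps: Iterable[Tuple[Any, Any, bool]]) -> float:
    """Fraction of generated tokens that were accepted draft proposals.

    `steps` is any iterable of (token, logprobs, from_draft) tuples, the
    shape yielded by the generation loop above.
    """
    total = 0
    from_draft_count = 0
    for _token, _logprobs, from_draft in steps:
        total += 1
        from_draft_count += bool(from_draft)
    return from_draft_count / total if total else 0.0


# Stand-in data (not real model output): three of four tokens came from the draft.
fake_steps = [(11, None, True), (12, None, True), (13, None, False), (14, None, True)]
print(f"{draft_token_fraction(fake_steps):.0%} of tokens came from the draft model")
```

In real use the generator itself would be passed in place of fake_steps. A low fraction on your workload suggests the draft model is a poor match for the target or the prompt style.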

With AppleLM

# Set as draft model
alm config set draft_model badmadrad/Mistral-Small-3.1-DRAFT-0.5B-MLX-4bit
alm config set num_draft_tokens 3

# Restart server to load draft model
alm restart

Performance

Expected speedup with speculative decoding depends on acceptance rate (how often the draft model's predictions match the target):

Scenario	Expected Speedup
Code generation	1.5-2.5x
Natural language	2-3x
Repetitive/predictable text	2.5-3.5x

The draft model adds roughly 300 MB of memory, which is small next to the 14 GB needed by the 4-bit 24B target model.
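The dependence of speedup on acceptance rate can be estimated with the usual back-of-the-envelope analysis of speculative decoding. The sketch below assumes each draft token is accepted independently with probability alpha, that verifying a batch of draft tokens costs about the same as one ordinary target pass, and that one draft pass costs a fraction draft_cost of a target pass. The alpha and draft_cost values are illustrative assumptions, not measurements of this model.

```python
def expected_tokens_per_target_pass(alpha: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target verification pass.

    Sums the geometric series 1 + alpha + ... + alpha**num_draft_tokens.
    """
    if alpha >= 1.0:
        return float(num_draft_tokens + 1)
    return (1.0 - alpha ** (num_draft_tokens + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, num_draft_tokens: int, draft_cost: float) -> float:
    """Tokens per round divided by the cost of a round, in units of one target pass."""
    tokens = expected_tokens_per_target_pass(alpha, num_draft_tokens)
    cost = 1.0 + num_draft_tokens * draft_cost  # one target pass plus the draft passes
    return tokens / cost


# Illustrative values only: draft_cost=0.05 is an assumed cost ratio.
for alpha in (0.5, 0.7, 0.9):
    speedup = estimated_speedup(alpha, num_draft_tokens=3, draft_cost=0.05)
    print(f"alpha={alpha}: estimated speedup {speedup:.2f}x")
```

When alpha is near zero the estimate drops below 1x, since the draft passes are pure overhead. This is why a draft model that shares the target's tokenizer and training distribution matters.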

Conversion Details

Source: alamios/Mistral-Small-3.1-DRAFT-0.5B
Converted with: mlx-lm v0.30.6
Quantization: 4-bit affine (4.501 bits/weight effective)
Command: mlx_lm.convert --hf-path alamios/Mistral-Small-3.1-DRAFT-0.5B --mlx-path . -q --q-bits 4
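The effective bits-per-weight figure is close to what the quantization settings imply. The sketch below assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside its 4-bit values; that storage layout is an assumption, not something stated above. The small remainder up to 4.501 presumably comes from tensors that are left unquantized.

```python
def affine_bits_per_weight(
    q_bits: int,
    group_size: int,
    scale_bits: int = 16,  # assumed width of the per-group scale
    bias_bits: int = 16,   # assumed width of the per-group bias
) -> float:
    """Average storage per weight for group-wise affine quantization."""
    return q_bits + (scale_bits + bias_bits) / group_size


print(affine_bits_per_weight(q_bits=4, group_size=64))  # 4.5
```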

License

Apache 2.0 — same as the base model.

Model tree

Base model: Qwen/Qwen2.5-0.5B
Finetuned: alamios/Qwenstral-Small-3.1-0.5B
Finetuned: alamios/Mistral-Small-3.1-DRAFT-0.5B
Quantized: badmadrad/Mistral-Small-3.1-DRAFT-0.5B-MLX-4bit (this model)