Mistral-Small-3.1-DRAFT-0.5B-MLX-4bit

A 4-bit quantized MLX conversion of alamios/Mistral-Small-3.1-DRAFT-0.5B, optimized for speculative decoding on Apple Silicon.

Purpose

This is a draft model (0.5B parameters) designed for speculative decoding with larger Mistral-family target models:

  • Mistral Small 3.1 24B
  • Devstral Small 2 24B (mistralai/Devstral-Small-2-24B-Instruct-2512)
  • Any Mistral model sharing the Tekken tokenizer (131K vocab)
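
A draft model is only useful if it and the target map every token string to the same ID. The sketch below shows one way to check that before pairing two models. It assumes you can obtain a token-to-ID dictionary for each tokenizer (for Hugging Face style tokenizers this is typically what get_vocab() returns). The helper name vocab_mismatches and the toy vocabularies are hypothetical and only illustrate the comparison.

```python
def vocab_mismatches(target_vocab, draft_vocab, limit=5):
    """Return up to `limit` (token, target_id, draft_id) entries that disagree.

    A missing token is reported with None as its ID.
    """
    mismatches = []
    for token in sorted(set(target_vocab) | set(draft_vocab)):
        target_id = target_vocab.get(token)
        draft_id = draft_vocab.get(token)
        if target_id != draft_id:
            mismatches.append((token, target_id, draft_id))
            if len(mismatches) >= limit:
                break
    return mismatches


# Toy vocabularies standing in for real tokenizer vocabularies (hypothetical data).
target = {"<s>": 0, "hello": 1, "world": 2}
same = {"<s>": 0, "hello": 1, "world": 2}
shifted = {"<s>": 0, "hello": 2, "world": 1}

print(vocab_mismatches(target, same))     # [] means the mappings agree
print(vocab_mismatches(target, shifted))  # lists the tokens whose IDs differ
```

An empty result means the two vocabularies agree. Any mismatch suggests the pair is not suitable for speculative decoding.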

Speculative decoding uses this small, fast model to propose candidate tokens, which the larger target model then verifies in a single forward pass. Because every emitted token is checked against the target model, the output distribution matches standard decoding: you get the same quality, just faster.
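
The propose-then-verify loop can be sketched with toy models. This is a simplified greedy variant, not the mlx-lm implementation: the "models" are plain Python functions that map a token list to the next token, and the function names are hypothetical. A real target model scores all draft positions in one batched forward pass, which the inner loop here only imitates. Implementations that sample instead of taking the greedy token need an extra acceptance rule, which this sketch omits.

```python
def greedy_decode(next_token, prompt, max_tokens):
    """Plain autoregressive greedy decoding: one model call per token."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(target_next, draft_next, prompt, max_tokens,
                              num_draft_tokens=3):
    """Greedy speculative decoding; returns (new_tokens, target_passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_tokens:
        remaining = max_tokens - (len(tokens) - len(prompt))
        # Leave room for the one token the target always contributes.
        k = min(num_draft_tokens, remaining - 1)

        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target checks every draft position (one batched pass in practice).
        target_passes += 1
        new_tokens = []
        for i in range(k + 1):
            target_token = target_next(tokens + draft[:i])
            new_tokens.append(target_token)
            # Stop at the first disagreement; the target's token replaces the draft's.
            if i < k and target_token != draft[i]:
                break
        tokens.extend(new_tokens)
    return tokens[len(prompt):], target_passes


def target_next(tokens):
    """Toy target model: a deterministic rule on the last token."""
    return (3 * tokens[-1] + 1) % 7


def draft_next(tokens):
    """Toy draft model: agrees with the target except when the answer is 5."""
    guess = (3 * tokens[-1] + 1) % 7
    return 0 if guess == 5 else guess


prompt = [1]
baseline = greedy_decode(target_next, prompt, max_tokens=12)
speculative, passes = speculative_greedy_decode(target_next, draft_next, prompt, max_tokens=12)
print(speculative == baseline)  # the two decoders emit the same tokens
print(passes, "target passes for", len(speculative), "tokens")
```

Every emitted token is the target's own greedy choice, so the draft model affects only how many target passes are needed, not what is generated.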

Key Specs

Property        Value
Parameters      0.5B
Architecture    Qwen2ForCausalLM (backbone)
Tokenizer       Mistral Tekken (131,072 tokens)
Quantization    4-bit (affine, group_size=64)
Disk Size       ~335 MB
Memory Usage    ~300 MB
Context Length  32,768 tokens

Usage with mlx-lm

Speculative Decoding (Python API)

import mlx.core as mx
from mlx_lm import load
from mlx_lm.generate import speculative_generate_step

# Load target model (24B)
model, tokenizer = load("mlx-community/mistralai_Devstral-Small-2-24B-Instruct-2512-MLX-4Bit")

# Load draft model (0.5B)
draft_model, _ = load("badmadrad/Mistral-Small-3.1-DRAFT-0.5B-MLX-4bit")

# Tokenize prompt
prompt = mx.array(tokenizer.encode("Write a Python hello world"))

# Generate with speculative decoding
for token, logprobs, from_draft in speculative_generate_step(
    prompt, model, draft_model,
    num_draft_tokens=3,
    max_tokens=256,
):
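    # Depending on the mlx-lm version, token may be a plain int or a scalar mx.array
    token_id = token if isinstance(token, int) else token.item()
    print(tokenizer.decode([token_id]), end="", flush=True)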
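
The from_draft flag in each yielded tuple makes it possible to see how much work the draft model is actually doing. The helper below is a minimal sketch that counts the fraction of emitted tokens that came from the draft. The name draft_token_fraction is hypothetical, and the example feeds it hand-written tuples rather than real model output.

```python
def draft_token_fraction(steps):
    """Fraction of emitted tokens flagged as coming from the draft model.

    `steps` is any iterable of (token, logprobs, from_draft) tuples.
    """
    total = 0
    from_draft_count = 0
    for _token, _logprobs, from_draft in steps:
        total += 1
        from_draft_count += bool(from_draft)
    return from_draft_count / total if total else 0.0


# Hand-written stand-in for generator output (hypothetical data).
fake_steps = [(11, None, True), (12, None, True), (13, None, False), (14, None, True)]
print(draft_token_fraction(fake_steps))  # 0.75
```

A value close to num_draft_tokens / (num_draft_tokens + 1) means nearly every proposal is accepted. A value near zero suggests the draft model is a poor match for the prompt or the target.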

With AppleLM

# Set as draft model
alm config set draft_model badmadrad/Mistral-Small-3.1-DRAFT-0.5B-MLX-4bit
alm config set num_draft_tokens 3

# Restart server to load draft model
alm restart

Performance

Expected speedup with speculative decoding depends on acceptance rate (how often the draft model's predictions match the target):

Scenario                     Expected Speedup
Code generation              1.5-2.5x
Natural language             2-3x
Repetitive/predictable text  2.5-3.5x

The draft model adds only about 300 MB of memory on top of the roughly 14 GB target model.
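
The link between acceptance rate and speedup can be estimated with a standard back-of-the-envelope model. It assumes each draft token is accepted independently with the same probability, which real text only approximates. The draft_cost_ratio value (cost of one draft step relative to one target step) is a made-up illustration, not a measurement of these models.

```python
def expected_tokens_per_pass(acceptance_rate, num_draft_tokens):
    """Expected tokens emitted per target forward pass.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`; this is the geometric sum 1 + a + a^2 + ... + a^k.
    """
    if acceptance_rate >= 1.0:
        return float(num_draft_tokens + 1)
    return (1 - acceptance_rate ** (num_draft_tokens + 1)) / (1 - acceptance_rate)


def estimated_speedup(acceptance_rate, num_draft_tokens, draft_cost_ratio):
    """Tokens per pass divided by the relative cost of one draft-plus-verify round."""
    round_cost = num_draft_tokens * draft_cost_ratio + 1
    return expected_tokens_per_pass(acceptance_rate, num_draft_tokens) / round_cost


# draft_cost_ratio=0.05 is a hypothetical value chosen for illustration only.
for rate in (0.5, 0.7, 0.9):
    speedup = estimated_speedup(rate, num_draft_tokens=3, draft_cost_ratio=0.05)
    print(f"acceptance {rate:.1f}: ~{speedup:.2f}x")
```

Under this model a low acceptance rate can push the estimate below 1x, because the draft steps then cost more than the tokens they save.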

Conversion Details

  • Source: alamios/Mistral-Small-3.1-DRAFT-0.5B
  • Converted with: mlx-lm v0.30.6
  • Quantization: 4-bit affine (4.501 bits/weight effective)
  • Command: mlx_lm.convert --hf-path alamios/Mistral-Small-3.1-DRAFT-0.5B --mlx-path . -q --q-bits 4
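
The effective bits-per-weight figure can be approximated with simple arithmetic. The sketch assumes that affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights. The small gap between the result and the reported 4.501 is presumably due to tensors that are left unquantized, which is an assumption rather than something stated in the conversion log.

```python
def effective_bits_per_weight(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Bits per weight including per-group scale and bias overhead.

    Assumes one scale and one bias are stored for every `group_size` weights.
    """
    return bits + (scale_bits + bias_bits) / group_size


print(effective_bits_per_weight())  # 4.5
```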

License

Apache 2.0, the same as the base model.
