GPT-OSS-Swallow-20B-RL-v0.1 — MLX 6-bit

This is a 6-bit quantized MLX version of tokyotech-llm/GPT-OSS-Swallow-20B-RL-v0.1, optimized for Apple Silicon.

GPT-OSS-Swallow is a Japanese-enhanced reasoning LLM built on OpenAI's GPT-OSS-20B through continual pre-training, supervised fine-tuning, and reinforcement learning. It was developed by the Okazaki Laboratory and Yokota Laboratory at the Institute of Science Tokyo, together with AIST.

Key Details

  • Architecture: gpt_oss (Mixture of Experts — 21B total, 3.6B active)
  • Quantization: 6-bit (6.503 bits/weight)
  • Disk size: ~17 GB
  • Peak memory: ~17 GB
  • Converted with: mlx-lm 0.31.0
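The 6.503 bits/weight figure is higher than the nominal 6 bits because group-wise quantization stores extra metadata for every group of weights. The sketch below reproduces most of that overhead. It assumes MLX-style affine quantization with a group size of 64 and one 16-bit scale plus one 16-bit bias per group; these settings are assumptions rather than values stated in this card, and `effective_bits` is a hypothetical helper written for illustration.

```python
def effective_bits(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Average storage per weight for group-wise affine quantization.

    Each group of `group_size` weights shares one scale and one bias,
    so their cost is spread across the group. All defaults are assumed.
    """
    return bits + (scale_bits + bias_bits) / group_size


print(effective_bits(6))  # 6.5
print(effective_bits(8))  # 8.5
```

The remaining ~0.003 bits/weight would plausibly come from tensors kept at higher precision, though that is not confirmed here.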

Note: This conversion required a custom patch to mlx-lm's gpt_oss model definition to handle the bf16 weight format used by the Swallow fine-tuned variant (the original OpenAI model uses MXFP4). The patch adds transpose and interleaved split handling for gate_up_proj / down_proj expert weights. See the Conversion Notes section below.

Why no 4-bit variant?

GPT-OSS uses a Mixture of Experts (MoE) architecture where expert routing is sensitive to quantization. In our testing, 4-bit quantization (with group sizes of both 64 and 32) caused the model's analysis channel to loop indefinitely on certain prompts. 6-bit is the lowest bit width at which we observed stable reasoning behavior.

Variants

| Variant | Bits/weight | Disk size | Repo |
|---------|-------------|-----------|------|
| 6-bit | 6.503 | ~17 GB | this repo |
| 8-bit | 8.503 | ~22 GB | tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-8bit-mlx |
| fp16 | 16 | ~40 GB | tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-fp16-mlx |

Usage

CLI

pip install mlx-lm

# Prompt: "What is the capital of Japan?"
mlx_lm.generate \
  --model tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx \
  --prompt "日本の首都はどこですか?" \
  --max-tokens 200 \
  --trust-remote-code

mlx_lm.chat \
  --model tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx \
  --trust-remote-code

Python API

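from mlx_lm import load, generate

# Downloads the weights from the Hugging Face Hub on first use.
model, tokenizer = load("tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx")

# "Please write Python code that prints the Fibonacci sequence."
prompt = "Pythonでフィボナッチ数列を出力するコードを書いてください"

# Wrap the prompt in the model's chat template when one is available.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# verbose=True prints the generated text as it is produced.
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=500)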

Model Details

  • Base model: openai/gpt-oss-20b
  • Fine-tuned by: tokyotech-llm (Institute of Science Tokyo + AIST)
  • Training: continual pre-training (CPT, 419B tokens) → supervised fine-tuning (SFT, 1.1M samples) → reinforcement learning with verifiable rewards (RLVR)
  • Harmony format: The model uses OpenAI's harmony response format with analysis/final channels
  • Reasoning effort: Configurable via system prompt ("Reasoning: low/medium/high")
  • Recommended generation parameters: Temperature=0.6, TopP=0.95, TopK=20, MinP=0
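The reasoning-effort setting, the recommended sampling parameters, and the harmony channels listed above can be combined as in the following sketch. It assumes that a plain system message of the form "Reasoning: <level>" is enough to set the effort, that `make_sampler` in `mlx_lm.sample_utils` accepts `temp`, `top_p`, `top_k`, and `min_p`, and that the raw output contains harmony markers such as `<|channel|>final<|message|>`. The names `build_messages`, `extract_final`, and `run` are hypothetical helpers, not part of mlx-lm.

```python
import re

REPO = "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx"

# Recommended generation parameters from this card.
SAMPLING = {"temp": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}


def build_messages(user_prompt, effort="medium"):
    """Chat messages with the reasoning effort set in the system prompt."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]


def extract_final(raw):
    """Return the text of the final channel, or None if it is absent.

    Assumes harmony markers appear verbatim in the decoded output.
    """
    match = re.search(
        r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|$)",
        raw,
        re.DOTALL,
    )
    return match.group(1) if match else None


def run(user_prompt, effort="medium", max_tokens=500):
    """Generate a reply; requires Apple Silicon with mlx-lm installed."""
    # Imported lazily so the helpers above work without mlx-lm.
    from mlx_lm import load, generate
    from mlx_lm.sample_utils import make_sampler

    model, tokenizer = load(REPO)
    prompt = tokenizer.apply_chat_template(
        build_messages(user_prompt, effort), add_generation_prompt=True
    )
    sampler = make_sampler(**SAMPLING)
    return generate(
        model, tokenizer, prompt=prompt, sampler=sampler, max_tokens=max_tokens
    )
```

Calling `extract_final(run("日本の首都はどこですか?", effort="high"))` would then return only the user-facing answer, provided the output follows the channel layout assumed here.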

Conversion Notes

The original GPT-OSS-20B stores MoE expert weights in MXFP4 format (gate_up_proj_blocks / gate_up_proj_scales). The Swallow variant was re-trained in bf16, producing standard gate_up_proj tensors with a different layout:

  • MXFP4 (original): [experts, out_features*2, ...] — split via interleave on second-to-last dim
  • bf16 (Swallow): [experts, in_features, out_features*2] — split via interleave on last dim, then transpose

The mlx-lm gpt_oss sanitize function was patched to detect bf16 weights (absence of _blocks/_scales keys) and apply the correct split + transpose. This patch is required for any GPT-OSS fine-tune that stores weights in bf16 HuggingFace format.
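The bf16 handling described above can be illustrated with a small NumPy sketch. This is not the actual patch: it only shows the detection rule and the "interleaved split on the last dim, then transpose" step on a toy tensor. Treating even columns as the gate projection and odd columns as the up projection follows the Hugging Face Transformers implementation and is an assumption here, as are the helper names `is_bf16_checkpoint` and `split_bf16_gate_up` and the [experts, out_features, in_features] target layout.

```python
import numpy as np


def is_bf16_checkpoint(keys):
    """True when no MXFP4 tensors (*_blocks / *_scales) are present."""
    return not any(k.endswith(("_blocks", "_scales")) for k in keys)


def split_bf16_gate_up(gate_up):
    """Split a fused [experts, in_features, out_features*2] tensor.

    Returns (gate, up), each shaped [experts, out_features, in_features].
    """
    gate = gate_up[..., ::2]   # even columns (assumed gate projection)
    up = gate_up[..., 1::2]    # odd columns (assumed up projection)
    # Swap the last two axes so out_features comes before in_features.
    gate = np.ascontiguousarray(np.swapaxes(gate, -1, -2))
    up = np.ascontiguousarray(np.swapaxes(up, -1, -2))
    return gate, up


# Toy example: 2 experts, in_features=3, out_features=4.
fused = np.arange(2 * 3 * 8).reshape(2, 3, 8)
gate, up = split_bf16_gate_up(fused)
print(gate.shape, up.shape)  # (2, 4, 3) (2, 4, 3)
```

A real sanitize function would apply the same split to every layer's expert weights and handle down_proj with its own transpose.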

License

Apache 2.0 (inherited from base model)

Citation

@misc{openai2025gptoss,
  title={gpt-oss-120b \& gpt-oss-20b Model Card},
  author={OpenAI},
  year={2025},
  eprint={2508.10925},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}