Instructions for using tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx with libraries and local apps.

Libraries

How to use tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx

Run Hermes

hermes

MLX LM

How to use tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
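The same request can be made from Python. The following is a minimal sketch using only the standard library; it assumes the server is running locally on port 8080 (the port used in the Pi and Hermes configurations above). The helper names `build_chat_request` and `chat`, and the `max_tokens` value, are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is reachable locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID, max_tokens=200):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Hello"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.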

GPT-OSS-Swallow-20B-RL-v0.1 — MLX 6-bit

This is a 6-bit quantized MLX version of tokyotech-llm/GPT-OSS-Swallow-20B-RL-v0.1, optimized for Apple Silicon.

GPT-OSS-Swallow is a Japanese-enhanced reasoning LLM built on OpenAI's GPT-OSS-20B through continual pre-training, supervised fine-tuning, and reinforcement learning. It was developed by the Okazaki Laboratory and Yokota Laboratory at Institute of Science Tokyo and AIST.

Key Details

Architecture: gpt_oss (Mixture of Experts — 21B total, 3.6B active)
Quantization: 6-bit (6.503 bits/weight)
Disk size: ~17 GB
Peak memory: ~17 GB
Converted with: mlx-lm 0.31.0

Note: This conversion required a custom patch to mlx-lm's gpt_oss model definition to handle the bf16 weight format used by the Swallow fine-tuned variant (the original OpenAI model uses MXFP4). The patch adds transpose and interleaved split handling for gate_up_proj / down_proj expert weights. See the Conversion Notes section below.

Why no 4-bit variant?

GPT-OSS uses a Mixture of Experts (MoE) architecture whose expert routing is sensitive to quantization. In our testing, 4-bit quantization (at both group size 64 and group size 32) caused the model's analysis channel to loop indefinitely on certain prompts. 6-bit is the lowest bit width we tested that maintained stable reasoning behavior.

Variants

Variant	Bits/weight	Disk size	Repo
6-bit	6.503	~17 GB	this repo
8-bit	8.503	~22 GB	tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-8bit-mlx
fp16	16	~40 GB	tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-fp16-mlx
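The bits/weight and disk-size columns can be roughly cross-checked with simple arithmetic. The sketch below assumes group-wise quantization with group size 64 and a 16-bit scale plus a 16-bit bias stored per group, which would account for the extra ~0.5 bits/weight; those storage details are an assumption, not something stated in this card. It also uses the rounded 21B parameter count, so the results are coarse.

```python
def bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight when each group also stores a scale and a bias.

    Assumption: one scale and one bias per group of `group_size` weights.
    """
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_gb(num_params, bpw):
    """Approximate size in GB (10^9 bytes) for `num_params` weights at `bpw` bits each."""
    return num_params * bpw / 8 / 1e9


TOTAL_PARAMS = 21e9  # rounded figure from Key Details

for label, bpw in [("6-bit", 6.503), ("8-bit", 8.503), ("fp16", 16.0)]:
    print(f"{label}: ~{estimated_size_gb(TOTAL_PARAMS, bpw):.1f} GB")
```

Under these assumptions the 6-bit and 8-bit estimates land close to the table's ~17 GB and ~22 GB. The fp16 estimate comes out slightly above ~40 GB, which is consistent with 21B being a rounded count.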

Usage

CLI

pip install mlx-lm

mlx_lm.generate \
  --model tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx \
  --prompt "日本の首都はどこですか？" \
  --max-tokens 200 \
  --trust-remote-code

mlx_lm.chat \
  --model tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx \
  --trust-remote-code

Python API

from mlx_lm import load, generate

model, tokenizer = load("tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx")

prompt = "Pythonでフィボナッチ数列を出力するコードを書いてください"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=500)

Model Details

Base model: openai/gpt-oss-20b
Fine-tuned by: tokyotech-llm (Institute of Science Tokyo + AIST)
Training: CPT (419B tokens) → SFT (1.1M samples) → RLVR
Harmony format: The model uses OpenAI's harmony response format with analysis/final channels
Reasoning effort: Configurable via system prompt ("Reasoning: low/medium/high")
Recommended generation parameters: Temperature=0.6, TopP=0.95, TopK=20, MinP=0
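As an illustration of the reasoning-effort setting, here is a minimal sketch that builds a chat message list with the effort level in the system prompt. The helper name `build_messages` and the `SAMPLING` dict are hypothetical; only the "Reasoning: low/medium/high" string and the parameter values come from the list above.

```python
VALID_EFFORTS = ("low", "medium", "high")

# Recommended generation parameters from the list above, kept as a plain dict.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}


def build_messages(user_prompt, effort="medium"):
    """Return chat messages whose system prompt selects the reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("日本の首都はどこですか？", effort="high")
```

The resulting list can be passed to `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` as in the Python API example. How the sampling values are supplied depends on the installed mlx-lm version; recent releases provide a `make_sampler` helper in `mlx_lm.sample_utils`, but check its signature before relying on it.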

Conversion Notes

The original GPT-OSS-20B stores MoE expert weights in MXFP4 format (gate_up_proj_blocks / gate_up_proj_scales). The Swallow variant was re-trained in bf16, producing standard gate_up_proj tensors with a different layout:

MXFP4 (original): [experts, out_features*2, ...] — split via interleave on second-to-last dim
bf16 (Swallow): [experts, in_features, out_features*2] — split via interleave on last dim, then transpose

The mlx-lm gpt_oss sanitize function was patched to detect bf16 weights (absence of _blocks/_scales keys) and apply the correct split + transpose. With mlx-lm 0.31.0, this patch is required for any GPT-OSS fine-tune that stores its weights in the bf16 Hugging Face format.
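The bf16 handling described above can be illustrated on small nested lists. This is a pure-Python sketch, not the actual patch: the function names are hypothetical, and treating even columns as gate and odd columns as up is an assumption about the interleaving order.

```python
def has_mxfp4_keys(weights):
    """True if any weight name carries the MXFP4 _blocks/_scales suffix."""
    return any(k.endswith("_blocks") or k.endswith("_scales") for k in weights)


def split_interleaved(gate_up):
    """Split [experts][in_features][out_features * 2] along the last dim.

    Assumption: even columns hold gate values and odd columns hold up values.
    """
    gate = [[row[0::2] for row in expert] for expert in gate_up]
    up = [[row[1::2] for row in expert] for expert in gate_up]
    return gate, up


def transpose_last_two(tensor):
    """[experts][in_features][out_features] -> [experts][out_features][in_features]."""
    return [[list(col) for col in zip(*expert)] for expert in tensor]


# One expert, in_features=2, out_features=3; negatives mark the "up" half.
gate_up_proj = [[[1, -1, 2, -2, 3, -3],
                 [4, -4, 5, -5, 6, -6]]]

weights = {"mlp.experts.gate_up_proj": gate_up_proj}  # hypothetical key name
if not has_mxfp4_keys(weights):
    gate, up = split_interleaved(gate_up_proj)
    gate_proj = transpose_last_two(gate)
    up_proj = transpose_last_two(up)
```

In a real conversion the same two steps would be array slicing and an axis swap on the framework's tensors; the list version only makes the index bookkeeping visible.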

License

Apache 2.0 (inherited from base model)

Citation

@misc{openai2025gptoss,
  title={gpt-oss-120b & gpt-oss-20b Model Card},
  author={OpenAI},
  year={2025},
  eprint={2508.10925},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Safetensors

Model size: 21B params
Tensor type: BF16, U32

Model tree for tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx

Base model: tokyotech-llm/GPT-OSS-Swallow-20B-SFT-v0.1
Finetuned: tokyotech-llm/GPT-OSS-Swallow-20B-RL-v0.1
Quantized (12): this model

Paper for tocchitocchi/GPT-OSS-Swallow-20B-RL-v0.1-6bit-mlx

gpt-oss-120b & gpt-oss-20b Model Card (arXiv 2508.10925, published Aug 8, 2025)