🌟 Qwen2.5-Coder-3B — Claude Opus 4.6 Reasoning Distilled
A compact, fast coding model that runs locally, fine-tuned from Qwen2.5-Coder-3B-Instruct on high-quality reasoning trajectories distilled from Claude Opus 4.6. It is designed to run on consumer hardware with as little as 4 GB of VRAM, reaching ~88 tokens/sec on an RTX 3050 with the Q4_K_M quantization.
💡 Model Introduction
Qwen2.5-Coder-3B-Claude-Opus-4.6-Distilled combines the strong code generation foundation of Qwen2.5-Coder with the structured, step-by-step reasoning style of Claude Opus 4.6. Through supervised fine-tuning (SFT) with LoRA, the model learns to reason through a problem inside <think> tags before giving a precise, well-structured answer.
Unlike larger distilled models, this 3B model is built for local inference: it is fast, private, and fits comfortably in 4 GB of VRAM.
🧠 Reasoning Style
The model adopts Claude Opus's structured reasoning pattern:
<think>
Let me analyze this carefully.
1. Identify the core objective.
2. Break down into subcomponents.
3. Consider edge cases and constraints.
4. Formulate and verify the solution.
</think>
[Final clean answer here]
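Because the reasoning arrives inline with the answer, client code usually needs to separate the two. The following is a minimal sketch, assuming the reply contains at most one `<think>...</think>` pair as in the pattern above; `split_reasoning` and `ParsedReply` are hypothetical helper names, not part of any library.

```python
import re
from typing import NamedTuple


class ParsedReply(NamedTuple):
    reasoning: str  # text found inside <think>...</think>, "" if absent
    answer: str     # everything outside the think block


# Non-greedy match so only the first <think>...</think> pair is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(reply: str) -> ParsedReply:
    """Separate the <think> block from the final answer in a model reply."""
    match = _THINK_RE.search(reply)
    if match is None:
        # No reasoning block was emitted; treat the whole reply as the answer.
        return ParsedReply(reasoning="", answer=reply.strip())
    reasoning = match.group(1).strip()
    answer = (reply[: match.start()] + reply[match.end():]).strip()
    return ParsedReply(reasoning, answer)


if __name__ == "__main__":
    sample = "<think>\n1. Identify the core objective.\n</think>\nUse a dict for O(1) lookups."
    parsed = split_reasoning(sample)
    print(parsed.answer)
```

Keeping the reasoning as a separate field lets a UI hide it by default or log it for debugging without changing the answer text.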
🗺️ Training Pipeline
Base Model (Qwen/Qwen2.5-Coder-3B-Instruct)
│
▼
Supervised Fine-Tuning (SFT) + LoRA (r=16)
│ • 3,209 high-quality Claude reasoning samples
│ • Unsloth 2x faster training
│ • 1 epoch on T4 GPU (~46 mins)
│ • Final loss: 0.88
▼
Qwen2.5-Coder-3B-Claude-Opus-4.6-Distilled
📋 Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-Coder-3B-Instruct |
| Framework | Unsloth 2026.3 |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Trainable params | 29,933,568 (0.96%) |
| Batch size | 16 effective (4 per device × 4 gradient accumulation steps) |
| Learning rate | 2e-4 |
| Epochs | 1 |
| Max seq length | 4096 |
| Final train loss | 0.88 |
| GPU | Tesla T4 (16GB) |
| Training time | ~46 mins |
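The trainable parameter count in the table can be reproduced by hand. The sketch below assumes the layer dimensions from the public Qwen2.5-3B configuration (they are not stated in this card) and assumes LoRA was applied to all seven linear projections in every decoder layer; under those assumptions the arithmetic lands on the reported figure.

```python
# Assumed Qwen2.5-3B dimensions (taken from the public model config, not from this card).
HIDDEN = 2048          # hidden size
INTERMEDIATE = 11008   # MLP intermediate size
NUM_LAYERS = 36        # decoder layers
KV_DIM = 256           # 2 key/value heads x 128 head dim (grouped-query attention)
RANK = 16              # LoRA rank from the table above

# (in_features, out_features) per adapted projection.
# Assumption: all seven linear projections are LoRA targets.
TARGET_MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds matrix A (rank x in) and matrix B (out x rank)."""
    return rank * (in_features + out_features)


def total_lora_params(rank: int = RANK) -> int:
    """Sum adapter parameters over every target module in every layer."""
    per_layer = sum(lora_params(i, o, rank) for i, o in TARGET_MODULES.values())
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    print(f"{total_lora_params():,}")  # 29,933,568
```

Since the count scales linearly with rank, halving the rank to 8 would halve the trainable parameters.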
📚 Datasets Used
| Dataset | Samples | Purpose |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,326 | Claude 4.6 Opus reasoning trajectories |
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | High-intensity structured reasoning |
| Jackrong/Qwen3.5-reasoning-700x | 633 | Step-by-step reasoning diversity |
| Total | 3,209 | |
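As a quick check on the numbers, the sketch below sums the dataset sizes, reports each dataset's share of the mix, and estimates the number of optimizer steps in one epoch from the effective batch size of 16 given in Training Details. The step count is an approximation: it assumes the last partial batch still triggers an update, which depends on the trainer.

```python
import math

# Sample counts copied from the table above.
DATASETS = {
    "nohurry/Opus-4.6-Reasoning-3000x-filtered": 2326,
    "TeichAI/claude-4.5-opus-high-reasoning-250x": 250,
    "Jackrong/Qwen3.5-reasoning-700x": 633,
}

PER_DEVICE_BATCH = 4  # from Training Details
GRAD_ACCUM = 4        # from Training Details


def total_samples() -> int:
    return sum(DATASETS.values())


def mix_shares() -> dict:
    """Fraction of the training mix contributed by each dataset."""
    total = total_samples()
    return {name: count / total for name, count in DATASETS.items()}


def optimizer_steps(samples: int, per_device_batch: int, grad_accum: int, epochs: int = 1) -> int:
    """Approximate update count, assuming the final partial batch still steps."""
    effective_batch = per_device_batch * grad_accum
    return math.ceil(samples / effective_batch) * epochs


if __name__ == "__main__":
    print(total_samples())  # 3209
    print(optimizer_steps(total_samples(), PER_DEVICE_BATCH, GRAD_ACCUM))  # 201
    for name, share in mix_shares().items():
        print(f"{name}: {share:.1%}")
```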
🚀 Running Locally
Via Ollama (easiest)
ollama run hf.co/ryzdfm/qwen2.5-coder-3b-claude_opus_4.6-distilled
Via llama.cpp (for GPU acceleration)
./llama-cli.exe \
-m qwen2.5-coder-3b-claude_opus_4.6-distilled.Q4_K_M.gguf \
-ngl 99 \
--flash-attn on \
--jinja \
-cnv \
--repeat-penalty 1.1 \
-p "You are a helpful assistant that thinks step by step."
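Once the model is running, it can also be called from code. The following is a minimal sketch, assuming Ollama is serving the model on its default port (11434) and exposing its OpenAI-compatible `/v1/chat/completions` endpoint; `build_chat_payload` and `chat` are hypothetical helper names. For llama.cpp's `llama-server`, pass a different `base_url` (its default port is 8080).

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # assumption: Ollama on its default port
MODEL_ID = "hf.co/ryzdfm/qwen2.5-coder-3b-claude_opus_4.6-distilled"
SYSTEM_PROMPT = "You are a helpful assistant that thinks step by step."


def build_chat_payload(user_message: str, model: str = MODEL_ID,
                       system_prompt: str = SYSTEM_PROMPT) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "stream": False,  # ask for one complete JSON response
    }


def chat(user_message: str, base_url: str = OLLAMA_BASE_URL, timeout: float = 120.0) -> str:
    """Send one message to the local server and return the raw reply text."""
    body = json.dumps(build_chat_payload(user_message)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    # The reply may still contain the <think> block; strip it client-side if needed.
    return data["choices"][0]["message"]["content"]
```

Calling `chat("Write a function that reverses a linked list.")` requires the server to be running; the payload builder can be inspected offline.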
🌟 Core Capabilities
- Structured Reasoning: thinks through problems step by step in <think> blocks before answering
- Code Generation: built on Qwen2.5-Coder, strong at Python, JavaScript, algorithms
- Math & Logic: correctly solves multi-step problems with verification
- Fast Local Inference: 88 t/s on RTX 3050 4GB, fully GPU-accelerated
⚡ Hardware Requirements
| Quantization | VRAM | Speed (RTX 3050) |
|---|---|---|
| Q4_K_M (this file) | ~2.1 GB | ~88 t/s |
| Q3_K_M | ~1.7 GB | ~95 t/s |
| Q8_0 | ~3.3 GB | ~70 t/s |
Runs comfortably on 4GB VRAM laptops and desktops.
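To choose a quantization for a given GPU, the table can be turned into a small lookup. This is a rough sketch: the VRAM figures are copied from the table, while the 0.5 GB headroom for context and runtime overhead is an assumption, and `pick_quantization` is a hypothetical helper.

```python
from typing import Optional

# (quantization, approximate VRAM in GB) from the table, highest precision first.
QUANT_VRAM_GB = [
    ("Q8_0", 3.3),
    ("Q4_K_M", 2.1),
    ("Q3_K_M", 1.7),
]


def pick_quantization(vram_gb: float, headroom_gb: float = 0.5) -> Optional[str]:
    """Return the highest-precision quantization that fits, or None if none does.

    headroom_gb is an assumed reserve for the KV cache and runtime overhead.
    """
    budget = vram_gb - headroom_gb
    for name, needed in QUANT_VRAM_GB:
        if needed <= budget:
            return name
    return None


if __name__ == "__main__":
    print(pick_quantization(4.0))  # Q8_0
    print(pick_quantization(3.0))  # Q4_K_M
```

Q4_K_M remains a reasonable default on a 4 GB card when longer contexts or higher throughput matter more than precision.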
⚠️ Limitations
- 3B scale — will struggle with very long multi-file code generation or complex system design
- 1 epoch training — reasoning style is distilled but not as deep as larger models
- Hallucination risk — like all LLMs, may produce incorrect facts; always verify outputs
🙏 Acknowledgements
- Unsloth AI for making fine-tuning accessible on consumer hardware
- nohurry, TeichAI, and Jackrong for the high-quality distillation datasets
- Qwen team for the excellent Qwen2.5-Coder base model
📖 Citation
@misc{ryzdfm_qwen25coder_claude_distilled,
title = {Qwen2.5-Coder-3B Claude Opus 4.6 Reasoning Distilled},
author = {ryzdfm},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/ryzdfm/qwen2.5-coder-3b-claude_opus_4.6-distilled}}
}