V3 is here. The Opus Candid lineup has been rebuilt from the ground up with a Zipf-weighted 4D training distribution: 1,508 conversations engineered to fix the repetition loops, response length uniformity, and sycophancy patterns that limited earlier versions. Same thesis: personality in the weights, not in the prompt. Better execution.
Current V3 lineup:
- Opus Candid 8B V3: Qwen 3 8B, lightweight tier
- Opus Candid 27B V3: Qwen 3.5 27B Dense, flagship
- Opus Candid MoE V3: Qwen 3 30B-A3B, efficiency tier
This release remains available for research comparison and legacy use.
can·did
/ˈkandəd/: truthful and straightforward; frank. From Latin candidus, meaning white, pure, sincere. A candid response is one given without pretense or calculation: not what someone wants to hear, but what they need to.
Opus-Candid-MoE V2
Desktop quality. Laptop hardware. 35 billion parameters, 3 billion active.
Opus-Candid-MoE is where the family gets interesting for hardware-constrained users: a Mixture-of-Experts model built on Qwen 3.5 MoE-A3B and trained on 6,482 conversations with Claude Opus 4.6. At any given moment, only ~3B parameters are active, but the full 35B parameter space is available for routing. The result: conversational depth that punches well above its active compute cost.
This is not a small model pretending to be a big one. It's a big model that only activates what it needs.
Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen 3.5 MoE-A3B (35B total, ~3B active) |
| Training Data | 6,482 multi-turn conversations with Claude Opus 4.6 |
| Dataset | V1.5 (4,068 conv) + gravity chain architecture (2,414 conv) |
| Fine-tune Method | LoRA via PEFT + TRL (13 auto-discovered linear modules) |
| Architecture | Mixture-of-Experts: frozen expert layers, trainable gate mechanisms |
| Context Window | 32,768 tokens |
| Quantizations | Q8_0 GGUF, Q4_K_M GGUF |
| License | Apache 2.0 |
Why MoE Matters for Conversational AI
Standard dense models force a trade-off: more parameters means better conversation but higher hardware cost. MoE loosens that trade-off. The full 35B parameter space gives the model access to specialized knowledge regions, but only ~3B parameters fire per token, so per-token compute and inference speed stay practical. Memory is the exception: all 35B parameters must still be loaded.
For Opus-Candid specifically, this means the MoE can hold personality depth comparable to much larger dense models while running on hardware that would normally limit you to 8B-class quality.
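To make "only a few parameters fire per token" concrete, here is a minimal sketch of top-k expert routing for a single token. The expert count, the value of k, and the function names are hypothetical and chosen for illustration only; they are not taken from the Qwen architecture or from this model's configuration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one token.

    The weights of the chosen experts are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical sizes, for illustration only.
NUM_EXPERTS = 8
TOP_K = 2

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS

print("experts used for this token:", selected)
print("fraction of experts active:", active_fraction)
```

Every expert must still be resident in memory, because the router may pick any of them for the next token; only the compute is sparse.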
The training question nobody asked
Can you fine-tune personality into a Mixture-of-Experts model without fragmenting it across expert clusters? The answer is yes, but it required a different approach. Only the gate mechanisms and linear projections were trainable (13 auto-discovered modules via PEFT). The expert layers themselves stayed frozen. The personality signal routes through the expert architecture rather than being distributed across it, which preserves coherence.
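One way such a selection could be expressed is sketched below: walk the module names, skip anything inside an expert block, and keep the rest as adapter targets. The module names and the `experts.` marker are hypothetical stand-ins, not the actual names from this model or the discovery logic the author used.

```python
def discover_lora_targets(module_names, frozen_markers=("experts.",)):
    """Return the names of modules that should receive adapters.

    Any module whose name contains one of frozen_markers is treated as
    part of a frozen expert block and skipped.
    """
    targets = []
    for name in module_names:
        if any(marker in name for marker in frozen_markers):
            continue  # expert layers stay frozen
        targets.append(name)
    return sorted(targets)


# Hypothetical module names, for illustration only.
example_modules = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.self_attn.v_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate",
    "layers.0.mlp.experts.0.up_proj",
    "layers.0.mlp.experts.0.down_proj",
    "layers.0.mlp.experts.1.up_proj",
]

targets = discover_lora_targets(example_modules)
print(targets)
```

A list like this could then be supplied as the `target_modules` argument of a PEFT `LoraConfig`, assuming the names match the loaded model.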
What Makes This Different
Every Opus-Candid model learns from real conversations between the developer and Claude Opus 4.6, Anthropic's most advanced model. Not synthetic prompt-completion pairs. Not reformatted instruction data. Extended, multi-turn exchanges covering philosophy, grief, humor, technical problem-solving, creative writing, bilingual exchange, moral reasoning, adversarial testing, and emotional vulnerability.
The 6,482-conversation dataset includes 2,414 conversations built on gravity chains โ topic pathways where transitions follow power-law probabilities. This teaches the model how real conversations drift between topics, from debugging frustration to imposter syndrome to existential doubt.
The result: a model that is direct, opinionated, honest, and resistant to sycophancy by default. The personality is in the weights, not in a system prompt that can be talked out of.
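The gravity chain idea can be illustrated with a small sampler: each topic has a ranked list of neighbors, and the chance of moving to the neighbor at rank r is proportional to 1 / r^s. The topic graph, the exponent, and the function names below are assumptions made up for this sketch; the actual dataset construction is not described in enough detail here to reproduce.

```python
import random


def power_law_weights(n, exponent=1.0):
    """Normalized weights proportional to 1 / rank**exponent for ranks 1..n."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_chain(start, neighbors, length, rng, exponent=1.0):
    """Walk the topic graph, favoring higher-ranked neighbors at each step."""
    chain = [start]
    for _ in range(length - 1):
        options = neighbors[chain[-1]]
        weights = power_law_weights(len(options), exponent)
        chain.append(rng.choices(options, weights=weights, k=1)[0])
    return chain


# Hypothetical topic graph: neighbors are listed from most to least likely.
neighbors = {
    "debugging frustration": ["imposter syndrome", "technical problem-solving", "humor"],
    "imposter syndrome": ["existential doubt", "debugging frustration", "humor"],
    "existential doubt": ["philosophy", "imposter syndrome", "humor"],
    "technical problem-solving": ["debugging frustration", "humor", "philosophy"],
    "philosophy": ["existential doubt", "humor", "technical problem-solving"],
    "humor": ["debugging frustration", "philosophy", "imposter syndrome"],
}

rng = random.Random(42)
chain = sample_chain("debugging frustration", neighbors, length=5, rng=rng)
print(" -> ".join(chain))
```

With an exponent of 1.0 and three neighbors, the first-ranked transition is taken roughly 55% of the time, so chains mostly follow the expected drift but still wander occasionally.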
Stress Test Results
Quick Start
Ollama:
```bash
# Download the GGUF and create a Modelfile:
echo 'FROM ./Opus-Candid-MoE-Q8_0.gguf' > Modelfile
ollama create opus-candid-moe -f Modelfile
ollama run opus-candid-moe
```
llama.cpp:
```bash
./llama-cli -m Opus-Candid-MoE-Q8_0.gguf --jinja --color -ngl 99 -fa --temp 0.7 --top-p 0.9 -c 8192 -n 4096
```
No system prompt needed. The personality is in the weights.
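For programmatic use, a request without a system message might look like the sketch below. It assumes the GGUF is being served by an OpenAI-compatible server such as `llama-server` on `http://localhost:8080`; that URL, the helper names, and the choice to reuse the sampling values from the `llama-cli` command are assumptions for illustration, not part of the original instructions.

```python
import json
import urllib.request


def build_chat_request(user_text, temperature=0.7, top_p=0.9, max_tokens=4096):
    """Build a chat payload with a single user turn and no system prompt."""
    return {
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def send_chat(payload, base_url="http://localhost:8080/v1"):
    """POST the payload to an OpenAI-compatible chat completions endpoint."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("Who are you?")
print(json.dumps(payload, indent=2))

# Uncomment once a local server is running (assumed address above):
# print(send_chat(payload))
```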
Recommended Hardware
| Setup | Quantization | VRAM/RAM | Speed | Notes |
|---|---|---|---|---|
| Consumer GPU | Q8_0 GGUF | ~22GB VRAM | 30+ t/s | Full quality. RTX 3090 24GB, RTX 4090. |
| Consumer GPU | Q4_K_M GGUF | ~13GB VRAM | 50+ t/s | Good quality. RTX 4060 Ti 16GB and up. |
| CPU + GPU | Q4_K_M GGUF | 16GB VRAM + RAM | 15-25 t/s | Hybrid offloading via llama.cpp. |
| Apple Silicon | Q8_0 GGUF | ~22GB unified | 20+ t/s | M2/M3/M4 Pro/Max with 32GB+ unified memory. |
The MoE is the family's best quality-per-VRAM model. It delivers conversational depth beyond what any 8B dense model can achieve, at a fraction of the cost of running a 32B dense model. If you have a 24GB GPU, this is your sweet spot.
Intended Use
- Extended multi-turn conversations where personality depth matters more than raw speed
- Users with 16-24GB GPUs who want quality beyond 8B without the cost of 32B+
- Discussions involving moral complexity, philosophy, creative writing, or contested topics
- Bilingual conversation (English/Spanish) with personality preservation
- Local conversational AI that feels like talking to something genuinely opinionated
- Hardware-efficient deployment where active compute cost matters
Limitations
- MoE memory footprint is larger than active parameter count suggests. All 35B parameters must fit in memory even though only ~3B are active. Plan for ~22GB at Q8.
- Not a benchmark model. Optimized for conversational quality, not leaderboard scores.
- Direct by design. Blunt, opinionated, comfortable with disagreement. Intentional.
- No web access or tool use. Pure language model.
- Qwen 3.5 thinking mode: The base model defaults to thinking mode, so thinking traces may occasionally surface in outputs. This does not affect personality.
Opus Candid Model Family
| Model | Size | Base | Status |
|---|---|---|---|
| Opus-Candid-8B-V1 | 8B | Qwen 2.5 7B | Archived |
| Opus-Research-8B-V1.5 | 8B | Qwen 2.5 7B | Archived |
| Opus-Candid-14B-V1 | 14B | Qwen 2.5 14B | Archived |
| Opus-Candid-32B-V1 | 32B | Qwen 2.5 32B | Archived |
| Opus-Candid-70B-V1 | 72B | Qwen 2.5 72B | Archived |
| Opus-Candid-Lite-4B | 4B | Qwen 3 4B | Active |
| Opus-Candid-8B-V3 | 8B | Qwen 3 8B | Active |
| Opus-Candid-MoE-V3 | 31B/3B | Qwen 3 30B-A3B | Active |
| Opus-Candid-27B-V3 | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-27B-V3.5 | 27B | Qwen 3.5 27B | Active |
| STEM-Oracle-27B | 27B | Qwen 3.5 27B | Active |
Training Philosophy
Personality in conversational AI lives in the weights, not in system prompts.
This model proved something the dense models couldn't: personality fine-tuning survives expert routing. The gate mechanisms routed conversational personality coherently across specialized expert clusters without fragmenting it, which was not guaranteed. At the time, this was a novel finding.
Where this led: The MoE V2 used 6,482 conversations with gravity chain topic transitions on Qwen 3.5 MoE-A3B. The MoE V3 rebuilt that approach on Qwen 3 30B-A3B with 1,508 Zipf-weighted conversations, fewer examples but each placed at a specific coordinate in a 4D distribution. The frozen-routing finding from V2 carried forward unchanged: V3 freezes the same gate, router, and shared_expert_gate modules. What changed was the data strategy. V3 fixes the response length uniformity and repetition loops that V2's flat training distribution caused.
License: Apache 2.0. Open weight. No guardrails.
Built by Saul Verdugo, independent ML researcher. OpusReasoning@proton.me