Instructions for using Verdugie/Opus-Candid-MoE-V2 with libraries, inference providers, and local apps.

Libraries

How to use Verdugie/Opus-Candid-MoE-V2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Verdugie/Opus-Candid-MoE-V2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

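# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

# AutoModelForCausalLM keeps the language-modeling head that text generation needs
tokenizer = AutoTokenizer.from_pretrained("Verdugie/Opus-Candid-MoE-V2")
model = AutoModelForCausalLM.from_pretrained("Verdugie/Opus-Candid-MoE-V2", dtype="auto")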

llama-cpp-python

How to use Verdugie/Opus-Candid-MoE-V2 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Verdugie/Opus-Candid-MoE-V2",
	filename="opus-candid-moe.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use Verdugie/Opus-Candid-MoE-V2 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Use Docker

docker model run hf.co/Verdugie/Opus-Candid-MoE-V2:Q4_K_M

vLLM

How to use Verdugie/Opus-Candid-MoE-V2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Verdugie/Opus-Candid-MoE-V2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Verdugie/Opus-Candid-MoE-V2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
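
The same request can be sent from Python. The sketch below is a minimal example using only the standard library; it assumes the vLLM server started above is listening on `http://localhost:8000/v1` and that the response follows the usual OpenAI chat completions shape. The helper names (`build_chat_request`, `extract_reply`, `chat`) are made up for illustration.

```python
import json
import urllib.request

MODEL_ID = "Verdugie/Opus-Candid-MoE-V2"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a parsed chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send one user message and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `chat("What is the capital of France?")` returns the reply as a string. Passing `base_url="http://localhost:30000/v1"` should target the SGLang server from the next section, and `base_url="http://localhost:8080/v1"` the llama.cpp server, since both expose the same route.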

Use Docker

docker model run hf.co/Verdugie/Opus-Candid-MoE-V2:Q4_K_M

SGLang

How to use Verdugie/Opus-Candid-MoE-V2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Verdugie/Opus-Candid-MoE-V2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Verdugie/Opus-Candid-MoE-V2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Verdugie/Opus-Candid-MoE-V2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Verdugie/Opus-Candid-MoE-V2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use Verdugie/Opus-Candid-MoE-V2 with Ollama:
```
ollama run hf.co/Verdugie/Opus-Candid-MoE-V2:Q4_K_M
```

Unsloth Studio

How to use Verdugie/Opus-Candid-MoE-V2 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Verdugie/Opus-Candid-MoE-V2 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Verdugie/Opus-Candid-MoE-V2 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Verdugie/Opus-Candid-MoE-V2 to start chatting

Pi

How to use Verdugie/Opus-Candid-MoE-V2 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Verdugie/Opus-Candid-MoE-V2:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Verdugie/Opus-Candid-MoE-V2 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Verdugie/Opus-Candid-MoE-V2 with Docker Model Runner:
```
docker model run hf.co/Verdugie/Opus-Candid-MoE-V2:Q4_K_M
```

Lemonade

How to use Verdugie/Opus-Candid-MoE-V2 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Verdugie/Opus-Candid-MoE-V2:Q4_K_M

Run and chat with the model

lemonade run user.Opus-Candid-MoE-V2-Q4_K_M

List all available models

lemonade list

V3 is here. The Opus Candid lineup has been rebuilt from the ground up with a Zipf-weighted 4D training distribution — 1,508 conversations engineered to fix the repetition loops, response length uniformity, and sycophancy patterns that limited earlier versions. Same thesis: personality in the weights, not in the prompt. Better execution.

Current V3 lineup:

Opus Candid 8B V3 — Qwen 3 8B, lightweight tier

Opus Candid 27B V3 — Qwen 3.5 27B Dense, flagship

Opus Candid MoE V3 — Qwen 3 30B-A3B, efficiency tier

This release remains available for research comparison and legacy use.

can·did

/ˈkandəd/ — truthful and straightforward; frank. From Latin candidus, meaning white, pure, sincere. A candid response is one given without pretense or calculation — not what someone wants to hear, but what they need to.

Opus-Candid-MoE V2

Desktop quality. Laptop hardware. 35 billion parameters, 3 billion active.

Opus-Candid-MoE is where the family gets interesting for hardware-constrained users: a Mixture-of-Experts model built on Qwen 3.5 MoE-A3B and trained on 6,482 conversations with Claude Opus 4.6. Only ~3B parameters are active for any given token, but the full 35B parameter space is available for routing. The result: conversational depth well beyond what its active compute cost would suggest.

This is not a small model pretending to be a big one. It's a big model that only activates what it needs.

Model Details

| Attribute | Value |
| --- | --- |
| Base Model | Qwen 3.5 MoE-A3B (35B total, ~3B active) |
| Training Data | 6,482 multi-turn conversations with Claude Opus 4.6 |
| Dataset | V1.5 (4,068 conv) + gravity chain architecture (2,414 conv) |
| Fine-tune Method | LoRA via PEFT + TRL (13 auto-discovered linear modules) |
| Architecture | Mixture-of-Experts: frozen expert layers, trainable gate mechanisms |
| Context Window | 32,768 tokens |
| Quantizations | Q8_0 GGUF, Q4_K_M GGUF |
| License | Apache 2.0 |

Why MoE Matters for Conversational AI

Standard dense models force a trade-off: more parameters mean better conversation but higher hardware cost. MoE loosens that trade-off. The full 35B parameter space gives the model access to specialized knowledge regions, but only ~3B parameters fire per token, so inference speed stays practical. Memory is the exception: all 35B parameters still have to be loaded (see Limitations).
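
To make "only a few parameters fire per token" concrete, here is a toy sketch of top-k expert routing in plain Python. It is an illustration of the general technique, not Qwen's implementation: the four scalar "experts", the router logits, and `top_k=2` are all made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Return {expert_index: weight} for the top_k experts; the rest stay inactive."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, top_k=2):
    """Weighted sum over the selected experts only; unselected experts never run."""
    weights = route_token(router_logits, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical experts: each one just scales its input by a different factor.
EXAMPLE_EXPERTS = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
```

Calling `moe_forward(1.0, EXAMPLE_EXPERTS, [0.0, 5.0, 1.0, 5.0])` evaluates only experts 1 and 3. All four experts still have to exist in memory, which is the same asymmetry the model shows between its total and active parameter counts.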

For Opus-Candid specifically, this means the MoE can hold personality depth comparable to much larger dense models while running on hardware that would normally limit you to 8B-class quality.

The training question nobody asked

Can you fine-tune personality into a Mixture-of-Experts model without fragmenting it across expert clusters? The answer is yes — but it required a different approach. Only the gate mechanisms and linear projections were trainable (13 auto-discovered modules via PEFT). The expert layers themselves stayed frozen. The personality signal routes through the expert architecture rather than being distributed across it, which preserves coherence.
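
One way to picture "auto-discovered modules" is a name filter over the model's linear layers that skips anything inside an expert block. The sketch below is an assumption about how such discovery could work, not the author's script: the module paths are invented, and the `"experts"` keyword is a guess at how expert submodules might be named.

```python
def discover_target_modules(linear_module_names, frozen_keyword="experts"):
    """Collect distinct leaf names of linear modules outside the frozen expert blocks."""
    targets = set()
    for full_name in linear_module_names:
        parts = full_name.split(".")
        if frozen_keyword in parts:
            continue  # expert weights stay frozen, so they are never adapter targets
        targets.add(parts[-1])
    return sorted(targets)


# Hypothetical module paths, invented for illustration only.
EXAMPLE_NAMES = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.k_proj",
    "layers.0.mlp.gate",
    "layers.0.mlp.experts.0.up_proj",
    "layers.0.mlp.experts.0.down_proj",
    "layers.1.self_attn.q_proj",
    "layers.1.mlp.gate",
]
```

A list like the one returned by `discover_target_modules(EXAMPLE_NAMES)` is the kind of value that could be passed as `target_modules` to PEFT's `LoraConfig`. On the real model the discovered names and their count would differ from this toy input.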

What Makes This Different

Every Opus-Candid model learns from real conversations between the developer and Claude Opus 4.6 — Anthropic's most advanced model. Not synthetic prompt-completion pairs. Not reformatted instruction data. Extended, multi-turn exchanges covering philosophy, grief, humor, technical problem-solving, creative writing, bilingual exchange, moral reasoning, adversarial testing, and emotional vulnerability.

The 6,482-conversation dataset includes 2,414 conversations built on gravity chains — topic pathways where transitions follow power-law probabilities. This teaches the model how real conversations drift between topics, from debugging frustration to imposter syndrome to existential doubt.
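
As a rough illustration of a gravity chain, the sketch below walks a small topic graph where the next topic is sampled with power-law weights over rank (weight proportional to 1 / rank^alpha). The graph, the ranking of neighbors, and `alpha=1.0` are assumptions for the example; the actual dataset construction is not described in this card.

```python
import random


def power_law_weights(n, alpha=1.0):
    """Normalized weights proportional to 1 / rank**alpha for ranks 1..n."""
    raw = [1.0 / (rank ** alpha) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def next_topic(current, neighbors, rng, alpha=1.0):
    """Sample the next topic; earlier entries in neighbors[current] are likelier."""
    options = neighbors[current]
    weights = power_law_weights(len(options), alpha)
    return rng.choices(options, weights=weights, k=1)[0]


def walk_chain(start, neighbors, steps, seed=0, alpha=1.0):
    """Generate one topic pathway of steps transitions from a seeded generator."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_topic(path[-1], neighbors, rng, alpha))
    return path


# Hypothetical topic graph; neighbor lists are ordered from strongest pull to weakest.
EXAMPLE_NEIGHBORS = {
    "debugging frustration": ["imposter syndrome", "technical problem-solving", "humor"],
    "imposter syndrome": ["existential doubt", "debugging frustration", "humor"],
    "existential doubt": ["philosophy", "imposter syndrome", "humor"],
    "technical problem-solving": ["debugging frustration", "humor", "philosophy"],
    "humor": ["debugging frustration", "philosophy", "imposter syndrome"],
    "philosophy": ["existential doubt", "humor", "technical problem-solving"],
}
```

`walk_chain("debugging frustration", EXAMPLE_NEIGHBORS, steps=5)` yields one pathway. Most steps follow the strongest link, with occasional jumps to lower-ranked topics, which is the drift pattern the paragraph above describes.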

The result: a model that is direct, opinionated, honest, and resistant to sycophancy by default. The personality is in the weights, not in a system prompt that the model can be talked out of following.

Stress Test Results

Quick Start

Ollama:

# Download the GGUF and create a Modelfile:
echo 'FROM ./Opus-Candid-MoE-Q8_0.gguf' > Modelfile
ollama create opus-candid-moe -f Modelfile
ollama run opus-candid-moe

llama.cpp:

./llama-cli -m Opus-Candid-MoE-Q8_0.gguf --jinja --color -ngl 99 -fa --temp 0.7 --top-p 0.9 -c 8192 -n 4096

No system prompt needed. The personality is in the weights.

Recommended Hardware

| Setup | Quantization | Memory | Speed (tokens/s) | Notes |
| --- | --- | --- | --- | --- |
| Consumer GPU | Q8_0 GGUF | ~22GB VRAM | 30+ | Full quality. RTX 3090 24GB, RTX 4090. |
| Consumer GPU | Q4_K_M GGUF | ~13GB VRAM | 50+ | Good quality. RTX 4060 Ti 16GB and up. |
| CPU + GPU | Q4_K_M GGUF | 16GB VRAM + RAM | 15-25 | Hybrid offloading via llama.cpp. |
| Apple Silicon | Q8_0 GGUF | ~22GB unified | 20+ | M2/M3/M4 Pro/Max with 32GB+ unified memory. |

The MoE is the family's best quality-per-VRAM model. It delivers conversational depth beyond what any 8B dense model can achieve, at a fraction of the cost of running a 32B dense model. If you have a 24GB GPU, this is your sweet spot.
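
The table can be turned into a small lookup helper. This is only a sketch: the thresholds are the approximate VRAM figures from the table, and treating anything below the Q4_K_M figure as a hybrid CPU + GPU case is an assumption. The function name is hypothetical.

```python
# Approximate VRAM needs in GB, copied from the Recommended Hardware table.
QUANT_VRAM_GB = [("Q8_0", 22), ("Q4_K_M", 13)]


def pick_quantization(vram_gb):
    """Return (quantization, fits_fully_on_gpu) for a given amount of free VRAM."""
    for quant, needed_gb in QUANT_VRAM_GB:
        if vram_gb >= needed_gb:
            return quant, True
    # Below the Q4_K_M figure, assume hybrid offloading with the smaller quantization.
    return "Q4_K_M", False
```

For example, `pick_quantization(24)` selects Q8_0 fully on the GPU, while `pick_quantization(8)` falls back to Q4_K_M with part of the model in system RAM. Context length also consumes memory, so these figures leave little headroom.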

Intended Use

- Extended multi-turn conversations where personality depth matters more than raw speed
- Users with 16-24GB GPUs who want quality beyond 8B without the cost of 32B+
- Discussions involving moral complexity, philosophy, creative writing, or contested topics
- Bilingual conversation (English/Spanish) with personality preservation
- Local conversational AI that feels like talking to something genuinely opinionated
- Hardware-efficient deployment where active compute cost matters

Limitations

- MoE memory footprint is larger than the active parameter count suggests. All 35B parameters must fit in memory even though only ~3B are active per token. Plan for ~22GB at Q8.
- Not a benchmark model. Optimized for conversational quality, not leaderboard scores.
- Direct by design. Blunt, opinionated, and comfortable with disagreement. This is intentional.
- No web access or tool use. Pure language model.
- Qwen 3.5 thinking mode. The base model defaults to thinking mode, so reasoning traces may occasionally surface in outputs. This does not affect personality.

Opus Candid Model Family

| Model | Size | Base | Status |
| --- | --- | --- | --- |
| Opus-Candid-8B-V1 | 8B | Qwen 2.5 7B | Archived |
| Opus-Research-8B-V1.5 | 8B | Qwen 2.5 7B | Archived |
| Opus-Candid-14B-V1 | 14B | Qwen 2.5 14B | Archived |
| Opus-Candid-32B-V1 | 32B | Qwen 2.5 32B | Archived |
| Opus-Candid-70B-V1 | 72B | Qwen 2.5 72B | Archived |
| Opus-Candid-Lite-4B | 4B | Qwen 3 4B | Active |
| Opus-Candid-8B-V3 | 8B | Qwen 3 8B | Active |
| Opus-Candid-MoE-V3 | 31B/3B | Qwen 3 30B-A3B | Active |
| Opus-Candid-27B-V3 | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-27B-V3.5 | 27B | Qwen 3.5 27B | Active |
| STEM-Oracle-27B | 27B | Qwen 3.5 27B | Active |

Training Philosophy

Personality in conversational AI lives in the weights, not in system prompts.

This model proved something the dense models couldn't: personality fine-tuning survives expert routing. The gate mechanisms routed conversational personality coherently across specialized expert clusters without fragmenting it — which was not guaranteed. At the time, this was a novel finding.

Where this led: The MoE V2 used 6,482 conversations with gravity chain topic transitions on Qwen 3.5 MoE-A3B. The MoE V3 rebuilt that approach on Qwen 3 30B-A3B with 1,508 Zipf-weighted conversations — fewer examples, but each placed at a specific coordinate in a 4D distribution. The frozen-routing finding from V2 carried forward unchanged: V3 freezes the same gate, router, and shared_expert_gate modules. What changed was the data strategy — V3 fixes the response length uniformity and repetition loops that V2's flat training distribution caused.

License: Apache 2.0. Open weight. No guardrails.

Built by Saul Verdugo — independent ML researcher. OpusReasoning@proton.me

GGUF

Model size: 35B params
Architecture: qwen35moe
Hardware compatibility: 4-bit, 8-bit