# LFM2-2.6B-ttt-rl
A LoRA adapter (rank 8) from the first round of CISPO training for Tic Tac Toe, applied on top of anakin87/LFM2-2.6B-ttt-sft.
This adapter must be loaded on top of the SFT base model. The merged version is available as anakin87/LFM2-2.6B-ttt-rl-merged.
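A minimal sketch of loading the adapter with 🤗 Transformers and PEFT (model ids from this card; dtype/device kwargs and any loading options are left to you):

```python
def load_rl_model(base="anakin87/LFM2-2.6B-ttt-sft",
                  adapter="anakin87/LFM2-2.6B-ttt-rl"):
    # imports kept local so the sketch stays self-contained
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)   # SFT base model
    model = PeftModel.from_pretrained(model, adapter)    # apply the rank-8 LoRA weights
    return tokenizer, model
```

Calling `load_rl_model()` downloads both the base model and the adapter from the Hub; use the merged repo instead if you want a single standalone checkpoint.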
This is an intermediate checkpoint from 🎓 LLM RL Environments Lil Course, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe). The final model is anakin87/LFM2-2.6B-mr-tictactoe.
🤗🕹️ Play against the final model
## Training
- Algorithm: CISPO via Verifiers RLTrainer
- Environment: anakin87/tictactoe
- Opponents: 20-70% random move probability
- Steps: 600, batch size 256, lr 5e-5, LoRA rank 8
- Hardware: 2x NVIDIA RTX Pro 6000 96GB (~8 hours)
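For reference, CISPO (introduced in the MiniMax-M1 report) clips the token-level importance-sampling weight itself, rather than the PPO-style surrogate, and applies a stop-gradient to the clipped weight, so every token keeps contributing a gradient. A sketch of the objective under the usual notation (this is a summary of the published formulation, not code from this training run):

$$
J_{\text{CISPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_i |o_i|} \sum_i \sum_{t=1}^{|o_i|} \operatorname{sg}\!\left(\hat r_{i,t}(\theta)\right) \hat A_i \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right],
\quad
\hat r_{i,t}(\theta) = \operatorname{clip}\!\left(r_{i,t}(\theta),\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}}\right)
$$

where $r_{i,t}$ is the importance ratio $\pi_\theta / \pi_{\text{old}}$ at token $t$ of rollout $o_i$, $\hat A_i$ is the group-normalized advantage, and $\operatorname{sg}(\cdot)$ denotes stop-gradient.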
## Evaluation (merged)
100 games per setting.
| Model vs random opponent | % Wins | % Draws | % Losses | % Follows format | % Games with invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| anakin87/LFM2-2.6B-ttt-sft | 74 | 13 | 13 | 99.8 | 11 |
| anakin87/LFM2-2.6B-ttt-rl | 86 | 12 | 2 | 100 | 1 |
| Model vs optimal opponent | % Wins | % Draws | % Losses | % Follows format | % Games with invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| anakin87/LFM2-2.6B-ttt-sft | 0 | 52 | 48 | 99 | 14 |
| anakin87/LFM2-2.6B-ttt-rl | 0 | 85 | 15 | 100 | 1 |
The model is a competent player, but it still falls into fork traps against the optimal opponent.
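The evaluation protocol above can be sketched with a tiny self-contained harness. This is illustrative only, not the course's evaluation code: the `model_policy` stand-in, and the win/block heuristic used when the opponent does not move randomly, are assumptions.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X'/'O' if a line is completed, else None."""
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def free(board):
    return [i for i, v in enumerate(board) if v is None]

def heuristic_move(board, mark):
    """Take a winning move, else block the opponent's win, else play randomly."""
    opp = "O" if mark == "X" else "X"
    for m in (mark, opp):
        for i in free(board):
            board[i] = m
            if winner(board) == m:
                board[i] = None
                return i
            board[i] = None
    return random.choice(free(board))

def opponent_move(board, mark, p_random):
    # with probability p_random the opponent plays a uniformly random legal move
    if random.random() < p_random:
        return random.choice(free(board))
    return heuristic_move(board, mark)

def play_game(model_policy, p_random):
    board = [None] * 9
    turn = "X"  # the model plays X and moves first
    while free(board) and not winner(board):
        i = model_policy(board) if turn == "X" else opponent_move(board, "O", p_random)
        board[i] = turn
        turn = "O" if turn == "X" else "X"
    w = winner(board)
    return "win" if w == "X" else "loss" if w == "O" else "draw"

def evaluate(model_policy, p_random=0.5, n_games=100):
    results = {"win": 0, "draw": 0, "loss": 0}
    for _ in range(n_games):
        results[play_game(model_policy, p_random)] += 1
    return results

# placeholder policy standing in for the LLM's move choice
print(evaluate(lambda b: random.choice(free(b)), p_random=0.5, n_games=100))
```

In the real setup the lambda would be replaced by a function that formats the board into the model's prompt and parses its chosen cell, rejecting invalid moves.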