# LFM2-2.6B-ttt-rl
A LoRA adapter (rank 8) from the first round of CISPO training for Tic Tac Toe, applied on top of anakin87/LFM2-2.6B-ttt-sft.
This adapter must be loaded on top of the SFT base model. The merged version is available as anakin87/LFM2-2.6B-ttt-rl-merged.
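A minimal sketch of loading the adapter with 🤗 Transformers and PEFT (model ids from this card; dtype/device kwargs and any loading options are left to you):

```python
def load_rl_model(base="anakin87/LFM2-2.6B-ttt-sft",
                  adapter="anakin87/LFM2-2.6B-ttt-rl"):
    # imports kept local so the sketch stays self-contained
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)   # SFT base model
    model = PeftModel.from_pretrained(model, adapter)    # apply the rank-8 LoRA weights
    return tokenizer, model
```

Calling `load_rl_model()` downloads both the base model and the adapter from the Hub; use the merged repo instead if you want a single standalone checkpoint.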
This is an intermediate checkpoint from 🎓 LLM RL Environments Lil Course, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe). The final model is anakin87/LFM2-2.6B-mr-tictactoe.
🤗🕹️ Play against the final model
## Training
- Algorithm: CISPO via Verifiers RLTrainer
- Environment: anakin87/tictactoe
- Opponents: 20-70% random move probability
- Steps: 600, batch size 256, lr 5e-5, LoRA rank 8
- Hardware: 2x NVIDIA RTX Pro 6000 96GB (~8 hours)
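For reference, CISPO (introduced in the MiniMax-M1 report) clips the token-level importance-sampling weight itself, rather than the PPO-style surrogate, and applies a stop-gradient to the clipped weight, so every token keeps contributing a gradient. A sketch of the objective under the usual notation (this is a summary of the published formulation, not code from this training run):

$$
J_{\text{CISPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_i |o_i|} \sum_i \sum_{t=1}^{|o_i|} \operatorname{sg}\!\left(\hat r_{i,t}(\theta)\right) \hat A_i \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right],
\quad
\hat r_{i,t}(\theta) = \operatorname{clip}\!\left(r_{i,t}(\theta),\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}}\right)
$$

where $r_{i,t}$ is the importance ratio $\pi_\theta / \pi_{\text{old}}$ at token $t$ of rollout $o_i$, $\hat A_i$ is the group-normalized advantage, and $\operatorname{sg}(\cdot)$ denotes stop-gradient.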
## Evaluation (merged)
100 games per setting.
| Model vs random opponent | % Wins | % Draws | % Losses | % Follows format | % Games with invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| anakin87/LFM2-2.6B-ttt-sft | 74 | 13 | 13 | 99.8 | 11 |
| anakin87/LFM2-2.6B-ttt-rl | 86 | 12 | 2 | 100 | 1 |
| Model vs optimal opponent | % Wins | % Draws | % Losses | % Follows format | % Games with invalid moves |
|---|---|---|---|---|---|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| anakin87/LFM2-2.6B-ttt-sft | 0 | 52 | 48 | 99 | 14 |
| anakin87/LFM2-2.6B-ttt-rl | 0 | 85 | 15 | 100 | 1 |
The model is a competent player, but it still falls into fork traps against the optimal opponent.
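The evaluation protocol above can be sketched with a tiny self-contained harness. This is illustrative only, not the course's evaluation code: the `model_policy` stand-in, and the win/block heuristic used when the opponent does not move randomly, are assumptions.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X'/'O' if a line is completed, else None."""
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def free(board):
    return [i for i, v in enumerate(board) if v is None]

def heuristic_move(board, mark):
    """Take a winning move, else block the opponent's win, else play randomly."""
    opp = "O" if mark == "X" else "X"
    for m in (mark, opp):
        for i in free(board):
            board[i] = m
            if winner(board) == m:
                board[i] = None
                return i
            board[i] = None
    return random.choice(free(board))

def opponent_move(board, mark, p_random):
    # with probability p_random the opponent plays a uniformly random legal move
    if random.random() < p_random:
        return random.choice(free(board))
    return heuristic_move(board, mark)

def play_game(model_policy, p_random):
    board = [None] * 9
    turn = "X"  # the model plays X and moves first
    while free(board) and not winner(board):
        i = model_policy(board) if turn == "X" else opponent_move(board, "O", p_random)
        board[i] = turn
        turn = "O" if turn == "X" else "X"
    w = winner(board)
    return "win" if w == "X" else "loss" if w == "O" else "draw"

def evaluate(model_policy, p_random=0.5, n_games=100):
    results = {"win": 0, "draw": 0, "loss": 0}
    for _ in range(n_games):
        results[play_game(model_policy, p_random)] += 1
    return results

# placeholder policy standing in for the LLM's move choice
print(evaluate(lambda b: random.choice(free(b)), p_random=0.5, n_games=100))
```

In the real setup the lambda would be replaced by a function that formats the board into the model's prompt and parses its chosen cell, rejecting invalid moves.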