Instructions for using CaaLM/CaaLM-v1-GGUF with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- llama-cpp-python
How to use CaaLM/CaaLM-v1-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="CaaLM/CaaLM-v1-GGUF",
    filename="CaaLM-v1-BF16.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
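Note that the generic "Once upon a time," prompt above is just the library quick-start; CaaLM-v1 is trained on a Code:/Output: prompt format (see Input Format below), so a sketch like the following is a better fit for this model:

output = llm(
    "Code:\nprint(2 + 3)\n\nOutput:\n",
    max_tokens=32,
    temperature=0,
)
print(output["choices"][0]["text"])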
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use CaaLM/CaaLM-v1-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf CaaLM/CaaLM-v1-GGUF:Q4_K_M
Use Docker
docker model run hf.co/CaaLM/CaaLM-v1-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use CaaLM/CaaLM-v1-GGUF with Ollama:
ollama run hf.co/CaaLM/CaaLM-v1-GGUF:Q4_K_M
- Unsloth Studio
How to use CaaLM/CaaLM-v1-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for CaaLM/CaaLM-v1-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for CaaLM/CaaLM-v1-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for CaaLM/CaaLM-v1-GGUF to start chatting
- Docker Model Runner
How to use CaaLM/CaaLM-v1-GGUF with Docker Model Runner:
docker model run hf.co/CaaLM/CaaLM-v1-GGUF:Q4_K_M
- Lemonade
How to use CaaLM/CaaLM-v1-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull CaaLM/CaaLM-v1-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.CaaLM-v1-GGUF-Q4_K_M
List all available models
lemonade list
CaaLM/CaaLM-v1-GGUF
Overview
This repository contains official GGUF quantizations of CaaLM/CaaLM-v1, provided by CaaLM.
CaaLM-v1 is a 1.5B parameter model that predicts the output of code, without a compiler, runtime, or interpreter. It was trained on real programming languages (Python, JavaScript, Lua, COBOL) alongside 200 synthetically generated fake programming languages, enabling it to predict execution output even for languages it has never seen before.
- Original model: CaaLM/CaaLM-v1
- Base model: Qwen/Qwen2.5-1.5B
- Architecture: Qwen2
- Parameters: 1,543.7M
- License: Apache 2.0
- Task: Code output prediction (text-generation)
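If you prefer to fetch a single quant file directly (for example to pass to llama.cpp with -m), the huggingface_hub client can download it; a minimal sketch using the Q4_K_M file listed in the table below:

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Downloads one quant file from this repo and returns its local path
path = hf_hub_download(
    repo_id="CaaLM/CaaLM-v1-GGUF",
    filename="CaaLM-v1-Q4_K_M.gguf",
)
print(path)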
Available Quants
All quantizations listed below are official releases from CaaLM.
| Filename | Quantization | Description | Recommended Use |
|---|---|---|---|
| CaaLM-v1-F32.gguf | F32 | Full 32-bit float | Maximum precision, highest VRAM |
| CaaLM-v1-F16.gguf | F16 | 16-bit float | High precision, large memory footprint |
| CaaLM-v1-BF16.gguf | BF16 | Brain float 16 | Good precision, modern hardware |
| CaaLM-v1-Q8_0.gguf | Q8_0 | 8-bit quantization | Near-lossless, recommended if you have the VRAM |
| CaaLM-v1-Q6_K.gguf | Q6_K | 6-bit K-quant | Excellent quality, good balance |
| CaaLM-v1-Q5_K_M.gguf | Q5_K_M | 5-bit K-quant (medium) | Recommended: great quality/size balance |
| CaaLM-v1-Q5_K_S.gguf | Q5_K_S | 5-bit K-quant (small) | Good quality, smaller than Q5_K_M |
| CaaLM-v1-Q5_1.gguf | Q5_1 | 5-bit legacy | Legacy format |
| CaaLM-v1-Q5_0.gguf | Q5_0 | 5-bit legacy | Legacy format |
| CaaLM-v1-Q4_K_M.gguf | Q4_K_M | 4-bit K-quant (medium) | Recommended: best 4-bit option |
| CaaLM-v1-Q4_K_S.gguf | Q4_K_S | 4-bit K-quant (small) | Smaller than Q4_K_M, slight quality drop |
| CaaLM-v1-Q4_1.gguf | Q4_1 | 4-bit legacy | Legacy format |
| CaaLM-v1-Q4_0.gguf | Q4_0 | 4-bit legacy | Legacy format, widely compatible |
| CaaLM-v1-IQ4_XS.gguf | IQ4_XS | 4-bit iQuant (extra small) | Smaller than Q4_K_S, competitive quality |
| CaaLM-v1-IQ4_NL.gguf | IQ4_NL | 4-bit iQuant (non-linear) | Good alternative to Q4_0 |
| CaaLM-v1-Q3_K_L.gguf | Q3_K_L | 3-bit K-quant (large) | Low memory, acceptable quality |
| CaaLM-v1-Q3_K_M.gguf | Q3_K_M | 3-bit K-quant (medium) | Low memory use |
| CaaLM-v1-Q3_K_S.gguf | Q3_K_S | 3-bit K-quant (small) | Very low memory use |
| CaaLM-v1-IQ3_M.gguf | IQ3_M | 3-bit iQuant (medium) | Better than Q3_K_M at similar size |
| CaaLM-v1-IQ3_S.gguf | IQ3_S | 3-bit iQuant (small) | Very small footprint |
| CaaLM-v1-Q2_K.gguf | Q2_K | 2-bit K-quant | Minimum quality, maximum compression |
| CaaLM-v1-TQ2_0.gguf | TQ2_0 | 2-bit ternary quant | Experimental ternary quantization |
| CaaLM-v1-TQ1_0.gguf | TQ1_0 | 1-bit ternary quant | Extreme compression, experimental |
Which Quant Should I Use?
By available memory:
| Available VRAM / RAM | Recommended Quant |
|---|---|
| 6 GB+ | Q8_0 |
| 4 GB+ | Q5_K_M or Q6_K |
| 3 GB+ | Q4_K_M |
| 2 GB+ | Q3_K_M or IQ3_M |
| < 2 GB | Q2_K (quality will degrade) |
General guidance: For most users, Q4_K_M or Q5_K_M offer the best trade-off between file size and output quality. If you need maximum fidelity, use Q8_0 or BF16.
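As a rough sanity check for the table above, a quant's file size can be estimated as parameters × bits-per-weight / 8. The bits-per-weight figures in this sketch are approximate illustrative values only, and real files also carry metadata, KV cache, and runtime overhead on top of the weights:

# Rough file-size estimate: parameters * bits-per-weight / 8
PARAMS = 1.5437e9  # 1,543.7M parameters (from the Overview above)

approx_bpw = {  # approximate, illustrative bits-per-weight values
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "Q3_K_M": 3.9,
    "Q2_K": 3.4,
}

for quant, bpw in approx_bpw.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant}: ~{gb:.2f} GB on disk, plus KV cache and runtime overhead")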
Usage
llama.cpp
./llama-cli \
-m CaaLM-v1-Q4_K_M.gguf \
-p "Code:\na = 6\nb = 7\nprint(a * b)\n\nOutput:\n" \
--temp 0 \
-n 64
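If you started llama-server instead of llama-cli (see the install options above), it exposes an OpenAI-compatible HTTP API. A minimal sketch that sends the same prompt, assuming the server is running on its default port 8080:

import json
import urllib.request

# Query llama-server's OpenAI-compatible completions endpoint (default port 8080)
req = urllib.request.Request(
    "http://localhost:8080/v1/completions",
    data=json.dumps({
        "prompt": "Code:\na = 6\nb = 7\nprint(a * b)\n\nOutput:\n",
        "max_tokens": 64,
        "temperature": 0,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["text"].strip())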
Ollama
# Create a Modelfile
cat > Modelfile <<EOF
FROM ./CaaLM-v1-Q4_K_M.gguf
PARAMETER temperature 0
PARAMETER stop "<|im_end|>"
SYSTEM "You predict the output of code snippets."
EOF
ollama create caalm-v1 -f Modelfile
ollama run caalm-v1
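Once the caalm-v1 model has been created, Ollama's local REST API (default port 11434) can also be called from code; a minimal sketch:

import json
import urllib.request

# Call Ollama's /api/generate endpoint with the model created above
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "caalm-v1",
        "prompt": "Code:\na = 6\nb = 7\nprint(a * b)\n\nOutput:\n",
        "stream": False,
        "options": {"temperature": 0},
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"].strip())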
Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(
    model_path="CaaLM-v1-Q4_K_M.gguf",
    n_ctx=512,
)

def predict_output(code: str) -> str:
    prompt = f"Code:\n{code}\n\nOutput:\n"
    result = llm(
        prompt,
        max_tokens=128,
        temperature=0,
        stop=["<|im_end|>", "\n\n\n"],
    )
    return result["choices"][0]["text"].strip()
# Real language
print(predict_output("a = 6\nb = 7\nprint(a * b)"))
# -> 42
# Novel fake language
print(predict_output("STORE X := 10\nSTORE Y := 5\nSPEAK X + Y"))
# -> 15
Input Format
Always use the following prompt format; the model completes the Output: section:
Code:
<your code here>
Output:
Example: Python
Code:
a = 10
b = 20
print(a + b)
Output:
30
Example: Novel Fake Language (never seen during training)
Code:
SCRIBBLE @x BECOMES 7
SCRIBBLE @y BECOMES 3
YELL @x + @y
Output:
10
Performance
Overall benchmark accuracy: 96.2% (50/52 tests)
| Category | Accuracy | Passed/Total |
|---|---|---|
| Real: Python | 100% | 10/10 |
| Real: JavaScript | 100% | 8/8 |
| Real: Lua | 100% | 6/6 |
| Real: COBOL | 75% | 3/4 |
| Novel Fake: Tier 1 (assign + print) | 100% | 8/8 |
| Novel Fake: Tier 2 (conditionals) | 86% | 6/7 |
| Novel Fake: Tier 3 (loops) | 100% | 4/4 |
| Edge Cases | 100% | 5/5 |
For full benchmark details and known failure cases, see the original model card.
Supported Operations
The model reliably handles:
- Variable assignment and arithmetic
- Print / output statements
- Conditionals (if/else)
- While loops with accumulator patterns (see the sketch after this list)
- String output
- Basic error behavior (empty output when conditions not met)
It does not reliably handle: functions, recursion, file I/O, complex data structures, pipes, or multi-line string manipulation.
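For example, a while-loop accumulator in an invented language follows the same Code:/Output: prompt. The keywords and variable names below are made up for illustration; the ground-truth result of this program is 15, which the model is expected to predict:

from llama_cpp import Llama

llm = Llama(model_path="CaaLM-v1-Q4_K_M.gguf", n_ctx=512)

# Hypothetical fake-language program: a while loop accumulating 1..5
code = (
    "SET total TO 0\n"
    "SET i TO 1\n"
    "WHILE i <= 5 DO\n"
    "  SET total TO total + i\n"
    "  SET i TO i + 1\n"
    "END\n"
    "SHOUT total"
)
result = llm(
    f"Code:\n{code}\n\nOutput:\n",
    max_tokens=32,
    temperature=0,
    stop=["<|im_end|>", "\n\n\n"],
)
print(result["choices"][0]["text"].strip())  # ground truth: 15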
Limitations
- No actual code execution: outputs are predictions, not guarantees (see the spot-check sketch below)
- If-without-else edge cases may produce hallucinated else branches
- COBOL numeric padding format is inconsistent
- Long programs may degrade in accuracy as state complexity grows
- Context window is limited to ~512 tokens
- Quantization at Q3 and below may introduce additional errors vs. the original model
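Since the model never executes anything, it can be worth spot-checking predictions against a real interpreter when one is available. A minimal sketch for Python snippets, assuming the predict_output helper from the Usage section above:

import subprocess

def run_python(code: str) -> str:
    """Execute a Python snippet for real and capture its stdout (ground truth)."""
    proc = subprocess.run(
        ["python3", "-c", code], capture_output=True, text=True, timeout=5
    )
    return proc.stdout.strip()

code = "a = 6\nb = 7\nprint(a * b)"
predicted = predict_output(code)  # helper defined in the Usage section above
actual = run_python(code)
print("match" if predicted == actual else f"mismatch: {predicted!r} vs {actual!r}")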
Model Lineage
| Model | Base | Description |
|---|---|---|
| LaaLM-v1 | T5-base | Fine-tuned to simulate Linux shell commands |
| LaaLM-exp-v1 | Qwen 3B | Conversational Linux terminal emulation |
| CaaLM-v1 | Qwen 1.5B | Language-agnostic code output prediction (this model) |
License
Apache 2.0, inherited from the Qwen 2.5 base model and the original CaaLM-v1.
Links
- Original model: CaaLM/CaaLM-v1
- Demo Space: CaaLM-v1-Demo
- Base model: Qwen/Qwen2.5-1.5B