GLM-5-GGUF-1.594bpw

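This is a 1.6 BPW quant of GLM-5 for the GPU-poor: machines with 128 GiB of system RAM and 24 GiB of VRAM.

The quant aims for best-in-class performance at this size by relying on the SOTA IQK quants from ik_llama.cpp. These quant types are specific to ik_llama.cpp, so the model needs ik_llama.cpp to run.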

Size

The FFN tensors take about 127 GiB. They are loaded into system RAM and partially into VRAM, leaving absolutely no space for anything else: no GUI, no syslog, no cronie, no chronyd. For the GPU-poor, every single bit matters.

The token_embd tensor takes about 510 MiB, and it goes into system RAM as well.

The other tensors take about 10.6 GiB and are loaded into VRAM, leaving some space for context, the compute buffer, and the few overflow FFN tensors.

Size from llama-server output:

llm_load_print_meta: model size       = 139.907 GiB (1.594 BPW)
llm_load_print_meta: repeating layers = 138.826 GiB (1.586 BPW, 751.961 B parameters)
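
As a sanity check, the BPW figure in the log follows directly from the size and the parameter count. The sketch below is only an illustration: the function name is made up, and it assumes the log's GiB means 2^30 bytes and "B parameters" means 10^9 parameters.

```python
GIB = 2**30  # bytes per GiB (assumed meaning of "GiB" in the log)

def bits_per_weight(size_gib: float, params_billion: float) -> float:
    """Average number of bits stored per parameter."""
    return size_gib * GIB * 8 / (params_billion * 1e9)

# Values copied from the "repeating layers" line of the llama-server output.
repeating_bpw = bits_per_weight(138.826, 751.961)
print(f"repeating layers: {repeating_bpw:.3f} BPW")
```

This reproduces the 1.586 BPW reported for the repeating layers.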

Buffer size with -cmoe --no-mmap (needs a small swap to load):

llm_load_tensors:        CPU buffer size = 129975.00 MiB
llm_load_tensors:  CUDA_Host buffer size =   510.47 MiB
llm_load_tensors:      CUDA0 buffer size = 10897.35 MiB

Buffer size with -ncmoe 74 --no-mmap (doesn't need swap):

llm_load_tensors:        CPU buffer size = 123043.00 MiB
llm_load_tensors:  CUDA_Host buffer size =   510.47 MiB
llm_load_tensors:      CUDA0 buffer size = 17829.35 MiB
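
To see why one layout needs swap and the other does not, the two logs can be converted to GiB and compared. This is a rough sketch using only the numbers printed above; it treats the CUDA_Host buffer as pinned system RAM and ignores everything else the machine needs (KV cache, compute buffers, the OS itself).

```python
MIB_PER_GIB = 1024

# Buffer sizes in MiB, copied from the two llm_load_tensors logs above.
layouts = {
    "-cmoe": {"CPU": 129975.00, "CUDA_Host": 510.47, "CUDA0": 10897.35},
    "-ncmoe 74": {"CPU": 123043.00, "CUDA_Host": 510.47, "CUDA0": 17829.35},
}

def to_gib(mib: float) -> float:
    return mib / MIB_PER_GIB

summary = {}
for name, buf in layouts.items():
    host_mib = buf["CPU"] + buf["CUDA_Host"]  # both live in system RAM
    summary[name] = {"host_gib": to_gib(host_mib), "vram_gib": to_gib(buf["CUDA0"])}
    print(f"{name:>10}: host {summary[name]['host_gib']:6.2f} GiB, "
          f"VRAM {summary[name]['vram_gib']:5.2f} GiB")

# How much moves from system RAM to VRAM when switching layouts.
moved_mib = layouts["-cmoe"]["CPU"] - layouts["-ncmoe 74"]["CPU"]
print(f"moved to VRAM: {moved_mib:.0f} MiB ({to_gib(moved_mib):.2f} GiB)")
```

With -cmoe the host side comes to roughly 127.4 GiB, which leaves almost nothing of the 128 GiB for the rest of the system; -ncmoe 74 shifts 6932 MiB of that into VRAM.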

Quality

Recipe

# Attention
blk\..*\.attn_k_b\.weight=q6_0
blk\..*\.attn_v_b\.weight=q6_0

blk\..*\.attn_kv_a_mqa\.weight=iq4_k
blk\..*\.attn_q_a\.weight=iq4_k
blk\..*\.attn_q_b\.weight=iq4_k
blk\..*\.attn_output\.weight=iq5_ks

# First 3 Dense Layers
blk\..*\.ffn_down\.weight=iq4_k
blk\..*\.ffn_(gate|up)\.weight=iq4_k

# Shared Expert Layers
blk\..*\.ffn_down_shexp\.weight=iq4_k
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_k

# Routed Experts Layers
blk\..*\.ffn_(up|gate|down)_exps\.weight=iq1_s_r4

# Indexer
blk\..*\.indexer\.proj\.weight=iq4_k
blk\..*\.indexer\.attn_k\.weight=iq4_k
blk\..*\.indexer\.attn_q_b\.weight=iq4_k

# NextN MTP Layer
blk\..*\.nextn\.embed_tokens\.weight=iq4_k
blk\..*\.nextn\.shared_head_head\.weight=iq4_k
blk\..*\.nextn\.eh_proj\.weight=iq4_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq5_ks
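
Each recipe line is a `regex=type` pair, so it can help to check which tensor a given pattern actually catches. The sketch below applies a subset of the rules with Python's `re` module. The tensor names and layer indices are hypothetical examples, and the "first matching rule wins, using a regex search" behaviour is an assumption about the quantizer, not something this card states.

```python
import re
from typing import Optional

# A subset of the recipe above, in the same order: (regex, quant type).
RULES = [
    (r"blk\..*\.attn_output\.weight", "iq5_ks"),
    (r"blk\..*\.ffn_down\.weight", "iq4_k"),
    (r"blk\..*\.ffn_down_shexp\.weight", "iq4_k"),
    (r"blk\..*\.ffn_(up|gate|down)_exps\.weight", "iq1_s_r4"),
    (r"token_embd\.weight", "iq4_k"),
    (r"output\.weight", "iq5_ks"),
]

def quant_for(tensor_name: str) -> Optional[str]:
    """Return the type of the first rule whose regex matches (assumed semantics)."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return None  # no rule matched; the quantizer's default would apply

# Hypothetical tensor names that follow the naming used in the recipe.
for name in [
    "blk.0.ffn_down.weight",
    "blk.10.ffn_down_exps.weight",
    "blk.10.ffn_down_shexp.weight",
    "blk.10.attn_output.weight",
    "output.weight",
]:
    print(f"{name:32} -> {quant_for(name)}")
```

Note that the dense-layer pattern `ffn_down\.weight` does not catch `ffn_down_exps` or `ffn_down_shexp`, because the suffix differs. That is why only the routed experts end up at iq1_s_r4.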

PPL result with wiki.test.raw:

Final estimate: PPL over 565 chunks for n_ctx=512 = 6.2248 +/- 0.03964

See the graph at https://huggingface.co/ubergarm/GLM-5-GGUF for comparison.

This quant uses the imatrix from unsloth, which seems to allow the model to perform more reliably in actual tasks.

With the imatrix from ubergarm, PPL is a bit better at 6.1469 +/- 0.03890, but performance in actual tasks is noticeably worse.
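
To put the two perplexity figures in perspective, the gap can be expressed as a percentage. This is plain arithmetic on the numbers above and says nothing about task quality.

```python
ppl_unsloth = 6.2248   # this quant, imatrix from unsloth
ppl_ubergarm = 6.1469  # same recipe, imatrix from ubergarm

gap = ppl_unsloth - ppl_ubergarm
gap_pct = 100 * gap / ppl_ubergarm
print(f"gap: {gap:.4f} PPL ({gap_pct:.2f}%)")
```

The difference is a little over 1%, small enough that behaviour in real tasks is the more useful tie-breaker.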

Flags

To get a usable context size, we have to sacrifice PP (prompt processing) speed by going with the much slower -mla 1, which uses less VRAM than the usual -mla 3.

These flags allow a context size of 75000:

-ot \.(73|74|75|76|77)\.ffn_down_exps=CUDA0 \
-ot \.(75|76|77)\.ffn_(up|gate)_exps=CUDA0 \
-ot exps=CPU \
-mla 1 -c 75000 -ctk q5_0 -khad \
-b 2048 -ub 2048 \
--jinja -cram 0 -mqkv -ger -cuda graphs=1

11 FFN tensors on GPU, the rest on CPU
-mla 1 to squeeze a 75000 context into VRAM with a q5_0 K cache, -khad to reduce quantization error
2048 batch size to allow GPU offload when processing larger prompts
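
The count of 11 FFN tensors on the GPU can be checked by replaying the three -ot rules against a list of expert tensor names. This is a sketch under stated assumptions: the tensor names and the layer range 3..77 are hypothetical (chosen to cover the indices that appear in the flags), and it assumes the first matching -ot rule wins.

```python
import re

# The three -ot rules above, in command-line order: (regex, buffer).
OVERRIDES = [
    (r"\.(73|74|75|76|77)\.ffn_down_exps", "CUDA0"),
    (r"\.(75|76|77)\.ffn_(up|gate)_exps", "CUDA0"),
    (r"exps", "CPU"),
]

def placement(tensor_name: str) -> str:
    """Buffer of the first matching override (assumed semantics)."""
    for pattern, buffer in OVERRIDES:
        if re.search(pattern, tensor_name):
            return buffer
    return "default"  # not covered by any -ot rule

# Hypothetical routed-expert tensor names for layers 3..77.
names = [
    f"blk.{layer}.ffn_{kind}_exps.weight"
    for layer in range(3, 78)
    for kind in ("up", "gate", "down")
]
on_gpu = [n for n in names if placement(n) == "CUDA0"]
print(len(on_gpu), "expert tensors on GPU")
```

Five ffn_down_exps tensors (layers 73 to 77) plus the up and gate tensors of layers 75 to 77 give the 11 tensors mentioned above. The catch-all `exps=CPU` rule has to come last, otherwise it would claim every expert tensor first.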

Tested to work well in both Q&A and agentic tasks of high difficulty.

Format: GGUF
Model size: 754B params
Architecture: glm-dsa

Base model: zai-org/GLM-5