Is the `microsoft/bitnet-b1.58-2B-4T` version missing a custom loader?

by juliaturc - opened May 21, 2025

Discussion

juliaturc

May 21, 2025

I'm trying to load microsoft/bitnet-b1.58-2B-4T with the Transformers library, as described in the documentation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "microsoft/bitnet-b1.58-2B-4T"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

I'm getting this error, which others have already reported:

Could not locate the configuration_bitnet.py inside microsoft/bitnet-b1.58-2B-4T

Looking at the repository, there doesn't seem to be a custom loader (there are no Python files). Shouldn't there be one?
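The "no Python files" observation can be checked programmatically by listing the repository's files and filtering for `.py` modules. This is a minimal sketch: `find_custom_code` and `has_config_module` are hypothetical helper names, and the example listing is illustrative, not the repository's actual contents.

```python
def find_custom_code(filenames):
    """Return the .py files found in a repo file listing, sorted."""
    return sorted(name for name in filenames if name.endswith(".py"))


def has_config_module(filenames, model_type="bitnet"):
    """True if the listing contains configuration_<model_type>.py."""
    return f"configuration_{model_type}.py" in filenames


# Illustrative listing (an assumption, not the real repo contents).
example_files = ["config.json", "model.safetensors", "tokenizer.json"]
print(find_custom_code(example_files))
print(has_config_module(example_files))

# Against the live Hub (needs network access and the huggingface_hub package):
# from huggingface_hub import list_repo_files
# files = list_repo_files("microsoft/bitnet-b1.58-2B-4T")
# print(find_custom_code(files))
```

If the list comes back empty, `trust_remote_code=True` has nothing to load from the repo, which would be consistent with the error above.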

More generally, could you clarify exactly how the values are packed? Inspecting the model.safetensors file, I see the weights stored as uint8, which makes me suspect the TL2 format (3 weights packed into 5 bits), but there's no way to tell without seeing the custom loader.
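The thread does not confirm the packing layout. For comparison with the TL2 guess, here is a sketch of a simpler scheme that would also produce uint8 storage: four ternary values per byte at 2 bits each. This is an assumption for illustration only, not the model's documented format, and `pack_ternary` and `unpack_ternary` are hypothetical helpers.

```python
def pack_ternary(values):
    """Pack ternary weights (-1, 0, 1) into bytes: four per byte, 2 bits each."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for slot, w in enumerate(values[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary value: {w}")
            byte |= (w + 1) << (2 * slot)  # store -1/0/1 as 0/1/2
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed):
    """Invert pack_ternary: recover four ternary weights from each byte."""
    out = []
    for byte in packed:
        for slot in range(4):
            out.append(((byte >> (2 * slot)) & 0b11) - 1)
    return out


weights = [-1, 0, 1, 1, 0, 0, -1, 1]
packed = pack_ternary(weights)
print(len(weights), "weights ->", len(packed), "bytes")
print(unpack_ternary(packed) == weights)
```

Under this assumed scheme a packed tensor holds exactly one quarter as many elements as the logical weight matrix (2 bits per weight), whereas 3 weights in 5 bits works out to roughly 1.67 bits per weight. Comparing the stored tensor shapes against the layer dimensions in config.json is one way to tell such schemes apart.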

BifrostTitan

May 22, 2025

Use this version of transformers:

pip install git+https://github.com/shumingma/transformers.git

These artifacts can be found in the 1bitllm/bitnet-1.58b-3b files, but they do not work for this model, since it uses the llama3 tokenizer rather than the bitnet tokenizer.
I believe we are waiting for an update to the conversion code so that it converts properly with llama3.

Attempt 1: converted successfully, but the model does not respond correctly.
I used convert-ms-to-gguf-bitnet.py to generate the GGUF, but judging by the inference output the tokenizer is broken: "the... the... ,, the..", etc.
Attempt 2: converted successfully, but the model does not respond correctly.
I modified convert-hf-to-gguf-bitnet.py to use BPE/gpt2, which produces broken words with no token comprehension, such as repeating the system prompt and then responding with 0.
Attempt 3: converted successfully, but the model responds with empty spaces repeating indefinitely.
Using the .model file from the 1bitllm artifacts, which has a 32k vocab, does not work with the SPM settings of either script, even with padding enabled.

I have been able to fine-tune this model with an SFT trainer, but converting it to GGUF has been the challenge without a proper tokenizer.model or BPE vocab conversion for llama3.
