Is the `microsoft/bitnet-b1.58-2B-4T` version missing a custom loader?
I'm trying to load microsoft/bitnet-b1.58-2B-4T with the Transformers library, as per the documentation:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "microsoft/bitnet-b1.58-2B-4T"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)
And I'm getting this error, which was reported by others already:
Could not locate the configuration_bitnet.py inside microsoft/bitnet-b1.58-2B-4T
Looking at the repository, there doesn't seem to be a custom loader (there are no python files). Shouldn't there be one?
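For anyone debugging the same error: with trust_remote_code=True, Transformers resolves the classes named in the `auto_map` entry of config.json to Python files in the repo. A minimal sketch that lists which of those files are absent follows. `missing_remote_modules` is a hypothetical helper, and the `auto_map` contents in the example are assumed for illustration, not copied from the repository.

```python
def missing_remote_modules(config, repo_files):
    """Return the .py files referenced by config["auto_map"] that are not in repo_files.

    config     -- parsed config.json as a dict
    repo_files -- iterable of file names present in the model repo
    """
    present = set(repo_files)
    missing = set()
    for ref in config.get("auto_map", {}).values():
        # An entry is a string like "module.ClassName", or a list of such strings.
        refs = ref if isinstance(ref, (list, tuple)) else [ref]
        for r in refs:
            if not r:
                continue
            # Drop an optional "other-repo--" prefix, then the class name.
            module = r.split("--")[-1].rsplit(".", 1)[0]
            filename = module + ".py"
            if filename not in present:
                missing.add(filename)
    return sorted(missing)


if __name__ == "__main__":
    # Assumed example values, for illustration only.
    example_config = {
        "auto_map": {
            "AutoConfig": "configuration_bitnet.BitNetConfig",
            "AutoModelForCausalLM": "modeling_bitnet.BitNetForCausalLM",
        }
    }
    example_files = ["config.json", "model.safetensors", "tokenizer.json"]
    print(missing_remote_modules(example_config, example_files))
    # prints ['configuration_bitnet.py', 'modeling_bitnet.py']
```

The real file list can be fetched with `huggingface_hub.list_repo_files(repo_id)` and compared against the repo's actual config.json.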
More generally, could you clarify exactly how the values are packed? Inspecting the model.safetensors file, I see the weights stored as uint8, which makes me suspect the TL2 format (3 weights packed into 5 bits), but there's no way to tell without seeing the custom loader.
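The dtype observation can be reproduced without any custom loader, because a .safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing each tensor. A minimal stdlib-only sketch follows; the helper names are hypothetical. It reports dtypes and shapes only and says nothing about how values are packed inside the uint8 tensors.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} read from a .safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }


def summarize_dtypes(path):
    """Count how many tensors use each dtype string (e.g. "U8", "BF16")."""
    counts = {}
    for dtype, _shape in read_safetensors_header(path).values():
        counts[dtype] = counts.get(dtype, 0) + 1
    return counts
```

Calling `summarize_dtypes("model.safetensors")` on a downloaded checkpoint shows how many tensors are stored as "U8" versus other dtypes, and the shapes from `read_safetensors_header` can be compared with the expected layer dimensions to estimate how many weights share one byte.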
Use this version of Transformers:
pip install git+https://github.com/shumingma/transformers.git
These artifacts can be found in the 1bitllm/bitnet-1.58b-3b files, but they do not work for this model because it uses the Llama 3 tokenizer, not the BitNet tokenizer.
I believe we are waiting for an update to the conversion code so that it converts properly with the Llama 3 tokenizer.
- Attempt 1: converted successfully, but the model does not respond correctly.
I used convert-ms-to-gguf-bitnet.py to generate the GGUF, but judging by the inference output the tokenizer is broken: "the... the... ,, the..", etc.
- Attempt 2: converted successfully, but the model does not respond correctly.
I modified convert-hf-to-gguf-bitnet.py to use BPE/gpt2. The output is broken words with no token comprehension: the model repeats its system prompt and then responds with 0.
- Attempt 3: converted successfully, but the model responds with empty spaces repeated indefinitely.
Using the .model file from the 1bitllm artifacts, which has a 32k vocab, does not work with the SPM settings for either script, even with padding enabled.
I have been able to fine-tune this model with an SFT trainer, but getting it to GGUF has been the challenge without a proper tokenizer.model or BPE vocab conversion for Llama 3.
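A quick way to see why a borrowed vocabulary fails even when padded to the right size is to compare the two token-to-id mappings directly. The sketch below uses a hypothetical helper, `compare_vocabs`, and toy vocabularies as stand-ins. It only illustrates that matching sizes is not enough if the ids point to different tokens, and it is not part of either conversion script.

```python
def compare_vocabs(vocab_a, vocab_b):
    """Compare two vocabularies given as {token_string: token_id} dicts."""
    id_to_tok_a = {i: t for t, i in vocab_a.items()}
    id_to_tok_b = {i: t for t, i in vocab_b.items()}
    shared_ids = id_to_tok_a.keys() & id_to_tok_b.keys()
    # An id "agrees" only if both vocabularies map it to the same token string.
    agreeing = sum(1 for i in shared_ids if id_to_tok_a[i] == id_to_tok_b[i])
    return {
        "size_a": len(vocab_a),
        "size_b": len(vocab_b),
        "shared_ids": len(shared_ids),
        "agreeing_ids": agreeing,
    }


if __name__ == "__main__":
    # Toy vocabularies, for illustration only.
    small = {"<s>": 0, "the": 1, "cat": 2}
    large = {"<s>": 0, "cat": 1, "the": 2, "dog": 3}
    print(compare_vocabs(small, large))
    # prints {'size_a': 3, 'size_b': 4, 'shared_ids': 3, 'agreeing_ids': 1}
```

With Transformers tokenizers, `tokenizer.get_vocab()` returns a dict of this shape, so the same comparison can be run on the tokenizer used for fine-tuning and the one embedded in the converted GGUF.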