Update README.md

by shanearora - opened Apr 25, 2024

base: refs/heads/main ← from: refs/pr/4

README.md +5 -16

@@ -24,7 +24,7 @@ We release all code, checkpoints, logs (coming soon), and details involved in tr
 OLMo 7B Instruct and OLMo SFT are two adapted versions of these models trained for better question answering.
 They show the performance gain that OLMo base models can achieve with existing fine-tuning techniques.
-*Note:* This model requires installing `ai2-olmo` with pip and using HuggingFace Transformers<=4.39. New versions of the model will be released soon with compatibility improvements.
+*Note:* This model requires installing `ai2-olmo` with pip and using `ai2-olmo`>=0.3.0 or HuggingFace Transformers<=4.39. New versions of the model will be released soon with compatibility improvements.
 ## Model Details
@@ -82,11 +82,9 @@ pip install ai2-olmo
 ```
 Now, proceed as usual with HuggingFace:
 ```python
-import hf_olmo
-from transformers import AutoModelForCausalLM, AutoTokenizer
-olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B-Instruct")
-tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct")
+from hf_olmo import OLMoForCausalLM, OLMoTokenizerFast
+olmo = OLMoForCausalLM.from_pretrained("allenai/OLMo-7B-Instruct")
+tokenizer = OLMoTokenizerFast.from_pretrained("allenai/OLMo-7B-Instruct")
 chat = [
     { "role": "user", "content": "What is language modeling?" },
 ]
@@ -99,17 +97,8 @@ response = olmo.generate(input_ids=inputs.to(olmo.device), max_new_tokens=100, d
 print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
 >> '<|user|>\nWhat is language modeling?\n<|assistant|>\nLanguage modeling is a type of natural language processing (NLP) task or machine learning task that...'
 ```
-Alternatively, with the pipeline abstraction:
-```python
-import hf_olmo
-from transformers import pipeline
-olmo_pipe = pipeline("text-generation", model="allenai/OLMo-7B-Instruct")
-print(olmo_pipe("What is language modeling?"))
->> '[{'generated_text': 'What is language modeling?\nLanguage modeling is a type of natural language processing (NLP) task...'}]'
-```
-Or, you can make this slightly faster by quantizing the model, e.g. `AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B-Instruct", torch_dtype=torch.float16, load_in_8bit=True)` (requires `bitsandbytes`).
+You can make this slightly faster by quantizing the model, e.g. `OLMoForCausalLM.from_pretrained("allenai/OLMo-7B-Instruct", torch_dtype=torch.float16, load_in_8bit=True)` (requires `bitsandbytes`).
 The quantized model is more sensitive to typing / cuda, so it is recommended to pass the inputs as `inputs.input_ids.to('cuda')` to avoid potential issues.
 Note, you may see the following error if `ai2-olmo` is not installed correctly, which is caused by internal Python check naming. We'll update the code soon to make this error clearer.
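
The updated note in this diff can be read as a version rule: `ai2-olmo` must be installed, and either `ai2-olmo` is at least 0.3.0 or Transformers is at most 4.39. A minimal sketch of that reading follows. The helper names (`parse_version`, `installed_version`, `requirement_met`) are hypothetical and not part of `ai2-olmo` or Transformers, and treating `<=4.39` as covering every 4.39.x patch release is an assumption about the note's intent.

```python
import re
from importlib import metadata


def parse_version(text, width=3):
    """Convert '4.39.3' to (4, 39, 3), padding missing parts with zeros."""
    numbers = []
    for piece in text.split(".")[:width]:
        match = re.match(r"[0-9]+", piece)
        if match is None:
            break  # stop at non-numeric parts such as 'dev0'
        numbers.append(int(match.group()))
    numbers += [0] * (width - len(numbers))
    return tuple(numbers)


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def requirement_met(olmo_version, transformers_version):
    """Hypothetical check mirroring the updated README note."""
    if olmo_version is None:
        return False  # the note requires ai2-olmo in every case
    if parse_version(olmo_version) >= (0, 3, 0):
        return True  # a new enough ai2-olmo lifts the Transformers cap
    if transformers_version is None:
        return False
    # Assumption: "<=4.39" is meant to include all 4.39.x patch releases.
    return parse_version(transformers_version)[:2] <= (4, 39)


if __name__ == "__main__":
    ok = requirement_met(
        installed_version("ai2-olmo"),
        installed_version("transformers"),
    )
    print("OLMo-7B-Instruct requirements met:", ok)
```

Under these assumptions, `requirement_met("0.2.5", "4.40.0")` returns `False`, while `requirement_met("0.3.0", "4.40.0")` returns `True`.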