Instructions to use ibm-granite/granite-vision-3.1-2b-preview with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ibm-granite/granite-vision-3.1-2b-preview with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ibm-granite/granite-vision-3.1-2b-preview")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ibm-granite/granite-vision-3.1-2b-preview")
model = AutoModelForImageTextToText.from_pretrained("ibm-granite/granite-vision-3.1-2b-preview")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ibm-granite/granite-vision-3.1-2b-preview with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ibm-granite/granite-vision-3.1-2b-preview"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ibm-granite/granite-vision-3.1-2b-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
Use Docker
docker model run hf.co/ibm-granite/granite-vision-3.1-2b-preview
- SGLang
How to use ibm-granite/granite-vision-3.1-2b-preview with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ibm-granite/granite-vision-3.1-2b-preview" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ibm-granite/granite-vision-3.1-2b-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ibm-granite/granite-vision-3.1-2b-preview" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ibm-granite/granite-vision-3.1-2b-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
- Docker Model Runner
How to use ibm-granite/granite-vision-3.1-2b-preview with Docker Model Runner:
docker model run hf.co/ibm-granite/granite-vision-3.1-2b-preview
Use Negative Feature Layer Indices
There is a misalignment between transformers and vLLM in the feature layers that are currently being used: the current values are correct for vLLM and off by one for transformers. In transformers, the first entry of the hidden states is the input embedding. In vLLM, however, the hidden states pool is formed without the embedding.
In other words, the hidden states array in transformers contains 28 entries: [emb, h0, h1, ..., h26]
while the hidden states pool in vLLM contains 27 entries: [h0, h1, ..., h26]
The config reflects the correct values for what is used in vLLM, but is off by one in transformers. Both projects support negative indexing into the hidden states (with offset handling in vLLM, since it only loads the encoder up to the deepest feature layer needed). This PR changes the vision feature layers to use negative indices, which fixes the misalignment in transformers without changing the output in vLLM (no code changes needed).
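To make the off-by-one concrete, here is a minimal sketch in plain Python. It assumes a 27-layer vision encoder, so that the two containers have the 28 and 27 entries quoted above; the lists of strings stand in for real tensors, and `select` is a hypothetical helper rather than anything from transformers or vLLM. The positive indices are derived from the negative ones purely for illustration.

```python
NUM_LAYERS = 27  # assumption: encoder depth implied by the 28 / 27 entry counts

# transformers-style hidden states: input embedding first, then one entry per layer
hf_states = ["emb"] + [f"h{i}" for i in range(NUM_LAYERS)]
# vLLM-style hidden states pool: layer outputs only, no embedding entry
vllm_pool = [f"h{i}" for i in range(NUM_LAYERS)]


def select(states, indices):
    """Pick the entries at the given (positive or negative) indices."""
    return [states[i] for i in indices]


# Negative indices count from the end, so the leading "emb" entry does not matter.
negative = [-24, -20, -12, -1]
# The same layers written as positive indices into the vLLM pool (illustrative only).
positive_for_pool = [len(vllm_pool) + i for i in negative]

print(select(vllm_pool, positive_for_pool))  # the layers vLLM picks
print(select(hf_states, positive_for_pool))  # same indices land one layer earlier
print(select(hf_states, negative))           # negative indices agree with vLLM
```

With positive indices the transformers-style list returns entries one layer earlier than the vLLM-style pool, while the negative indices select the same layers from both.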
I will also submit a PR to vLLM to add the embeddings to the hidden state pool if all hidden states are requested from the visual encoder.
Thanks @abrooks9944
Now that this PR is merged, the model uses negative feature layers. For reference, here is the relevant vLLM PR for positive feature layers: https://github.com/vllm-project/vllm/pull/13514
@abrooks9944
So future models can also use positive features?
Do we need to update anything in this model?
Hey @aarbelle - yes, future releases of vLLM will align with transformers, and the correct positive feature layers ([4, 8, 16, 27]) can then be used to get the same output as the corresponding negative layers ([-24, -20, -12, -1]).
However, I would still suggest using negative feature layers so that the outputs are consistent if the model is used with earlier versions of vLLM. The PR above also fixes a bug in vLLM that causes positive feature layers to load one more layer than needed to get the deepest feature, which throws if the last layer is requested (as it is for this model), so keeping the indices negative ensures the model works correctly with older vLLM versions!
Also no, nothing needs to be changed in the model :)
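For anyone who wants to check the mapping between the two index sets quoted above, a small sketch follows. `to_negative` is a hypothetical helper, not part of transformers or vLLM, and the length of 28 assumes the transformers layout with the input embedding included.

```python
def to_negative(layers, num_hidden_states):
    """Convert positive hidden-state indices to their negative equivalents."""
    converted = []
    for idx in layers:
        if not 0 <= idx < num_hidden_states:
            raise IndexError(f"layer index {idx} out of range for {num_hidden_states} entries")
        # Python-style negative index that refers to the same entry
        converted.append(idx - num_hidden_states)
    return converted


# Positive layers from the thread, against 28 hidden states (embedding + layers)
print(to_negative([4, 8, 16, 27], 28))
```

Under those assumptions the result matches the negative layers given in the reply.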