[Question] Why does LLaVA evaluation assert batch_size == 1 for benchmark inference?
Hi Hugging Face team,
We are evaluating llava-hf/llava-1.5-7b-hf on MME and found a small but reproducible batch-size-dependent variance in greedy generation. We also run MMBench, POPE, TextVQA, ai2d, MMStar, GQA, ChartQA, RefCoCo, and DocVQA, but the MME Cognition score is affected the most, so we focused our experiments on it.
We are opening this here because our current implementation uses the Hugging Face LlavaForConditionalGeneration / processor path, not the original Liuhaotian LLaVA runtime. We would really appreciate your help investigating whether this is expected numerical behavior, a batching/padding edge case in the HF LLaVA implementation, or something we should configure differently.
Setup
- Model: llava-hf/llava-1.5-7b-hf
- Benchmark: MME
- Decoding: greedy
- do_sample=False
- num_beams=1
- temperature=0
- Precision: fp16
- Padding: left padding
- Comparison: batch size 1 vs batch size 16
- Task format: yes/no MME questions
Observed Difference
Most predictions match exactly between batch size 1 and batch size 16. However, two MME rows consistently flip:
1. commonsense_reasoning/0042/0
Question:
Are there usually cars in the area shown in the picture? Please answer yes or no.
Gold: yes
- Batch size 1: Yes
- Batch size 16: No
2. numerical_calculation/0014/0
Question:
Is the answer to the arithmetic question in the image 200? Please answer yes or no.
Gold: yes
- Batch size 1: Yes
- Batch size 16: No
This changes the MME score slightly:
- Batch size 1: MME-C ~350.36; total MME ~1859.87
- Batch size 16: MME-C ~340.71; total MME ~1850.23
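As a rough sanity check, the size of this drop is what two flipped answers would produce under the usual MME scoring. The sketch below assumes that each subtask score is accuracy plus accuracy+ (both in percent, two questions per image), that commonsense_reasoning has 70 images and numerical_calculation has 20 (these sizes are not stated in this post; they are taken from the MME benchmark description), and that the paired question for each affected image was answered correctly.

```python
def flip_penalty(num_images, questions_per_image=2):
    """Score lost in one MME subtask when a single correct answer becomes wrong.

    Assumes score = accuracy + accuracy+ (both in percent) and that the other
    question for the same image was answered correctly.
    """
    num_questions = num_images * questions_per_image
    accuracy_drop = 100.0 / num_questions   # one question out of all questions
    accuracy_plus_drop = 100.0 / num_images  # one image no longer fully correct
    return accuracy_drop + accuracy_plus_drop

# Subtask sizes are assumptions taken from the MME benchmark description.
commonsense_drop = flip_penalty(num_images=70)
numerical_drop = flip_penalty(num_images=20)
total_drop = commonsense_drop + numerical_drop

print(round(commonsense_drop, 2), round(numerical_drop, 2), round(total_drop, 2))
```

Under these assumptions the combined drop is about 9.64 points, which lines up with the reported gap between the two batch sizes.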
Diagnostics We Ran
We built a focused diagnostic harness for the two flipped rows instead of rerunning the full benchmark each time.
We tested:
- single-row input
- duplicate batch
- the two flipped rows together
- the exact real batch-16 context from the benchmark run
- padding stress batches with heterogeneous sequence/image contexts
We also tested several generation/runtime settings:
- default fp16
- fp16 + explicit position_ids
- fp16 + eager attention
- fp16 + eager attention + explicit position_ids
- fp16 + use_cache=False
- fp32 diagnostic run
For explicit position_ids, we computed them from the attention mask:
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 0)
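To make the effect of these two lines concrete, here is a small pure-Python stand-in that mirrors the cumulative sum and the masked fill on nested lists. The function name and the example mask are illustrative assumptions, not part of our harness; the point is only that with left padding the first real token of every row gets position 0.

```python
from itertools import accumulate

def position_ids_from_mask(attention_mask):
    """List-based equivalent of cumsum(-1) - 1 followed by masked_fill(mask == 0, 0)."""
    rows = []
    for mask_row in attention_mask:
        running_counts = accumulate(mask_row)  # cumulative sum along the sequence
        rows.append(
            [count - 1 if keep else 0 for count, keep in zip(running_counts, mask_row)]
        )
    return rows

# Hypothetical left-padded batch: row 0 has two pad positions, row 1 has none.
left_padded_mask = [
    [0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
print(position_ids_from_mask(left_padded_mask))
# [[0, 0, 0, 1, 2], [0, 1, 2, 3, 4]]
```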
Our findings:
- Explicit position_ids did not fix the real batch-16 flip.
- Eager attention did not fix the real batch-16 flip.
- Eager attention + explicit position_ids did not fix it.
- use_cache=False did not fix it.
- fp32 was stable across the tested batch contexts.
- The fp16 first-token Yes/No logits are extremely close, so small batch-shape/kernel differences appear sufficient to change the argmax.
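One way to act on the last finding is to flag rows whose first-token Yes/No margin is small enough that tiny numerical differences could change the answer. The sketch below is a hypothetical helper, not part of our harness: the function name, the threshold, and the logit values are all made up for illustration.

```python
def near_tie_rows(first_token_logits, margin_threshold=0.05):
    """Return (row_id, margin) for rows whose Yes/No logit gap is below the threshold."""
    flagged = []
    for row_id, (yes_logit, no_logit) in first_token_logits.items():
        margin = abs(yes_logit - no_logit)
        if margin < margin_threshold:
            flagged.append((row_id, margin))
    return flagged

# Made-up logits for illustration only; these are not measured values.
example_logits = {
    "row_a": (17.31, 17.30),  # near tie: the argmax is fragile
    "row_b": (21.00, 15.50),  # clear margin: the argmax is robust
}
print(near_tie_rows(example_logits))
```

Rows flagged this way are the ones worth re-checking at a different batch size or precision. The threshold is a judgment call and would need tuning per model.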
We also checked the inputs after removing left padding. For the target row, the effective input_ids, position_ids, and pixel_values are identical between the single-row input and the real batch-16 input. That makes us suspect this is not a prompt construction or dataset ordering bug, but likely numerical drift from fp16 batched computation or a model/generation implementation detail.
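The padding-removal comparison can be sketched as follows. This is a simplified illustration on plain lists with made-up token ids and a hypothetical pad id; a real check would apply the same idea to the tensors for input_ids and position_ids.

```python
def strip_left_padding(values, attention_mask):
    """Keep only the positions where the attention mask is 1."""
    return [value for value, keep in zip(values, attention_mask) if keep]

PAD = 0  # hypothetical pad token id, for illustration only

# The same made-up prompt, once alone and once as a left-padded row of a larger batch.
single_ids = [101, 7, 8, 9]
single_mask = [1, 1, 1, 1]
batched_ids = [PAD, PAD, 101, 7, 8, 9]
batched_mask = [0, 0, 1, 1, 1, 1]

inputs_match = (
    strip_left_padding(single_ids, single_mask)
    == strip_left_padding(batched_ids, batched_mask)
)
print(inputs_match)
```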
Questions
1. Is batch-size-dependent output variance expected for LlavaForConditionalGeneration under fp16 greedy generation?
2. Did the Hugging Face team observe similar behavior when validating llava-hf/llava-1.5-7b-hf against the original LLaVA results?
3. For reproducing reported LLaVA benchmark numbers, do you recommend evaluating with batch size 1, or should batched evaluation be considered valid?
4. Are there recommended settings for stable batched evaluation with HF LLaVA models?
- bf16 instead of fp16?
- fp32 for benchmark reporting?
- eager attention?
- explicit position_ids?
- setting explicit tokenizer path?
Would the HF team be able to allocate some time to investigate this? If useful, we can provide examples, exact prompts, diagnostic logs, and a minimal reproduction harness for all the benchmarks we are running.
This matters for us because we are comparing full-token and compressed-token variants, and we need benchmark scores to be stable across batch sizes. Running all evaluations at batch size 1 is possible but very expensive, and it is not ideal if the expected HF behavior is that batched greedy generation should be equivalent up to tiny numerical differences.
Tagging teammates for context: @ATATC @charlesgchen
Thanks.
Hey @austinkomachi, thanks for the detailed report!
This is expected behavior rather than a bug in the HF implementation. The numerical variance might come from two sources:
1. Batched matmul numerical differences: PyTorch uses different CUDA kernels for a single-row input vs a batched one, which leads to tiny floating-point differences in the output logits. This is not specific to LLaVA, and we can't "fix" it to be deterministic.
2. KV cache: we use key-value caching by default, which can also introduce small numerical drift. You can try disabling it (use_cache=False) to see if it shifts things, though it won't eliminate the batching effect entirely.
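A toy illustration of why kernel choice matters: floating-point addition is not associative, so summing the same terms in a different order can round differently in fp16. The sketch below simulates fp16 rounding with the standard library and uses made-up numbers chosen to exaggerate the effect. It does not model how CUDA kernels actually schedule their reductions, and the values are not real logits.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

def fp16_sum(terms):
    """Accumulate left to right, rounding to fp16 after every addition."""
    total = 0.0
    for term in terms:
        total = to_fp16(total + to_fp16(term))
    return total

# Made-up contributions to a "Yes" logit and a fixed competing "No" logit.
yes_terms = [1024.0] + [0.375] * 6
no_logit = 1025.0

large_first = fp16_sum(yes_terms)                   # small terms are rounded away
small_first = fp16_sum(list(reversed(yes_terms)))   # small terms add up before rounding

for order, yes_logit in (("large first", large_first), ("small first", small_first)):
    answer = "Yes" if yes_logit > no_logit else "No"
    print(order, yes_logit, answer)
```

The same seven terms give 1024.0 in one order and 1026.0 in the other, which is enough to flip the argmax against the competing logit. Real differences between kernels are far smaller, but they can still matter when two logits are nearly tied.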
That said, it's not really possible to guarantee bit-exact outputs across different batch sizes in fp16. This is a general limitation of batched inference, not something specific to our implementation.
I am not 100% sure what the standard for benchmarking consistency is among researchers. Always evaluating at batch size 1 would make the output reproducible/deterministic, but it may be better to ask in the MME benchmark repo in case the authors recommend a certain fixed batch size.