Instructions to use llava-hf/llava-1.5-7b-hf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use llava-hf/llava-1.5-7b-hf with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use llava-hf/llava-1.5-7b-hf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "llava-hf/llava-1.5-7b-hf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llava-hf/llava-1.5-7b-hf",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/llava-hf/llava-1.5-7b-hf

SGLang

How to use llava-hf/llava-1.5-7b-hf with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "llava-hf/llava-1.5-7b-hf" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llava-hf/llava-1.5-7b-hf",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "llava-hf/llava-1.5-7b-hf" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llava-hf/llava-1.5-7b-hf",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use llava-hf/llava-1.5-7b-hf with Docker Model Runner:
```
docker model run hf.co/llava-hf/llava-1.5-7b-hf
```

[Question] Why does LLaVA evaluation assert batch_size == 1 for benchmark inference?

#62

by austinkomachi - opened 11 days ago

Discussion

austinkomachi, 11 days ago (edited 11 days ago)

Hi Hugging Face team,

We are evaluating llava-hf/llava-1.5-7b-hf on MME and found a small but reproducible batch-size-dependent variance in greedy generation. We also run MMBench, POPE, TextVQA, AI2D, MMStar, GQA, ChartQA, RefCOCO, and DocVQA, but the MME Cognition score is affected the most, so we focused our experiments on it.

We are opening this here because our current implementation uses the Hugging Face LlavaForConditionalGeneration / processor path, not the original Liuhaotian LLaVA runtime. We would really appreciate your help investigating whether this is expected numerical behavior, a batching/padding edge case in the HF LLaVA implementation, or something we should configure differently.

Setup

Model: llava-hf/llava-1.5-7b-hf
Benchmark: MME
Decoding: greedy
- do_sample=False
- num_beams=1
- temperature=0
Precision: fp16
Padding: left padding
Comparison: batch size 1 vs batch size 16
Task format: yes/no MME questions

Observed Difference

Most predictions match exactly between batch size 1 and batch size 16. However, two MME rows consistently flip:

1. commonsense_reasoning/0042/0

Question:

Are there usually cars in the area shown in the picture? Please answer yes or no.
Gold: yes

Batch size 1: Yes
Batch size 16: No

2. numerical_calculation/0014/0

Question:

Is the answer to the arithmetic question in the image 200? Please answer yes or no.
Gold: yes

Batch size 1: Yes
Batch size 16: No

This changes the MME score slightly:

- batch size 1: MME-C ~350.36, total MME ~1859.87
- batch size 16: MME-C ~340.71, total MME ~1850.23
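
A rough consistency check on these numbers, under assumptions not stated elsewhere in this post: MME scores each subtask as accuracy plus accuracy+ (both in percent, with accuracy+ counting images where both questions are right), commonsense_reasoning has 70 images and numerical_calculation has 20, and the paired question on each affected image is answered correctly in both runs. The helper name below is only for illustration.

```python
# Assumed MME subtask sizes (images); each image has two yes/no questions.
IMAGES = {"commonsense_reasoning": 70, "numerical_calculation": 20}

def score_drop_for_one_flip(num_images: int) -> float:
    """Points lost when one correct answer becomes wrong on one image.

    Assumes the other question on that image stays correct, so the image
    also stops counting toward accuracy+.
    """
    acc_drop = 100.0 / (2 * num_images)  # one of 2*N questions flips
    acc_plus_drop = 100.0 / num_images   # one of N images loses "both correct"
    return acc_drop + acc_plus_drop

drops = {name: score_drop_for_one_flip(n) for name, n in IMAGES.items()}
total_drop = sum(drops.values())
print(drops, round(total_drop, 2))
```

Under these assumptions the two flips account for about 2.14 + 7.5 = 9.64 points, which is in line with the gap between the two runs.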

Diagnostics We Ran

We built a focused diagnostic harness for the two flipped rows instead of rerunning the full benchmark each time.

We tested:

- single-row input
- duplicate batch
- the two flipped rows together
- the exact real batch-16 context from the benchmark run
- padding stress batches with heterogeneous sequence/image contexts

We also tested several generation/runtime settings:

- default fp16
- fp16 + explicit position_ids
- fp16 + eager attention
- fp16 + eager attention + explicit position_ids
- fp16 + use_cache=False
- fp32 diagnostic run

For explicit position_ids, we computed them from the attention mask:

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 0)
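
As a quick sanity check of those two lines, here is a minimal sketch on a toy left-padded batch. The mask values are made up for illustration; only the `cumsum` and `masked_fill` lines mirror what we ran.

```python
import torch

# Toy left-padded attention mask (hypothetical values): 0 = padding, 1 = real token.
attention_mask = torch.tensor([
    [0, 0, 1, 1, 1],  # row padded by two positions on the left
    [1, 1, 1, 1, 1],  # row with no padding
])

# Count real tokens seen so far, shift to 0-based positions,
# then set the positions of padding tokens to 0.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 0)

print(position_ids.tolist())
```

For the padded row this gives positions 0, 0, 0, 1, 2, so the first real token starts at position 0 just as it would in an unpadded single-row input.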

Our findings:

- Explicit position_ids did not fix the real batch-16 flip.
- Eager attention did not fix the real batch-16 flip.
- Eager attention + explicit position_ids did not fix it.
- use_cache=False did not fix it.
- fp32 was stable across the tested batch contexts.
- The fp16 first-token Yes/No logits are extremely close, so small batch-shape/kernel differences appear sufficient to change the argmax.
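
One way to surface such near-ties is to record the first-step Yes/No logit margin per row and flag the small ones. A minimal sketch, with made-up logit values, placeholder row ids, and a hypothetical threshold; in a real run the two logits would be read from the first generation step at the Yes and No token ids.

```python
def flag_near_ties(first_step_logits, threshold=0.05):
    """Return row ids whose |Yes - No| logit margin is below `threshold`.

    `first_step_logits` maps a row id to a (yes_logit, no_logit) pair.
    """
    flagged = []
    for row_id, (yes_logit, no_logit) in first_step_logits.items():
        if abs(yes_logit - no_logit) < threshold:
            flagged.append(row_id)
    return flagged

# Made-up values for illustration only, not measurements from our runs.
example = {
    "row_a": (17.02, 17.00),  # near-tie: a tiny perturbation can flip the argmax
    "row_b": (21.50, 14.25),  # clear margin: robust to small numerical noise
}
print(flag_near_ties(example))
```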

We also checked the inputs after removing left padding. For the target row, the effective input_ids, position_ids, and pixel_values are identical between the single-row input and the real batch-16 input. That makes us suspect this is not a prompt construction or dataset ordering bug, but likely numerical drift from fp16 batched computation or a model/generation implementation detail.
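
The padding check can be sketched roughly as follows. The tensors are toy stand-ins (real inputs come from the processor), the pad id is an assumption, and `strip_left_padding` is a hypothetical helper name used only for this illustration.

```python
import torch

def strip_left_padding(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Return only the real (unmasked) tokens of a single row."""
    return input_ids[attention_mask.bool()]

PAD = 0  # hypothetical pad token id

# The same row, once on its own and once inside a left-padded batch.
single_ids = torch.tensor([[11, 12, 13]])
single_mask = torch.ones_like(single_ids)

batch_ids = torch.tensor([[PAD, PAD, 11, 12, 13],
                          [21, 22, 23, 24, 25]])
batch_mask = (batch_ids != PAD).long()

# Compare the target row's effective tokens across the two contexts.
same_tokens = torch.equal(
    strip_left_padding(single_ids[0], single_mask[0]),
    strip_left_padding(batch_ids[0], batch_mask[0]),
)
print(same_tokens)
```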

Questions

Is batch-size-dependent output variance expected for LlavaForConditionalGeneration under fp16 greedy generation?
Did the Hugging Face team observe similar behavior when validating llava-hf/llava-1.5-7b-hf against the original LLaVA results?
For reproducing reported LLaVA benchmark numbers, do you recommend evaluating with batch size 1, or should batched evaluation be considered valid?
Are there recommended settings for stable batched evaluation with HF LLaVA models?
- bf16 instead of fp16?
- fp32 for benchmark reporting?
- eager attention?
- explicit position_ids?
- setting an explicit tokenizer path?
Would the HF team be able to allocate some time to investigate this? We can provide examples, exact prompts, diagnostic logs, and a minimal reproduction harness for all the benchmarks we are running, if that would be helpful.

This matters for us because we are comparing full-token and compressed-token variants, and we need benchmark scores to be stable across batch sizes. Running all evaluations at batch size 1 is possible but very expensive, and it is not ideal if the expected HF behavior is that batched greedy generation should be equivalent up to tiny numerical differences.

Tagging teammates for context: @ATATC @charlesgchen

Thanks.

RaushanTurganbay (Llava Hugging Face org), 10 days ago

Hey @austinkomachi, thanks for the detailed report!

This is expected behavior rather than a bug in the HF implementation. The numerical variance might come from two sources:

- Batched matmul numerical differences: PyTorch uses different CUDA kernels for a single-row input vs a batched one, which leads to tiny floating-point differences in the output logits. This is not specific to LLaVA, and we can't "fix" it to be deterministic.
- KV cache: we use key-value caching by default, which can also introduce small numerical drift. You can try disabling it (use_cache=False) to see if it shifts things, though it won't eliminate the batching effect entirely.

That said, it's not really possible to guarantee bit-exact outputs across different batch sizes in fp16. This is a general limitation of batched inference, not something specific to our implementation.
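
To illustrate how operation order alone can change an fp16 result, here is a toy sketch (it does not reproduce what the actual CUDA kernels do). The same five numbers are summed in two different orders with an fp16 accumulator. The `to_fp16` and `fp16_sum` helpers and all values are made up for illustration; rounding uses the half-precision format of Python's standard-library `struct` module.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum(terms):
    """Sum left to right, rounding the accumulator to fp16 after every add."""
    acc = 0.0
    for t in terms:
        acc = to_fp16(acc + to_fp16(t))
    return acc

# Hypothetical contributions to a "Yes" logit; near 1024 the fp16 spacing is 1.0.
terms = [1024.0, 0.5, 0.5, 0.5, 0.5]
yes_order_a = fp16_sum(terms)            # large term first: each 0.5 is rounded away
yes_order_b = fp16_sum(reversed(terms))  # small terms first: they add up to 2.0
no_logit = 1025.0                        # hypothetical competing "No" logit

for name, yes in [("order A", yes_order_a), ("order B", yes_order_b)]:
    print(name, yes, "Yes" if yes > no_logit else "No")
```

The two orders land on opposite sides of the competing logit, so the greedy choice flips even though the inputs are identical. With a wider accumulator both orders give the same sum, which fits the report that fp32 was stable.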

I am not 100% sure what the standard for benchmarking consistency is among researchers. Always evaluating at batch size 1 would make the output reproducible/deterministic, but it may be better to ask in the MME benchmark repo in case the authors recommend a certain fixed batch size.
