[Question] Why does LLaVA evaluation assert batch_size == 1 for benchmark inference?

#62
by austinkomachi - opened

Hi Hugging Face team,

We are evaluating llava-hf/llava-1.5-7b-hf on MME and found a small but reproducible batch-size-dependent variance in greedy generation. We also run MMBench, POPE, TextVQA, AI2D, MMStar, GQA, ChartQA, RefCOCO, and DocVQA, but the MME Cognition score is affected the most, so we focused our experiments on it.

We are opening this here because our current implementation uses the Hugging Face LlavaForConditionalGeneration / processor path, not the original Liuhaotian LLaVA runtime. We would really appreciate your help investigating whether this is expected numerical behavior, a batching/padding edge case in the HF LLaVA implementation, or something we should configure differently.

Setup

  • Model: llava-hf/llava-1.5-7b-hf
  • Benchmark: MME
  • Decoding: greedy
    • do_sample=False
    • num_beams=1
    • temperature=0
  • Precision: fp16
  • Padding: left padding
  • Comparison: batch size 1 vs batch size 16
  • Task format: yes/no MME questions

Observed Difference

Most predictions match exactly between batch size 1 and batch size 16. However, two MME rows consistently flip:

1. commonsense_reasoning/0042/0

Question:

Are there usually cars in the area shown in the picture? Please answer yes or no.
Gold: yes

  • Batch size 1: Yes
  • Batch size 16: No

2. numerical_calculation/0014/0

Question:

Is the answer to the arithmetic question in the image 200? Please answer yes or no.
Gold: yes

  • Batch size 1: Yes
  • Batch size 16: No

This changes the MME score slightly:

  • Batch size 1: MME-C ~350.36; total MME ~1859.87
  • Batch size 16: MME-C ~340.71; total MME ~1850.23
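
For anyone who wants to locate such flips in their own runs, a minimal sketch of the comparison follows. It assumes each run's predictions are stored as a dict keyed by row ID. The helper name `find_flips` and the third row ID are hypothetical and only there for illustration; the first two rows are the ones reported above.

```python
def find_flips(preds_a, preds_b):
    """Return sorted row IDs whose normalized predictions differ between two runs."""
    shared = preds_a.keys() & preds_b.keys()
    return sorted(
        row for row in shared
        if preds_a[row].strip().lower() != preds_b[row].strip().lower()
    )

# Predictions per row ID. The last row is a hypothetical unchanged example.
bs1 = {
    "commonsense_reasoning/0042/0": "Yes",
    "numerical_calculation/0014/0": "Yes",
    "hypothetical/unchanged/0": "No",
}
bs16 = {
    "commonsense_reasoning/0042/0": "No",
    "numerical_calculation/0014/0": "No",
    "hypothetical/unchanged/0": "No",
}

flips = find_flips(bs1, bs16)
print(flips)
```

Normalizing case and whitespace before comparing avoids counting "Yes" vs "yes" as a flip.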

Diagnostics We Ran

We built a focused diagnostic harness for the two flipped rows instead of rerunning the full benchmark each time.

We tested:

  • single-row input
  • duplicate batch
  • the two flipped rows together
  • the exact real batch-16 context from the benchmark run
  • padding stress batches with heterogeneous sequence/image contexts

We also tested several generation/runtime settings:

  • default fp16
  • fp16 + explicit position_ids
  • fp16 + eager attention
  • fp16 + eager attention + explicit position_ids
  • fp16 + use_cache=False
  • fp32 diagnostic run

For explicit position_ids, we computed them from the attention mask:

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 0)
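
To make the effect of those two lines concrete, here is a torch-free sketch of the same computation on a single row. The function name `position_ids_from_mask` is hypothetical. It mirrors the cumulative sum minus one, followed by zeroing the padded slots, and shows that the real tokens of a left-padded row get the same positions as the unpadded row.

```python
from itertools import accumulate

def position_ids_from_mask(attention_mask):
    """Per-row equivalent of cumsum(-1) - 1 followed by masked_fill(mask == 0, 0)."""
    running = accumulate(attention_mask)  # inclusive cumulative sum of the mask
    return [count - 1 if keep else 0 for count, keep in zip(running, attention_mask)]

left_padded = [0, 0, 1, 1, 1]  # two pad slots, then three real tokens
unpadded = [1, 1, 1]

print(position_ids_from_mask(left_padded))  # [0, 0, 0, 1, 2]
print(position_ids_from_mask(unpadded))     # [0, 1, 2]
```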

Our findings:

  • Explicit position_ids did not fix the real batch-16 flip.
  • Eager attention did not fix the real batch-16 flip.
  • Eager attention + explicit position_ids did not fix it.
  • use_cache=False did not fix it.
  • fp32 was stable across the tested batch contexts.
  • The fp16 first-token Yes/No logits are extremely close, so small batch-shape/kernel differences appear sufficient to change the argmax.
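
The last point can be illustrated with a toy example that does not involve the model at all. The sketch below emulates fp16 rounding with Python's `struct` half-precision format and sums the same terms in two different orders, standing in for two kernels that reduce in different orders. All numbers are made up for illustration and are not real LLaVA logits.

```python
import struct

def fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum(values):
    """Accumulate left to right, rounding to fp16 after every addition."""
    total = 0.0
    for v in values:
        total = fp16(total + v)
    return total

# Toy terms contributing to the "Yes" logit, and a fixed toy "No" logit.
yes_terms = [1024.0] + [0.25] * 8
no_logit = 1025.0

yes_large_first = fp16_sum(yes_terms)            # small terms are rounded away
yes_small_first = fp16_sum(reversed(yes_terms))  # small terms add up first

for name, yes_logit in [("large first", yes_large_first),
                        ("small first", yes_small_first)]:
    answer = "Yes" if yes_logit > no_logit else "No"
    print(name, yes_logit, answer)
```

In exact arithmetic both orders give 1026. In fp16 the spacing between representable values near 1024 is 1, so adding 0.25 at a time to 1024 is lost, and the greedy answer flips depending only on the summation order.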

We also checked the inputs after removing left padding. For the target row, the effective input_ids, position_ids, and pixel_values are identical between the single-row input and the real batch-16 input. That makes us suspect this is not a prompt construction or dataset ordering bug, but likely numerical drift from fp16 batched computation or a model/generation implementation detail.
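
The padding check described above can be sketched as follows, assuming plain Python lists for one row's `input_ids` and `attention_mask`. The helper name `strip_padding`, the pad ID 0, and the token IDs are hypothetical placeholders, not values from the actual run.

```python
def strip_padding(input_ids, attention_mask):
    """Keep only the tokens whose attention mask entry is nonzero."""
    return [tok for tok, keep in zip(input_ids, attention_mask) if keep]

# Hypothetical token IDs: the same prompt alone and inside a left-padded batch.
single_ids, single_mask = [1, 7, 8, 9], [1, 1, 1, 1]
batched_ids, batched_mask = [0, 0, 0, 1, 7, 8, 9], [0, 0, 0, 1, 1, 1, 1]

same_tokens = strip_padding(single_ids, single_mask) == strip_padding(batched_ids, batched_mask)
print(same_tokens)
```

The same filter can be applied to `position_ids` to confirm that the effective positions match as well.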

Questions

  1. Is batch-size-dependent output variance expected for LlavaForConditionalGeneration under fp16 greedy generation?

  2. Did the Hugging Face team observe similar behavior when validating llava-hf/llava-1.5-7b-hf against the original LLaVA results?

  3. For reproducing reported LLaVA benchmark numbers, do you recommend evaluating with batch size 1, or should batched evaluation be considered valid?

  4. Are there recommended settings for stable batched evaluation with HF LLaVA models?

    • bf16 instead of fp16?
    • fp32 for benchmark reporting?
    • eager attention?
    • explicit position_ids?
    • setting explicit tokenizer path?
  5. Would the HF team be able to allocate some time to investigate this? If useful, we can provide examples, exact prompts, diagnostic logs, and a minimal reproduction harness for all the benchmarks we are running.

This matters for us because we are comparing full-token and compressed-token variants, and we need benchmark scores to be stable across batch sizes. Running all evaluations at batch size 1 is possible but very expensive, and it is not ideal if the expected HF behavior is that batched greedy generation should be equivalent up to tiny numerical differences.

Tagging teammates for context: @ATATC @charlesgchen

Thanks.

Llava Hugging Face org

Hey @austinkomachi, thanks for the detailed report!

This is expected behavior rather than a bug in the HF implementation. The numerical variance might come from two sources:

  1. Batched matmul numerical differences: PyTorch uses different CUDA kernels for a single-row input vs a batched one, which leads to tiny floating point differences in the output logits. This is not specific to LLaVA and we can't "fix" it to be deterministic.

  2. KV cache: we use key-value caching by default, which can also introduce small numerical drift. You can try disabling it (use_cache=False) to see if it shifts things, though it won't eliminate the batching effect entirely.

That said, it's not really possible to guarantee bit-exact outputs across different batch sizes in fp16. This is a general limitation of batched inference, not something specific to our implementation.

I am not 100% sure what the standard for benchmarking consistency is among researchers. Always evaluating at batch size 1 would make the output reproducible/deterministic, but you may want to ask in the MME benchmark repo in case the authors recommend a specific fixed batch size.
