MP3 Bitrate Sensitivity in Audio-Multimodal LLMs: A Per-Model Evaluation

Community Article Published April 17, 2026

header

4

Effective speech-to-text and transcription applications rely heavily upon fast inference and minimal latency.

Two approaches are traditionally considere:

The first is running inference locally, which has the disadvantage of requiring expensive hardware for optimal performance, or alternatively, forcing users to use less ideal and powerful models.

This challenge is compounded in the audio-multimodal environment, where models are often significantly larger than traditional ASR models.

For those building transcription utilities with audio and text tokens being processed for transcription jobs in the cloud, inference speed becomes more challenging to architect.

One way to minimize that is effective system prompting - providing deterministic formatting instructions as turgidly as possible. The other is technical:providing audio to the API endpoint at the lowest possible bit rate.

The rationale: the smaller gap between what's supplied to the inference pipeline and what the inference pipeline will normalise to anyway, downsampling can be eliminated and uploads are faster. If WER isn't affected, it's a win-win.

But first: What's wrong with Whisper!? (Spoiler: nothing!)

Conventional production transcription pipelines are two-stage:

  • Stage one is an ASR system — Whisper, Deepgram, AssemblyAI, or similar — which emits a token-level hypothesis, typically raw and unpunctuated in the sense that matters for downstream consumption: filler words intact, run-on segments, no paragraph structure, no normalized casing for proper nouns.

  • Stage two is a text-only LLM that takes the ASR output and applies editorial cleanup: strips hesitations, corrects homophones, applies punctuation, segments paragraphs, fixes the occasional misheard word via semantic context. Both stages are independently tunable and independently chargeable.

Audio-multimodal LLMs collapse that pipeline to a single pass:

The same model that consumes the audio tokens also emits the cleaned, formatted text — the ASR decoder and the cleanup LLM are the same network. For dictation-style workloads this has concrete advantages: one API call instead of two, one prompt instead of two, roughly half the latency, and — critically — the cleanup stage has access to acoustic features (prosody, disfluency patterns, speaker emphasis) that a two-stage pipeline throws away at the ASR boundary.

The trade-off is that these models are newer, their behavior is less characterized than Whisper's, and their audio encoders are not public. Decisions that are well-established for ASR systems — sample rate, codec, bitrate — are open questions for audio-multimodal. This eval addresses one such open question.

Question

How sensitive is transcription accuracy to MP3 bitrate across the audio-input models currently accessible via OpenRouter, and how do those models differ from each other on accuracy, latency, and instruction adherence?

Setup

  • 12 models — every audio-capable model in OpenRouter's catalog as of April 2026: Gemini 2.0 Flash/Flash-Lite, Gemini 2.5 Flash/Flash-Lite/Pro, Gemini 3 Flash Preview, Gemini 3.1 Flash Lite Preview, GPT-Audio, GPT-Audio-Mini, GPT-4o-Audio-Preview, Voxtral Small 24B (Mistral), MiMo V2 Omni (Xiaomi).
  • 4 dictation samples — 20–30 second clips of native-English read-aloud prose. Lexicon deliberately bounded to unambiguous tokens: no acronyms, no digit-versus-word-form numbers, no code, no URLs, no inconsistently-cased proper nouns. This constraint prevents formatting ambiguity from polluting WER measurements.
  • 5 MP3 bitrates — 16, 24, 32, 48, 64 kbps. Constant bitrate (CBR), mono, 16 kHz sample rate held constant across all variants. Encoded via pydub/LAME from 16-bit PCM source WAVs.
  • Verbatim transcription prompt — explicit instruction to transcribe exactly what was spoken with minimal intervention (sentence punctuation and capitalization only). This isolates audio-encoding effects from prompt-following variance; differences between conditions should reflect the acoustic representation the model consumed, not editorial choices.
  • Scoring: WER = Levenshtein edit distance over lowercased whitespace-split word tokens, normalized by reference length.
  • Latency: client-side wall-clock round-trip for the POST, measured from Jerusalem, Israel.

12 × 4 × 5 = 240 API calls. Aggregate cost ≈ $0.25 at April 2026 OpenRouter rates.

Finding 1: Compression below the perceptual-audio threshold is safe for most models

1

For the Gemini and Voxtral families, WER is statistically flat across the 16-64 kbps range. Per-bitrate variance within a given model exceeds the trend across bitrates; slopes are indistinguishable from zero at n=4 samples per cell. The heatmap makes the model × bitrate grid legible:

2

The operational implication for transcription pipeline builders is unambiguous: the default upload bitrate in most dictation apps is substantially overspecified. 64 kbps MP3 — roughly the perceptual-audio floor for music listening — carries no transcription-accuracy benefit over 32 kbps for speech content on Gemini or Voxtral, and sending it wastes about 2× the bandwidth. 24 kbps is likely also safe; 16 kbps is the point at which model-specific testing becomes warranted before committing.

This should not surprise anyone who has looked at the information content of speech — the useful spectral bands sit below 4 kHz and a 16 kbps MP3 at 16 kHz sample rate preserves the formants ASR systems depend on. But the assumption and the measurement are different things, and the measurement for audio-multimodal specifically did not previously exist in the public record.

Further compression (Opus at 8-16 kbps, which outperforms MP3 at equivalent bitrates for speech) is not accessible via OpenRouter's OpenAI-compatible input_audio schema, which accepts "wav" and "mp3" only. Opus would require bypassing OR for providers that expose it natively (Gemini's first-party API does).

Finding 2: Large cross-model differences in accuracy and latency

The relevant optimization surface for a transcription pipeline is accuracy × latency × cost:

3

Three notable model clusters:

mistralai/voxtral-small-24b-2507: average WER ≈ 0.02, average latency ≈ 1.0s. Fastest model in the panel by a significant margin — 2-8× faster than comparable-accuracy Gemini variants. On latency-sensitive pipelines (live dictation with visible response time), this is the model to beat. Context window (32k) becomes a constraint for clips over ~15 minutes.

google/gemini-3-flash-preview: average WER ≈ 0.014, average latency ≈ 2.2s. Best accuracy in the panel, consistent across all bitrates. Appropriate default when accuracy is prioritized over latency.

google/gemini-2.5-pro: average WER ≈ 0.018, average latency ≈ 7.2s, significantly higher cost. Strictly dominated by Gemini 3 Flash Preview for this workload. Dictation does not benefit from reasoning-model capabilities; the additional capacity is unused. Not recommended for transcription.

4

Latency values include network round-trip and are specific to the test location. Absolute latencies will differ elsewhere; the ordering should be broadly stable since it reflects serving infrastructure differences rather than routing.

Finding 3: Instruction adherence varies substantially — and it matters more than compression

The most operationally significant finding is not about bitrate at all. It's about whether the model does what the prompt asks.

5

The Gemini and Voxtral WER distributions are tight — median ≈ 0.02, narrow IQR, no tail. The three OpenAI audio models (GPT-Audio, GPT-Audio-Mini, GPT-4o-Audio-Preview) show bimodal behavior: a cluster of calls with WER ≈ 0.02 (as tight as Gemini) and a second cluster of calls with WER ≈ 0.9-1.2. Inspection of the outlier calls reveals the failure mode: the model treats the audio as conversational input and emits a response to the content rather than a transcription of it.

Example — GPT-Audio-Mini, sample 2, 16 kbps:

Reference (text the speaker read): "My grandmother used to make soup from whatever was in the kitchen on a Sunday afternoon. Carrots, a little onion, sometimes a handful of barley if she remembered to buy it…"

Model output: "That's a beautiful description. It paints a vivid picture of the scene—your grandmother's methodical and careful preparation, the simple ingredients, and the comforting aroma filling the apartment…"

WER on this call: 0.96. The same audio at the same bitrate produced WER = 0 on Gemini 2.0 Flash Lite and WER = 0.014 on Voxtral. The audio was fine. The model decoded the acoustic signal correctly. It then chose to generate conversationally despite the explicit system prompt:

Transcribe the audio VERBATIM.
- Write exactly what was said, word for word, in the order spoken.
...
- Do NOT rephrase, summarize, or reformat.
- Output plain text only. No preamble, no commentary.

This is a prompt-adherence failure, not an audio-understanding failure. The implications for production transcription pipelines are significant:

  1. Verbatim transcription of open-form speech is not a reliable capability across the audio-multimodal landscape. Models within a single provider share this behavior — all three OpenAI audio variants tested exhibit it. Models across different providers do not — neither Gemini nor Voxtral nor MiMo produce conversationalization outputs in this eval.
  2. Output validation becomes non-optional when using GPT-Audio-family models for transcription. A length-ratio check (model output token count vs. expected transcript length given audio duration) or a semantic-coherence check (output substantially matches audio content) would catch most failures, at the cost of a second model call that partly defeats the one-pass architectural advantage.
  3. Provider selection for transcription workloads should be evaluated on instruction-adherence specifically, not just accuracy on successful calls. WER averaged across all calls (as in Plot 3) penalizes unreliable models correctly; WER conditioned on "call succeeded" would have shown OpenAI models as competitive, which they are not in practice.

The failure rate for conversationalization in this eval is approximately 25-40% across the three OpenAI models, varying by sample (narrative prose appears to elicit it more than task-oriented content). At that rate, the one-pass architecture is no longer a win — the probability of pipeline failure exceeds the value of eliminating the second stage.

Recommendations for transcription pipeline builders

  1. Reduce upload bitrate to 32 kbps MP3 mono 16 kHz unless eval data against your own samples shows a model-specific regression. Most production dictation pipelines are over-provisioned on audio quality by 2-4×. Payload reduction is free bandwidth.

  2. 16-24 kbps is a worthwhile test for high-volume pipelines where bandwidth dominates cost. Expect possible but small accuracy regressions on specific models; eval before deploying.

  3. Do not send 44.1 or 48 kHz audio to audio-multimodal LLMs. The encoders operate on 16 kHz (or lower) representations; higher sample rates are server-side resampled and waste bandwidth.

  4. For latency-sensitive transcription workloads, default to mistralai/voxtral-small-24b-2507 at 24-32 kbps. Nothing else in the OpenRouter catalog matches its latency at comparable accuracy.

  5. For accuracy-sensitive workloads, default to google/gemini-3-flash-preview at 32 kbps. Lowest WER in the panel, consistent across bitrates, latency acceptable for non-interactive workflows.

  6. Avoid GPT-Audio-family models for verbatim transcription without output validation. The conversationalization failure mode is too common to rely on. Use them for audio understanding tasks (captioning, content analysis) where conversational output is wanted; use Gemini or Voxtral when verbatim transcription is wanted.

  7. Audit your own pipeline for this failure mode if currently using GPT-Audio variants in production. Length-ratio checks are the cheapest defense; a token-count output much lower than the audio duration would suggest is a strong signal of conversationalization.

Caveats

n=4 samples per cell — sufficient for the effect sizes observed, insufficient for tight confidence intervals on small deltas. All recordings share a speaker (native English, Israeli-inflected), microphone (USB condenser, consumer-grade), and acoustic environment (quiet indoor room). Results will not generalize cleanly to accented speech, noisy environments, multi-speaker conversation, or languages other than English without re-testing. Latency numbers are region-specific (routing from Israel to OpenRouter to upstream providers).

Dataset

The full dataset — source WAVs, every MP3 variant at byte-level, per-call transcriptions, all.csv of 240 calls — is published under MIT on GitHub and mirrored as a Hugging Face dataset. The eval harness (evals/full_sweep.py in the tooling repo) is reusable against your own samples with a single CLI argument change.

Community

Sign up or log in to comment