[Bug] chat_template: missing <|channel>thought\n<channel|> wrapper for non-thinking SFT / multi-turn

#77
by flotherxi - opened

Summary

Gemma-4 IT (thinking-capable variants: 26B and 31B) uses the following native training format for non-thinking mode, wrapping every assistant turn with an empty thought channel:

<|turn>model\n<|channel>thought\n<channel|>{content}<turn|>\n

The chat template correctly emits this wrapper in the add_generation_prompt=True branch (around lines 342-344). However, the message loop that renders existing assistant messages (around line 234+) does not have this logic, which creates an asymmetry between the single-turn inference prompt on one hand and (a) multi-turn inference prompts and (b) SFT training inputs on the other.

Scope

  • ✅ Affects: Gemma-4 IT 26B and 31B (thinking-capable)
  • ❌ Does NOT affect: Gemma-4 E2B / E4B (their templates have no channel logic at all)
  • ❌ Does NOT affect: thinking mode (enable_thinking=True); the model generates its own thinking channel

Affected cases

1. Multi-turn inference (add_generation_prompt=True, enable_thinking=False)

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-31B-it", trust_remote_code=True)

msgs = [
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user",      "content": "what is LLM?"},
]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))

Output (note the historical <|turn>model\nHello!<turn|> is missing the wrapper):

<bos><|turn>user
Hi<turn|>
<|turn>model
Hello!<turn|>               ← should be: <|turn>model\n<|channel>thought\n<channel|>Hello!<turn|>
<|turn>user
what is LLM?<turn|>
<|turn>model
<|channel>thought
<channel|>                   ← current generation prompt correctly contains the wrapper

The historical assistant turn is in a format the model was never trained with (missing wrapper), putting the conditioning context slightly out of distribution.
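A quick way to see the asymmetry programmatically, continuing from the snippet above (the substring counts are just an illustration, not part of the original report):

# Count rendered model-turn headers vs. empty thought-channel wrappers. With the
# current template only the trailing generation prompt carries the wrapper,
# so the two counts disagree.
rendered = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
print(rendered.count("<|turn>model\n"))                 # 2 (history turn + generation prompt)
print(rendered.count("<|channel>thought\n<channel|>"))  # 1 (generation prompt only)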

2. SFT training (add_generation_prompt=False, enable_thinking=False)

msgs = [
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False))

Output:

<bos><|turn>user
Hi<turn|>
<|turn>model
Hello!<turn|>               ← should be: <|turn>model\n<|channel>thought\n<channel|>Hello!<turn|>

When this sequence is used for teacher-forced SFT, the first content token (here Hello) is forced at a position where the model's native prior is p(<|channel>) ≈ 1.0. Per-token cross-entropy at that position jumps to roughly log(vocab_size) ≈ 12.5, and training fails to converge.
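For completeness, a rough sketch of how that per-token loss can be inspected, reusing tok and msgs from the snippet above (how the first content-token position is picked out of the loss vector is left to the reader and is not part of the report):

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# Sketch only: load the same checkpoint as above and score the teacher-forced sequence.
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it",
                                             trust_remote_code=True, torch_dtype="auto")
ids = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=False,
                              return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits
# Per-token cross-entropy: the prediction at position t is scored against the token at t+1.
losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
# The entry at the position of the first assistant content token ("Hello") is the one
# that sits near log(vocab_size) ≈ 12.5 with the unwrapped rendering.
print(losses)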

Root cause

The add_generation_prompt branch (line ~342-344) emits the marker:

{%- if add_generation_prompt -%}
    ...
    {{- '<|turn>model\n' -}}
    {%- if not enable_thinking | default(false) -%}
        {{- '<|channel>thought\n<channel|>' -}}
    {%- endif -%}
    ...
{%- endif -%}

The message loop (lines ~234-238) only emits a channel block when both reasoning AND tool_calls are present on the message, so regular non-thinking assistant messages get no wrapper at all.

Proposed fix

Mirror the add_generation_prompt logic inside the message loop, right after the turn header is emitted:

     {#- Render reasoning/reasoning_content as thinking channel -#}
     {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
+
+    {#- Non-thinking format: emit empty thought channel wrapper so that rendering an
+       existing assistant turn matches the training-time format produced by the
+       `add_generation_prompt=True` branch below. -#}
+    {%- if role == 'model' and not continue_same_model_turn and not thinking_text and not (enable_thinking | default(false)) -%}
+        {{- '<|channel>thought\n<channel|>' -}}
+    {%- endif -%}
+
     {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
         {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
     {%- endif -%}

Guards:

  • role == 'model': assistant messages only
  • not continue_same_model_turn: consecutive assistant messages share a single <|turn>model\n header, so the marker is injected only on the first one; otherwise a single model turn would contain multiple markers, which is never the training format (see the regression sketch after this list)
  • not thinking_text: defer to the existing reasoning branch below
  • not (enable_thinking | default(false)): symmetric with the same guard in the add_generation_prompt branch (in thinking mode the model emits its own channel)
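A small local regression sketch for the consecutive-assistant-message guard. Loading the edited template from a local file is an assumption for illustration (chat_template_patched.jinja is a hypothetical path); the counts simply assert one shared turn header and one wrapper:

# Hypothetical local experiment: override the tokenizer's template with the patched text.
tok.chat_template = open("chat_template_patched.jinja").read()  # hypothetical file

msgs2 = [
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "assistant", "content": "How can I help?"},
]
out = tok.apply_chat_template(msgs2, tokenize=False, add_generation_prompt=False)
assert out.count("<|turn>model\n") == 1                  # consecutive turns share one header
assert out.count("<|channel>thought\n<channel|>") == 1   # wrapper injected only once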

Verification

After the fix, both affected cases produce the training-time format:

Multi-turn inference:

<|turn>model
<|channel>thought
<channel|>Hello!<turn|>     ← historical assistant turn now matches training format

SFT training:

<|turn>model
<|channel>thought
<channel|>Hello!<turn|>     ← content is teacher-forced at the position the model expects

On a single-image SFT smoke test, first-token cross-entropy drops from ≈12.5 to ≈2-3. Thinking-mode cases are unaffected (skipped by the not enable_thinking guard).
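With the patched template loaded (as in the sketch above), the SFT case from earlier can be checked with a short assertion (msgs is the two-message list from case 2):

out = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
assert "<|turn>model\n<|channel>thought\n<channel|>Hello!<turn|>" in out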


Related question

The smaller sibling models Gemma-4 E2B / E4B have no <|channel>thought\n<channel|> wrapper logic in their templates at all for non-thinking mode: neither the add_generation_prompt=True branch nor the message loop emits it.

Could the model authors clarify which is intended:

  • Option A: E2B/E4B were trained without the empty wrapper (no thinking architecture → no marker needed by design), while 26B/31B were trained with it. In this case the current PR scope (only 26B/31B) is correct.
  • Option B: The training format is actually identical across all sizes, and E2B/E4B templates are missing the same logic. In this case E2B/E4B templates need the same fix applied.

Knowing which is the ground truth would help us decide whether this PR should be extended to the smaller variants' chat templates as well.
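One way to check what each variant's shipped template actually contains (the E2B/E4B repo ids below are guesses following the naming scheme of the 31B checkpoint; substitute the real ones):

from transformers import AutoTokenizer

for name in ["google/gemma-4-e2b-it", "google/gemma-4-e4b-it", "google/gemma-4-31B-it"]:
    t = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    has_wrapper = "<|channel>thought" in (t.chat_template or "")
    print(f"{name}: wrapper logic present = {has_wrapper}")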

Google org

Hi @flotherxi,

Thank you for bringing this to our attention. I was able to reproduce the issue for this particular edge case and have shared it with our Engineering team for further review.
