[Bug] chat_template: missing <|channel>thought\n<channel|> wrapper for non-thinking SFT / multi-turn

#77
by flotherxi - opened

Summary

Gemma-4 IT (thinking-capable variants: 26B and 31B) uses the following native training format for non-thinking mode, wrapping every assistant turn with an empty thought channel:

<|turn>model\n<|channel>thought\n<channel|>{content}<turn|>\n

The chat template correctly emits this wrapper in the add_generation_prompt=True branch (around lines 342-344). However, the message loop that renders existing assistant messages (around line 234+) does not have this logic, which creates an asymmetry between the single-turn inference prompt on one hand and (a) multi-turn inference prompts and (b) SFT training inputs on the other.

Scope

  • ✅ Affects: Gemma-4 IT 26B and 31B (thinking-capable)
  • ❌ Does NOT affect: Gemma-4 E2B / E4B (their templates have no channel logic at all)
  • ❌ Does NOT affect: thinking mode (enable_thinking=True); the model generates its own thinking channel

Affected cases

1. Multi-turn inference (add_generation_prompt=True, enable_thinking=False)

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-31B-it", trust_remote_code=True)

msgs = [
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user",      "content": "what is LLM?"},
]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))

Output (note the historical <|turn>model\nHello!<turn|> is missing the wrapper):

<bos><|turn>user
Hi<turn|>
<|turn>model
Hello!<turn|>               ← should be: <|turn>model\n<|channel>thought\n<channel|>Hello!<turn|>
<|turn>user
what is LLM?<turn|>
<|turn>model
<|channel>thought
<channel|>                   ← current generation prompt correctly contains the wrapper

The historical assistant turn is in a format the model was never trained with (missing wrapper), putting the conditioning context slightly out of distribution.
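A quick way to see the asymmetry programmatically, continuing from the snippet above (the substring counts are just an illustration, not part of the original report):

# Count rendered model-turn headers vs. empty thought-channel wrappers. With the
# current template only the trailing generation prompt carries the wrapper,
# so the two counts disagree.
rendered = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
print(rendered.count("<|turn>model\n"))                 # 2 (history turn + generation prompt)
print(rendered.count("<|channel>thought\n<channel|>"))  # 1 (generation prompt only)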

2. SFT training (add_generation_prompt=False, enable_thinking=False)

msgs = [
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False))

Output:

<bos><|turn>user
Hi<turn|>
<|turn>model
Hello!<turn|>               ← should be: <|turn>model\n<|channel>thought\n<channel|>Hello!<turn|>

When this sequence is used for teacher-forced SFT, the first content token (here Hello) is forced at a position where the model's native prior is p(<|channel>) ≈ 1.0. Per-token cross-entropy at that position jumps to roughly log(vocab_size) ≈ 12.5, and training fails to converge.
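For completeness, a rough sketch of how that per-token loss can be inspected, reusing tok and msgs from the snippet above (how the first content-token position is picked out of the loss vector is left to the reader and is not part of the report):

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# Sketch only: load the same checkpoint as above and score the teacher-forced sequence.
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it",
                                             trust_remote_code=True, torch_dtype="auto")
ids = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=False,
                              return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits
# Per-token cross-entropy: the prediction at position t is scored against the token at t+1.
losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
# The entry at the position of the first assistant content token ("Hello") is the one
# that sits near log(vocab_size) ≈ 12.5 with the unwrapped rendering.
print(losses)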

Root cause

The add_generation_prompt branch (line ~342-344) emits the marker:

{%- if add_generation_prompt -%}
    ...
    {{- '<|turn>model\n' -}}
    {%- if not enable_thinking | default(false) -%}
        {{- '<|channel>thought\n<channel|>' -}}
    {%- endif -%}
    ...
{%- endif -%}

The message loop (lines ~234-238) only emits a channel block when both reasoning AND tool_calls are present on the message, so regular non-thinking assistant messages get no wrapper at all.

Proposed fix

Mirror the add_generation_prompt logic inside the message loop, right after the turn header is emitted:

     {#- Render reasoning/reasoning_content as thinking channel -#}
     {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
+
+    {#- Non-thinking format: emit empty thought channel wrapper so that rendering an
+       existing assistant turn matches the training-time format produced by the
+       `add_generation_prompt=True` branch below. -#}
+    {%- if role == 'model' and not continue_same_model_turn and not thinking_text and not (enable_thinking | default(false)) -%}
+        {{- '<|channel>thought\n<channel|>' -}}
+    {%- endif -%}
+
     {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
         {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
     {%- endif -%}

Guards:

  • role == 'model': assistant messages only
  • not continue_same_model_turn: consecutive assistant messages share a single <|turn>model\n header, so the marker is injected only on the first one; otherwise a single model turn would contain multiple markers, which is never the training format (see the regression sketch after this list)
  • not thinking_text: defer to the existing reasoning branch below
  • not (enable_thinking | default(false)): symmetric with the same guard in the add_generation_prompt branch (in thinking mode the model emits its own channel)
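A small local regression sketch for the consecutive-assistant-message guard. Loading the edited template from a local file is an assumption for illustration (chat_template_patched.jinja is a hypothetical path); the counts simply assert one shared turn header and one wrapper:

# Hypothetical local experiment: override the tokenizer's template with the patched text.
tok.chat_template = open("chat_template_patched.jinja").read()  # hypothetical file

msgs2 = [
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "assistant", "content": "How can I help?"},
]
out = tok.apply_chat_template(msgs2, tokenize=False, add_generation_prompt=False)
assert out.count("<|turn>model\n") == 1                  # consecutive turns share one header
assert out.count("<|channel>thought\n<channel|>") == 1   # wrapper injected only once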

Verification

After the fix, both affected cases produce the training-time format:

Multi-turn inference:

<|turn>model
<|channel>thought
<channel|>Hello!<turn|>     ← historical assistant turn now matches training format

SFT training:

<|turn>model
<|channel>thought
<channel|>Hello!<turn|>     ← content is teacher-forced at the position the model expects

On a single-image SFT smoke test, first-token cross-entropy drops from ≈12.5 to ≈2-3. Thinking-mode cases are unaffected (skipped by the not enable_thinking guard).
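With the patched template loaded (as in the sketch above), the SFT case from earlier can be checked with a short assertion (msgs is the two-message list from case 2):

out = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
assert "<|turn>model\n<|channel>thought\n<channel|>Hello!<turn|>" in out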


Related question

The smaller sibling models Gemma-4 E2B / E4B have no <|channel>thought\n<channel|> wrapper logic in their templates at all for non-thinking mode: neither the add_generation_prompt=True branch nor the message loop emits it.

Could the model authors clarify which is intended:

  • Option A: E2B/E4B were trained without the empty wrapper (no thinking architecture → no marker needed by design), while 26B/31B were trained with it. In this case the current PR scope (only 26B/31B) is correct.
  • Option B: The training format is actually identical across all sizes, and E2B/E4B templates are missing the same logic. In this case E2B/E4B templates need the same fix applied.

Knowing which is the ground truth would help us decide whether this PR should be extended to the smaller variants' chat templates as well.
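One way to check what each variant's shipped template actually contains (the E2B/E4B repo ids below are guesses following the naming scheme of the 31B checkpoint; substitute the real ones):

from transformers import AutoTokenizer

for name in ["google/gemma-4-e2b-it", "google/gemma-4-e4b-it", "google/gemma-4-31B-it"]:
    t = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    has_wrapper = "<|channel>thought" in (t.chat_template or "")
    print(f"{name}: wrapper logic present = {has_wrapper}")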

Google org

Hi @flotherxi,

Thank you for bringing this to our attention. I was able to reproduce the issue for this particular edge case and have shared it with our Engineering team for further review.
