Incorrectly tokenizes bracketed inorganic SMILES (e.g., [O-2].[Zn-].[Zn-])

by MianzhiPan - opened Nov 16, 2025

Nov 16, 2025

I found that the Intern-S1 tokenizer breaks inorganic SMILES that contain fully bracketed ions.

Example:

[O-2].[Zn-].[Zn-]

Tokenizer output:

['[O', '-', '2', ']', '.[', 'Z', 'n', '-]', '.[', 'Z', 'n', '-]']

Issues:

Multi-character atom symbols inside brackets (Zn) are split into single characters.
Bracketed ions ([O-2], [Zn-]) are not treated as atomic units.
Merged tokens such as .[ appear, joining the dot separator to the opening bracket of the next atom.

In contrast, the tokenizer handles organic SMILES correctly:

O=C([O-])c1ccc(C(=O)[O-])c2c1CC2
→ ['O=C(', '[O-]', ')c', '1', ...]

How can I make the tokenizer treat bracketed ions (e.g., [Zn-], [O-2], [Fe+2], etc.) as single tokens, the same way [O-] is handled?
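To make the expected segmentation concrete, here is a minimal sketch of an atom-level SMILES splitter based on a regular expression. It is not part of the Intern-S1 tokenizer; the function name `atom_tokenize` and the pattern are assumptions chosen for illustration, and the pattern covers only common SMILES syntax.

```python
import re

# Illustrative pattern (an assumption, not Intern-S1's behavior): a bracketed
# atom such as [Zn-] or [O-2] is matched first, as one unit, before any of the
# single-symbol alternatives are tried.
SMILES_ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9]"
)


def atom_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = SMILES_ATOM_PATTERN.findall(smiles)
    # Fail loudly if any character was skipped by the pattern.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(atom_tokenize("[O-2].[Zn-].[Zn-]"))
# ['[O-2]', '.', '[Zn-]', '.', '[Zn-]']
```

With this kind of split, every bracketed ion stays intact regardless of how often it appeared in a training corpus.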

Zhangyc02

Nov 18, 2025

The SMILES tokenizer is trained on relevant corpora and merges character sequences according to their frequency; it does not guarantee segmentation into atomic units or ions. In fact, the vocabulary contains "[O-]" but not "[Zn-]", so the latter is split into a combination of several smaller tokens. However, this is how the model was trained, so it will still "recognize" these combinations.
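A toy sketch of why vocabulary membership decides the split. The vocabulary below is hypothetical (it is not the real Intern-S1 vocabulary), and greedy longest-match is a simplification of how a trained subword tokenizer applies its merges; the point is only that "[O-]" survives as one token because it is an entry, while "[Zn-]" falls apart into smaller entries.

```python
# Hypothetical toy vocabulary, chosen for illustration only.
TOY_VOCAB = {"[O-]", "[O", ".[", "-]", "[", "]", "-", ".", "2", "O", "Z", "n"}


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a fixed vocabulary."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"No vocabulary entry matches at position {i}")
    return tokens


print(greedy_tokenize("[O-]", TOY_VOCAB))
# ['[O-]']
print(greedy_tokenize("[O-2].[Zn-].[Zn-]", TOY_VOCAB))
# ['[O', '-', '2', ']', '.[', 'Z', 'n', '-]', '.[', 'Z', 'n', '-]']
```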

Therefore, the answer is that the tokenizer does not treat ions like [Zn-] as single tokens, and the best course of action is to leave it as is :)

MianzhiPan changed discussion status to closed Dec 7, 2025
