Incorrectly tokenizes bracketed inorganic SMILES (e.g., [O-2].[Zn-].[Zn-])
I found that the Intern-S1 tokenizer breaks inorganic SMILES that contain fully bracketed ions.
Example:
[O-2].[Zn-].[Zn-]
Tokenizer output:
['[O', '-', '2', ']', '.[', 'Z', 'n', '-]', '.[', 'Z', 'n', '-]']
Issues:
- multi-character atom symbols inside brackets (Zn) are split
- bracketed ions ([O-2], [Zn-]) are not treated as atomic units
- symbols like .[ appear due to incorrect merging
In contrast, the tokenizer handles organic SMILES correctly:
O=C([O-])c1ccc(C(=O)[O-])c2c1CC2
→ ['O=C(', '[O-]', ')c', '1', ...]
How can bracketed ions (e.g., [Zn-], [O-2], [Fe+2]) be treated as single tokens, the same way [O-] is handled?
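For reference, the atom-level segmentation being asked about can be sketched as a regex split that keeps every bracket atom whole. This is a minimal sketch and not part of the Intern-S1 tokenizer: the pattern is a commonly used SMILES regex, and the names `SMILES_PATTERN` and `atom_level_tokenize` are made up for illustration.

```python
import re

# Hypothetical helper, NOT Intern-S1 code: a regex that treats any
# bracket atom ("[...]") as one unit, then falls back to single atoms,
# bonds, ring-closure digits, and punctuation.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atom_level_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Refuse to silently drop characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(atom_level_tokenize("[O-2].[Zn-].[Zn-]"))
# ['[O-2]', '.', '[Zn-]', '.', '[Zn-]']
```

A split like this only shows the segmentation a chemist would expect. Feeding such units to the model as new tokens would change its inputs relative to how it was trained, so it is not a drop-in fix.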
The SMILES tokenization is learned from the relevant corpora and merges character sequences based on how frequently they occur; it does not guarantee segmentation into atoms or ions. In fact, the vocabulary contains "[O-]" but not "[Zn-]", so the latter is split into a combination of several smaller tokens. However, this is how the model was trained, so it will automatically "recognize" these combinations.
Therefore, the answer is that the tokenizer cannot treat ions like [Zn-] as single tokens, and the best course of action is to leave it as is :)
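To illustrate the vocabulary point above, here is a toy sketch of why "[O-]" survives as one token while "[Zn-]" and "[O-2]" fall apart. It is an assumption-heavy simplification: the real tokenizer applies learned, frequency-based merges, whereas this uses plain greedy longest-match, and `TOY_VOCAB` is a hypothetical vocabulary assembled from the pieces in the reported output plus "[O-]".

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation over a fixed vocabulary.

    Simplified stand-in for a subword tokenizer; NOT Intern-S1's algorithm.
    """
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one is in vocab.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"No vocabulary entry matches at position {i}")
    return tokens


# Hypothetical toy vocabulary: "[O-]" is present, "[Zn-]" and "[O-2]" are not.
TOY_VOCAB = {"[O-]", "[O", "-]", ".[", "-", "2", "]", "Z", "n"}

print(greedy_tokenize("[O-]", TOY_VOCAB))
# ['[O-]']
print(greedy_tokenize("[O-2].[Zn-].[Zn-]", TOY_VOCAB))
# ['[O', '-', '2', ']', '.[', 'Z', 'n', '-]', '.[', 'Z', 'n', '-]']
```

With this toy vocabulary the second call reproduces the segmentation reported in the question, which is consistent with the explanation that the split comes from what the vocabulary contains rather than from a parsing bug.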