---
license: mit
datasets:
- UW/olmo-mix-1124-subset-p99
---

We developed this SuperBPE tokenizer for model developers who wish to experiment quickly with an off-the-shelf tokenizer in their pretraining pipeline! This is an English SuperBPE tokenizer with a vocab size of 128K, trained on a subset of the Olmo2 pretraining data. You can experiment with this tokenizer on our [tokenizer playground](https://superbpe.github.io/) by entering a custom HF repository ID.
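
For trying the tokenizer locally, here is a minimal sketch. It assumes the repository loads through the standard `transformers` `AutoTokenizer` interface, which this card does not state explicitly. The helper names (`load_tokenizer`, `bytes_per_token`, `demo`) are hypothetical and only for illustration, and the repository ID is left as an argument because it is not given here.

```python
def load_tokenizer(repo_id: str):
    """Load a tokenizer from the Hugging Face Hub.

    Assumption: the repo is loadable with AutoTokenizer. Requires the
    `transformers` package and network access on first use.
    """
    # Imported lazily so bytes_per_token below has no third-party dependency.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def bytes_per_token(tokenizer, text: str) -> float:
    """Average number of UTF-8 bytes covered by each token.

    A higher value means the text is encoded in fewer tokens.
    Works with any object exposing encode(text, add_special_tokens=...).
    """
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if not token_ids:
        raise ValueError("tokenizer produced no tokens for this text")
    return len(text.encode("utf-8")) / len(token_ids)


def demo(repo_id: str, text: str = "By the way, I am a fan of the Milky Way.") -> None:
    """Print the token strings and the bytes-per-token ratio for a sample text."""
    tokenizer = load_tokenizer(repo_id)
    print(tokenizer.tokenize(text))
    print(f"{bytes_per_token(tokenizer, text):.2f} bytes per token")
```

To use it, call `demo(...)` with this repository's ID. Running `bytes_per_token` on the same text with another tokenizer gives a rough way to compare encoding efficiency before committing to a pretraining run.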