DanielGallagherIRE/fineweb-edu-1B-obfuscated
This model was trained to analyse model utility when training on various Derived Text Formats.
Derived Text Formats are versions of the same text, altered to reduce the chance that the original text can be extracted from a model trained on them, with applications in privacy and copyright protection.
In this case, the model was trained on the dataset after lemmatizing (i.e. converting to base forms) all words.
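To make the lemmatization step concrete, here is a minimal toy sketch in Python. It is not the pipeline used to build the dataset: the lookup table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical names invented for illustration, and a real pipeline would presumably use a full lemmatizer rather than a hand-written dictionary.

```python
import re

# Hypothetical toy lookup table, for illustration only.
# A real lemmatizer covers the whole vocabulary and uses part-of-speech context.
TOY_LEMMAS = {
    "was": "be", "were": "be", "is": "be", "are": "be",
    "ran": "run", "running": "run",
    "mice": "mouse", "children": "child",
    "better": "good",
}

def toy_lemmatize(text: str) -> str:
    """Replace each alphabetic word with its base form if the toy table knows it."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        # Look up the lowercased word; leave unknown words unchanged.
        return TOY_LEMMAS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", replace, text)

print(toy_lemmatize("The children were running while the mice ran."))
# prints: The child be run while the mouse run.
```

The output keeps the topic of the sentence readable while discarding inflection, which is the sense in which the lemmatized text is a derived format of the original.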
The dataset used for these experiments is codelion/fineweb-edu-1B, with all obfuscated formats found here.
The model was trained using the following key hyperparameters:
Base model: google-bert/bert-base-cased