nvidia/Aegis-AI-Content-Safety-Dataset-2.0
How to use llm-semantic-router/mlcommons-safety-classifier-level2-hazard with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForSequenceClassification

# num_labels=9 matches the nine hazard categories this adapter predicts
base_model = AutoModelForSequenceClassification.from_pretrained("jhu-clsp/mmBERT-base", num_labels=9)
model = PeftModel.from_pretrained(base_model, "llm-semantic-router/mlcommons-safety-classifier-level2-hazard")
```

A LoRA-finetuned multilingual BERT model for 9-class hazard category classification, following the MLCommons AI Safety Hazard Taxonomy.
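This is Level 2 of a hierarchical safety classification system: Level 1 is a binary safe/unsafe classifier, and Level 2 (this model) assigns a hazard category to content that Level 1 flags as unsafe.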
The model uses mmBERT (Multilingual ModernBERT) as the base, supporting 1800+ languages.

Performance metrics:

| Metric | Value |
|---|---|
| Recall (macro) | 93.5% |
| F1 Score (macro) | 93.5% |
| Accuracy | 93.5% |
| Precision (macro) | 93.4% |
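
The macro-averaged figures above give each of the nine classes equal weight, regardless of how many examples each class has. A minimal sketch of how macro precision, recall, and F1 are conventionally computed is shown below. The helper name `macro_scores` and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter

def macro_scores(y_true, y_pred, num_classes):
    """Return (precision, recall, f1), each averaged with equal weight per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with 3 classes (made-up labels, for illustration only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = macro_scores(y_true, y_pred, num_classes=3)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because every class counts equally, a weak minority class pulls macro scores down more than it would pull down plain accuracy.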

Hazard categories:

| ID | Category | Description |
|---|---|---|
| S1 | Violent Crimes | Murder, assault, kidnapping, terrorism |
| S2 | Non-Violent Crimes | Fraud, theft, cybercrime, drug trafficking |
| S3 | Sex Crimes | Sexual assault, CSAM, sexual exploitation |
| S5 | Weapons & CBRNE | Weapons creation, chemical/biological/nuclear threats |
| S6 | Self-Harm | Suicide, self-injury, eating disorders |
| S7 | Hate | Discrimination, slurs, hate speech |
| S8 | Specialized Advice | Unqualified medical, legal, financial advice |
| S9 | Privacy | PII exposure, surveillance, data harvesting |
| S13 | Misinformation | Disinformation, conspiracy theories, false claims |
The synthetic dataset targets previously underrepresented categories:

LoRA configuration:

| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo |
| Trainable Parameters | 6.76M (2.15%) |
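
The numbers in this table can be sanity-checked with a little arithmetic. The sketch below assumes the standard LoRA formulation, in which a rank-`r` adapter on a `d_in x d_out` linear layer adds two matrices (`r x d_in` and `d_out x r`) and scales its update by `alpha / r`. The helper name and the 768-wide layer are hypothetical, chosen only to illustrate the formula.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

RANK, ALPHA = 32, 64      # values from the table above
scaling = ALPHA / RANK    # factor applied to the low-rank update

# Hypothetical square 768 x 768 projection, for illustration only
example_added = lora_added_params(768, 768, RANK)

# Total parameter count implied by "6.76M (2.15%)"
trainable = 6.76e6
fraction = 0.0215
implied_total = trainable / fraction

print(f"scaling={scaling}, example_added={example_added}, implied_total={implied_total / 1e6:.1f}M")
```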

Training hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |
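
The card lists a linear warmup scheduler but gives neither the number of warmup steps nor the behavior after warmup. As a rough sketch only, the function below ramps the learning rate linearly to the 3e-4 peak and then holds it constant; `warmup_steps=100` and the constant phase are assumptions, not the authors' settings.

```python
def lr_at_step(step: int, peak_lr: float = 3e-4, warmup_steps: int = 100) -> float:
    """Learning rate at a 0-indexed optimizer step under linear warmup.

    warmup_steps is a hypothetical value; holding the rate constant after
    warmup is also an assumption.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# Print the learning rate at a few steps
for step in (0, 49, 99, 500):
    print(step, lr_at_step(step))
```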

Training hardware:

| Component | Specification |
|---|---|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0 |
| Training Time | ~3.5 minutes |

Usage:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load base model and tokenizer
base_model = "jhu-clsp/mmBERT-base"
adapter = "llm-semantic-router/mlcommons-safety-classifier-level2-hazard"
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=9)
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # disable dropout for inference

# Label mapping (list index corresponds to the predicted class index)
labels = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation"
]

# Classify
text = "How to hack into someone's email account"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed at inference time
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

print(f"Hazard Category: {labels[prediction]}")
```
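
Taking the argmax returns only the top class. When a confidence threshold or a ranked list is needed, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python and made-up logits; in real use the input would be something like `outputs.logits[0].tolist()`. The helper names `softmax` and `top_k` are illustrative.

```python
import math

LABELS = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation",
]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits for illustration only, not real model output
example_logits = [0.1, 3.2, -1.0, 0.0, -0.5, 0.3, 0.2, 2.1, -0.7]
for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```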

Label mapping:

```json
{
  "S1_violent_crimes": 0,
  "S2_nonviolent_crimes": 1,
  "S3_sex_crimes": 2,
  "S5_weapons_cbrne": 3,
  "S6_self_harm": 4,
  "S7_hate": 5,
  "S8_specialized_advice": 6,
  "S9_privacy": 7,
  "S13_misinformation": 8
}
```
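
The mapping above goes from label name to class index, while decoding a prediction needs the inverse. A small sketch of building that inverse and recovering the taxonomy ID (for example `S13`) from a label name follows; the helper names `decode` and `taxonomy_id` are illustrative.

```python
label2id = {
    "S1_violent_crimes": 0,
    "S2_nonviolent_crimes": 1,
    "S3_sex_crimes": 2,
    "S5_weapons_cbrne": 3,
    "S6_self_harm": 4,
    "S7_hate": 5,
    "S8_specialized_advice": 6,
    "S9_privacy": 7,
    "S13_misinformation": 8,
}

# Invert the mapping so a predicted index can be turned back into a label
id2label = {idx: name for name, idx in label2id.items()}

# Indices should be contiguous from 0 so they line up with the logits
assert sorted(id2label) == list(range(len(label2id)))

def decode(class_index: int) -> str:
    """Map a predicted class index to its label name."""
    return id2label[class_index]

def taxonomy_id(label: str) -> str:
    """Extract the taxonomy ID prefix, e.g. 'S13' from 'S13_misinformation'."""
    return label.split("_", 1)[0]

print(decode(7), taxonomy_id(decode(8)))
```

Note that the class indices are contiguous (0 to 8) even though the taxonomy IDs are not (S4 and S10 to S12 are absent), so the two should not be used interchangeably.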
For production use, combine Level 1 and Level 2:
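
```python
# level1_model and level2_model are placeholders for the loaded
# Level 1 (binary) and Level 2 (hazard) classifiers.

# Step 1: Binary classification (Level 1)
level1_pred = level1_model(inputs)
if level1_pred == "unsafe":
    # Step 2: Hazard classification (Level 2)
    hazard_category = level2_model(inputs)
```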
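
The two-step flow above can be wrapped in one function so callers get either a hazard label or `None` for safe input. This is a minimal sketch under stated assumptions: `classify_with_cascade`, `fake_level1`, and `fake_level2` are hypothetical names, and the stand-in classifiers are trivial stubs rather than the real models.

```python
from typing import Callable, Optional

def classify_with_cascade(
    text: str,
    level1: Callable[[str], str],
    level2: Callable[[str], str],
) -> Optional[str]:
    """Run Level 2 only when Level 1 flags the text as unsafe.

    Returns the hazard category, or None when the text is judged safe.
    """
    if level1(text) != "unsafe":
        return None
    return level2(text)

# Stand-in classifiers (hypothetical stubs, not the real models)
def fake_level1(text: str) -> str:
    return "unsafe" if "hack" in text.lower() else "safe"

def fake_level2(text: str) -> str:
    return "S2_nonviolent_crimes"

result = classify_with_cascade(
    "How to hack into someone's email account", fake_level1, fake_level2
)
print(result)
```

Skipping Level 2 for safe input avoids a second forward pass on most traffic, which is the main reason to run the classifiers as a cascade.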
This model is designed for:

Citation:

```bibtex
@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level2-hazard}
}
```

License: Apache 2.0