Do Language Models Share Unsafe Directions in Activation Space?
Mohammad Zbeeb
zbeeb
AI & ML interests
KAUST - AUB
Recent Activity
Upvoted a paper about 20 hours ago: CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models
Updated a model 1 day ago: zbeeb/pythia-Activations
Updated a collection 1 day ago: Shared Unsafe Directions