BAD Classifier (FairSteer): Llama-2-7b-chat-hf Layer 25 (RAW Mode)

This Biased Activation Detector (BAD) was trained using the FairSteer methodology.

Model Metadata

  • Base Model: meta-llama/Llama-2-7b-chat-hf
  • Optimal Layer: 25
  • Validation Accuracy: 75.76%
  • Extraction Mode: RAW (matches the extraction logic in the FairSteer GitHub repository)
  • Protocol: 1:1 Balanced Undersampling (Scenario-Grouped)

Usage

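  1. Extract the residual stream activation at the last token position ([:, -1, :]) from Layer 25.
  2. Pass the raw activation, without further preprocessing, to the classifier whose weights are stored in model.safetensors.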
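
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the FairSteer reference code, and it rests on several assumptions: that the classifier is a single linear layer, that "Layer 25" corresponds to index 25 of the hidden_states tuple returned by a Hugging Face forward pass with output_hidden_states=True (index 0 being the embedding output), and that the weight and bias have been loaded separately. The helper names last_token_activation and probe_logits are hypothetical.

```python
import torch

LAYER = 25  # assumption: index into the hidden_states tuple (0 = embedding output)


def last_token_activation(hidden_states, layer=LAYER):
    """Return the activation at the final sequence position.

    hidden_states: tuple of [batch, seq_len, hidden] tensors, one per layer.
    Result: a [batch, hidden] tensor.
    """
    return hidden_states[layer][:, -1, :]


def probe_logits(activation, weight, bias):
    """Apply a linear probe to the raw activation (no normalization).

    weight: [num_classes, hidden], bias: [num_classes].
    Assumption: the BAD classifier is a single linear layer.
    Result: a [batch, num_classes] tensor of logits.
    """
    return activation.float() @ weight.float().T + bias.float()
```

The tensors in model.safetensors can be read with safetensors.torch.load_file, which returns a dict of name to tensor; the key names are not documented here, so inspect the dict before picking the weight and bias. Because [:, -1, :] always takes the final position, batched prompts need left padding (or one prompt per batch) so that the last position is a real token rather than padding.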