Deployment: L25 RAW Probe | Acc: 75.76%
Browse files- README.md +25 -0
- config.json +14 -0
- model.py +15 -0
- model.safetensors +3 -0
README.md
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
library_name: transformers
|
| 4 |
+
tags:
|
| 5 |
+
- fairsteer
|
| 6 |
+
- bias-detection
|
| 7 |
+
- interpretability
|
| 8 |
+
- llama-2
|
| 9 |
+
- linear-probe
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
# BAD Classifier (FairSteer): Llama-2-7b-chat-hf Layer 25 (RAW Mode)
|
| 13 |
+
|
| 14 |
+
This **Biased Activation Detector (BAD)** was trained using the FairSteer methodology.
|
| 15 |
+
|
| 16 |
+
## Model Metadata
|
| 17 |
+
- **Base Model:** `meta-llama/Llama-2-7b-chat-hf`
|
| 18 |
+
- **Optimal Layer:** 25
|
| 19 |
+
- **Validation Accuracy:** 75.76%
|
| 20 |
+
- **Extraction Mode:** RAW (Directly matches FairSteer GitHub logic)
|
| 21 |
+
- **Protocol:** 1:1 Balanced Undersampling (Scenario-Grouped)
|
| 22 |
+
|
| 23 |
+
## Usage
|
| 24 |
+
1. Extract the residual-stream activation at the final token position (`hidden_states[:, -1, :]`) from layer 25 of the base model.
|
| 25 |
+
2. Pass the raw activation directly (no normalization) to the linear probe whose weights are stored in `model.safetensors`.
|
config.json
ADDED
|
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"metadata": {
|
| 3 |
+
"project": "FairSteer Llama-2 Debiasing",
|
| 4 |
+
"timestamp": "20260111_1145",
|
| 5 |
+
"optimal_layer": 25,
|
| 6 |
+
"val_accuracy": 0.7575601374570446,
|
| 7 |
+
"extraction_mode": "raw"
|
| 8 |
+
},
|
| 9 |
+
"model_config": {
|
| 10 |
+
"base_model": "meta-llama/Llama-2-7b-chat-hf",
|
| 11 |
+
"input_dim": 4096,
|
| 12 |
+
"architecture": "LinearProbe"
|
| 13 |
+
}
|
| 14 |
+
}
|
model.py
ADDED
|
@@ -0,0 +1,15 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
import torch
import torch.nn as nn


class BADClassifier(nn.Module):
    """Biased Activation Detector (BAD) linear probe (FairSteer standard).

    A single affine layer that maps one residual-stream activation vector
    to one scalar logit. Tuned for `meta-llama/Llama-2-7b-chat-hf`
    activations taken at layer 25.

    Note: the forward pass returns the *raw* logit — no sigmoid is
    applied — which mirrors the decision function of
    ``cuml.LogisticRegression`` used during training.
    """

    def __init__(self, input_dim: int = 4096) -> None:
        """Build the probe.

        Args:
            input_dim: Width of the residual stream (4096 for Llama-2-7b).
        """
        super().__init__()
        # Single output unit; state-dict key must remain "linear.*" so the
        # published model.safetensors checkpoint loads unchanged.
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Return the raw bias logit for activation batch ``x``.

        Args:
            x: Tensor of shape ``(batch, input_dim)``.

        Returns:
            Tensor of shape ``(batch, 1)`` — unnormalized logits; apply
            ``torch.sigmoid`` for probabilities.
        """
        # Direct linear pass matching cuml.LogisticRegression
        return self.linear(x)
model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1d2cebab4f6afde619bdb04044a5c304738016d9d72c01996c0a7d7ac6a7e3c7
|
| 3 |
+
size 16540
|