AetherMind-KD-Student

A Robust and Efficient Knowledge-Distilled Model for Natural Language Inference (NLI)

Repository: samerzaher80/AetherMind-KD-Student
License: MIT


📘 Overview

AetherMind-KD-Student is a 184M-parameter Natural Language Inference (NLI) model distilled from a DeBERTa-v3 teacher using a multi-stage, adversarial-aware knowledge distillation pipeline.
The model is designed to provide:

  • High accuracy on standard NLI benchmarks
  • Strong robustness on adversarial datasets
  • Excellent zero-shot generalization to unseen datasets
  • High inference efficiency on consumer GPUs

This makes it suitable for research and practical applications that require fast and reliable sentence-level reasoning.


🧠 Key Features

✔ Knowledge Distillation from a Large DeBERTa-v3 Teacher

  • Teacher: DeBERTa-v3-based NLI model
  • Student: 184M-parameter transformer
  • Combined objective:
    • 70% KLDivLoss on teacher soft logits
    • 30% CrossEntropyLoss on gold labels
  • Temperature scaling (T ≈ 3.0) for softened targets

✔ Multi-Stage Curriculum

Teacher supervision was applied over a curriculum of NLI datasets (a loading sketch follows the list):

  1. SNLI – core NLI patterns
  2. MNLI – multi-domain robustness
  3. ANLI R1–R3 – adversarial reasoning
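
All three datasets are available on the Hugging Face Hub. The following is a minimal loading sketch using the Hugging Face datasets library; the dataset identifiers and ANLI's per-round split names are standard, but the exact staging and filtering used for this model are assumptions.

from datasets import load_dataset

# Stage 1: core NLI patterns.
snli = load_dataset("snli")
# SNLI contains unlabeled examples (label == -1), which are typically filtered out.
snli = snli.filter(lambda ex: ex["label"] != -1)

# Stage 2: multi-genre robustness (matched + mismatched validation splits).
mnli = load_dataset("multi_nli")

# Stage 3: adversarial reasoning; ANLI exposes per-round splits (train_r1, train_r2, train_r3).
anli = load_dataset("anli")

curriculum = [
    snli["train"],
    mnli["train"],
    anli["train_r1"],
    anli["train_r2"],
    anli["train_r3"],
]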

✔ Training Enhancements

  • BalancedBatchSampler to keep the entailment/neutral/contradiction distribution balanced within each batch (see the sketch after this list)
  • Emphasis on the contradiction and neutral classes via loss weighting and sampling
  • Careful scheduling and early stopping based on validation performance
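
The sampler itself is not shipped in this repository; the following is a minimal sketch of what a BalancedBatchSampler could look like, assuming integer class labels per example and a batch size divisible by the number of classes. The sampler used during training may differ in detail.

import random
from collections import defaultdict
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    """Yield batches with an equal number of examples from each class (sketch)."""

    def __init__(self, labels, batch_size, num_classes=3):
        assert batch_size % num_classes == 0
        self.per_class = batch_size // num_classes
        self.by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_class[label].append(idx)

    def __iter__(self):
        # Shuffle each class pool independently, then interleave fixed-size chunks.
        pools = {c: random.sample(idxs, len(idxs)) for c, idxs in self.by_class.items()}
        for b in range(len(self)):
            batch = []
            for pool in pools.values():
                batch.extend(pool[b * self.per_class:(b + 1) * self.per_class])
            random.shuffle(batch)
            yield batch

    def __len__(self):
        return min(len(idxs) for idxs in self.by_class.values()) // self.per_class

# Usage with a PyTorch DataLoader:
# loader = DataLoader(train_dataset, batch_sampler=BalancedBatchSampler(train_labels, batch_size=48))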

📚 Datasets

✅ Used During Training / Distillation

| Dataset | Role |
| --- | --- |
| SNLI | Base NLI training (entailment, neutral, contradiction) |
| MNLI | Multi-genre generalization (matched + mismatched) |
| ANLI (R1–R3) | Adversarial robustness and hard examples |

🚫 Not Used in Training (Zero-Shot Evaluation Only)

The following datasets were not used during training or distillation; all results on them are purely zero-shot:

| Dataset | Type | Notes |
| --- | --- | --- |
| RTE (GLUE) | Textual entailment | Zero-shot generalization |
| HANS | Heuristic / syntactic bias test | Zero-shot |
| SciTail | Science-domain entailment | Evaluated in binary setting |
| XNLI (English) | Cross-lingual NLI test | Zero-shot on English split |

๐Ÿ— Model Architecture

The model follows a compact transformer architecture:

  • 12 Transformer encoder layers
  • Hidden size: 768
  • 12 attention heads
  • Intermediate feed-forward size as in BERT/DeBERTa-base-style models
  • Final classification head with 3 output logits:
    • 0 = entailment
    • 1 = neutral
    • 2 = contradiction

Total parameters: 184,424,451

The design target is to match or exceed the performance of larger teacher models while remaining efficient enough for real-time inference on a single consumer GPU.
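
These numbers can be checked directly against the published checkpoint. The sketch below uses the standard transformers API and should print values consistent with the list above:

from transformers import AutoConfig, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"

# Architecture hyperparameters from the checkpoint configuration.
config = AutoConfig.from_pretrained(model_name)
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads, config.num_labels)

# Total parameter count (expected: 184,424,451).
model = AutoModelForSequenceClassification.from_pretrained(model_name)
print(sum(p.numel() for p in model.parameters()))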


🔥 Knowledge Distillation Strategy

Objective

The total loss is a weighted combination:

  • Knowledge Distillation Loss (KLDivLoss)
    • Encourages student logits to match the teacher's softened output distribution
  • Supervised Loss (CrossEntropy)
    • Encourages correct prediction of the gold label

Formally:

L_total = 0.7 · L_KD + 0.3 · L_CE

where L_KD uses temperature-scaled teacher logits.
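
A minimal PyTorch sketch of this objective is shown below, with alpha = 0.7 and T = 3.0 as described above; the T^2 gradient-scaling factor on the KD term is a conventional choice rather than something stated in this card.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.7):
    """Compute L_total = alpha * L_KD + (1 - alpha) * L_CE."""
    # Soft targets: KL divergence between temperature-softened distributions.
    # Multiplying by T*T keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard targets: cross-entropy on the gold labels. Per-class weights could be
    # passed here to emphasize the neutral and contradiction classes.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1 - alpha) * ce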

Additional Techniques

  • Balanced batches w.r.t. class labels
  • Emphasis on contradiction / neutral examples during later stages
  • Adversarial samples from ANLI to harden reasoning under distribution shifts

📊 Evaluation Results

1️⃣ Core NLI Benchmarks

| Dataset | Split | Accuracy | Macro-F1 |
| --- | --- | --- | --- |
| MNLI (matched) | validation | 90.47% | 90.42% |
| MNLI (mismatched) | validation | 90.12% | 90.07% |
| SNLI | test | ~88–89% | ~88–89% |

2️⃣ Adversarial NLI (ANLI)

| Dataset | Split | Accuracy | Macro-F1 |
| --- | --- | --- | --- |
| ANLI R1 | test_r1 | 73.60% | 73.61% |
| ANLI R2 | test_r2 | 57.70% | 57.60% |
| ANLI R3 | test_r3 | 53.67% | 53.68% |

These scores indicate strong robustness, especially considering the model's size.


3️⃣ Zero-Shot Generalization

These datasets were never seen during training. All scores are zero-shot.

RTE (GLUE)

  • Accuracy: 86.28%
  • Macro-F1: 86.20%

HANS

  • Accuracy: 77.74%
  • Macro-F1: 76.60%

The strong performance on HANS suggests reduced dependence on shallow lexical heuristics.

SciTail (Binary Setting)

SciTail originally has two classes: entailment vs. neutral. For evaluation, the model's 3-way predictions are mapped to a binary decision (see the sketch after the table below):

  • Entailment → entailment
  • Neutral + contradiction → non-entailment
| Split | Accuracy | Macro-F1 |
| --- | --- | --- |
| Train | 82.37% | 80.99% |
| Dev | 78.83% | 78.81% |
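
A minimal sketch of the 3-way-to-binary mapping, assuming the label indices listed in the architecture section; whether the reported scores were obtained by argmax-then-map (as below) or by comparing summed probabilities is an assumption:

import torch

ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2

def to_scitail_binary(logits):
    """Map 3-way NLI predictions to entailment (0) vs. non-entailment (1)."""
    pred = logits.argmax(dim=-1)
    # Neutral and contradiction predictions both count as non-entailment.
    return torch.where(pred == ENTAILMENT,
                       torch.zeros_like(pred),
                       torch.ones_like(pred))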

XNLI (English, zero-shot)

  • Accuracy: 90.92%
  • Macro-F1: 90.94%

This demonstrates strong cross-domain and cross-benchmark generalization, even without explicit multilingual or XNLI-specific training.

Summary of Results

| Task | Dataset | Split | Accuracy | Macro-F1 |
| --- | --- | --- | --- | --- |
| Natural Language Inference | MNLI (matched) | validation | 90.47% | 90.42% |
| Natural Language Inference | MNLI (mismatched) | validation | 90.12% | 90.07% |
| Natural Language Inference | SNLI | test | ~88–89% | ~88–89% |
| Adversarial NLI | ANLI R1 | test_r1 | 73.60% | 73.61% |
| Adversarial NLI | ANLI R2 | test_r2 | 57.70% | 57.60% |
| Adversarial NLI | ANLI R3 | test_r3 | 53.67% | 53.68% |
| Zero-shot | RTE (GLUE) | validation | 86.28% | 86.20% |
| Zero-shot | HANS | validation | 77.74% | 76.60% |
| Zero-shot (binary) | SciTail | dev | 78.83% | 78.81% |
| Zero-shot | XNLI (English) | test | 90.92% | 90.94% |

⚡ Efficiency

| Metric | Value |
| --- | --- |
| Total parameters | 184,424,451 |
| Inference speed | ≈ 308.51 samples/second |
| Hardware | RTX 3050 (8 GB), CUDA 11.8 |

These numbers make the model a practical choice for production environments and large-scale batch inference on modest hardware; a throughput-measurement sketch follows.
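
The sketch below measures batched inference throughput with PyTorch; the batch size and synthetic workload are illustrative, so measured numbers will vary with hardware and settings.

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()

# Illustrative workload: one premise/hypothesis pair repeated many times.
pairs = [("A cat is sleeping on the sofa.", "An animal is resting indoors.")] * 2048
batch_size = 64

start = time.perf_counter()
with torch.no_grad():
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        inputs = tokenizer([p for p, _ in batch], [h for _, h in batch],
                           padding=True, truncation=True, return_tensors="pt").to(device)
        model(**inputs)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{len(pairs) / elapsed:.1f} samples/second")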


🧪 Intended Use

Recommended Uses

  • Research on NLI, robustness, and knowledge distillation
  • As a drop-in NLI component for:
    • Scientific text understanding
    • Claim verification prototypes
    • General English reasoning tasks
  • Zero-shot probing on new NLI-style benchmarks

Not Recommended For

  • Safety-critical applications (medical diagnosis, legal decisions, etc.) without human experts in the loop
  • High-stakes multilingual use cases (model is trained and validated on English only)
  • Long-document reasoning beyond typical transformer context length

⚠ Limitations

  • Performance on ANLI R3 remains challenging, consistent with broader model behavior reported in the literature
  • No dedicated multilingual training (XNLI non-English languages were not evaluated)
  • No explicit calibration of output probabilities (users may wish to post-calibrate the logits; see the sketch after this list)
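
A common post-hoc option is temperature scaling fitted on a held-out labeled set. The following is a minimal sketch (not part of this release; the optimizer settings are illustrative):

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    """Fit a single scalar temperature by minimizing NLL on held-out data
    (standard temperature scaling, Guo et al., 2017)."""
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Calibrated probabilities: F.softmax(logits / T, dim=-1) with T = fit_temperature(val_logits, val_labels)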

🔮 Future Work

Planned and possible future enhancements include:

  • Adversarial fine-tuning specifically for ANLI R3
  • Cross-lingual extensions using full XNLI
  • Domain adapters for biomedical and clinical NLI (e.g., MedNLI)
  • Integration in larger cognitive reasoning systems with memory and tool-use (outside the scope of this model card)

📦 Files in This Repository

  • config.json – model configuration
  • model.safetensors – model weights
  • tokenizer.json – tokenizer model
  • tokenizer_config.json – tokenizer configuration
  • special_tokens_map.json – special tokens metadata
  • spm.model – SentencePiece model (if applicable)
  • added_tokens.json – additional tokens (if any)
  • training_args.bin – training arguments (optional, for reproducibility)
  • trainer_state.json – trainer state (optional, for reproducibility)

💻 Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "A cat is sleeping on the sofa."
hypothesis = "An animal is resting indoors."

# Encode the premise/hypothesis pair and run a single forward pass.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class and map it to its label name.
logits = outputs.logits
pred = logits.argmax(dim=-1).item()

id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
print(id2label[pred])

📜 Citation

If you use this model in your research, please cite:

@misc{aethermind2025kdstudent,
  title        = {AetherMind-KD-Student: A Robust and Efficient Knowledge-Distilled NLI Model},
  author       = {Sameer S. Najm},
  year         = {2025},
  howpublished = {Hugging Face model repository},
  note         = {\url{https://huggingface.co/samerzaher80/AetherMind-KD-Student}}
}

👤 Author

Sameer S. Najm
AI Researcher & Founder, Sam IT Solutions – Iraq


🪪 License

This model is released under the MIT License.
