OliverPerrin committed · Commit 1ec7405 · Parent(s): 701dfc6
Update LexiMind: improved training, model architecture, and evaluation
- README.md +72 -49
- configs/config.yaml +1 -0
- configs/model/base.yaml +7 -3
- configs/training/dev.yaml +23 -13
- configs/training/full.yaml +22 -11
- configs/training/medium.yaml +23 -13
- docs/api.md +0 -79
- docs/architecture.md +10 -0
- docs/training.md +0 -80
- outputs/training_history.json +51 -51
- scripts/demo_gradio.py +222 -98
- scripts/download_data.py +53 -0
- scripts/process_books.py +231 -0
- scripts/train.py +56 -8
- scripts/visualize_training.py +341 -0
- src/data/dataloader.py +7 -4
- src/inference/pipeline.py +3 -0
- src/inference/postprocessing.py +0 -14
- src/models/decoder.py +57 -16
- src/models/encoder.py +28 -6
- src/models/factory.py +57 -29
- src/models/t5_layer_norm.py +41 -0
- src/training/early_stopping.py +60 -0
- src/training/gradient_monitor.py +102 -0
- src/training/safe_compile.py +38 -72
- src/training/trainer.py +158 -3
- tests/test_inference/test_pipeline.py +26 -4
- tests/test_models/test_visualizations.py +1 -1
README.md
CHANGED
@@ -8,7 +8,7 @@ app_file: scripts/demo_gradio.py
 pinned: false
 ---
 
-
+## LexiMind: A Multi-Task NLP Model
 
 LexiMind is a state-of-the-art Natural Language Processing model designed for complex document understanding. It features a **custom-built Transformer architecture** initialized with weights from Google's **FLAN-T5**, combining the flexibility of from-scratch implementation with the power of modern pre-trained models.
 
@@ -18,32 +18,37 @@ This project is built with industry-standard MLOps practices, including configur…
 
 ## Core Features
 
-* …
-* …
-* …
+* **Abstractive Summarization:** Generates concise, coherent summaries of long-form text using encoder-decoder attention.
+* **Emotion Classification:** Identifies emotions (Joy, Sadness, Anger, Fear, Love, Surprise) conveyed in a document.
+* **Topic Clustering:** Classifies documents into thematic categories (World, Sports, Business, Sci/Tech).
 
 ## Model Architecture
 
 LexiMind implements a **from-scratch Transformer** with modern architectural choices:
 
 ### Custom Transformer Features
-…
+
+* **Pre-Layer Normalization (Pre-LN):** RMSNorm applied before each sublayer for stable training
+* **FlashAttention:** Via PyTorch 2.0's `scaled_dot_product_attention` for efficient computation
+* **Learned Positional Embeddings:** Trainable position representations
+* **Multi-Head Attention:** 12 heads with 768-dimensional representations
+* **RMSNorm:** Modern normalization without bias (more efficient than LayerNorm)
 
 ### Pre-trained Weight Initialization
+
 The model loads weights from **Google's FLAN-T5-base**, which provides:
-…
+
+* Strong language understanding from instruction-tuning
+* Excellent performance on summarization and classification tasks
+* Encoder-decoder architecture matching our custom implementation
 
 ### Multi-Task Learning
+
 A shared encoder-decoder backbone with task-specific heads:
-…
+
+* **Summarization Head:** Language modeling head with weight tying
+* **Emotion Head:** Mean-pooled classification with dropout
+* **Topic Head:** Mean-pooled classification with dropout
 
 ## Technical Specifications
 
@@ -64,29 +69,32 @@ A shared encoder-decoder backbone with task-specific heads:
 
 ### Prerequisites
 
-* …
-* …
-* …
-* …
+* Python 3.10+
+* Poetry for dependency management
+* Docker (for containerized deployment)
+* An NVIDIA GPU with CUDA support (for training and accelerated inference)
 
 ### Installation
 
-1. …
-…
+1. **Clone the repository:**
+
+   ```bash
+   git clone https://github.com/OliverPerrin/LexiMind.git
+   cd LexiMind
+   ```
+
+2. **Install dependencies:**
+
+   ```bash
+   poetry install
+   ```
-3. …
-…
+3. **Download and preprocess data:**
+
+   ```bash
+   poetry run python scripts/download_data.py
+   poetry run python scripts/preprocess_data.py
+   ```
 
 ## Usage
 
@@ -95,12 +103,13 @@ A shared encoder-decoder backbone with task-specific heads:
 All training and model parameters are managed via Hydra. Configurations are located in the `configs/` directory.
 
 Available configurations:
-…
+
+* `model=base` - FLAN-T5-base (default, 12 layers)
+* `model=small` - Smaller model for testing (no pretrained weights)
+* `model=large` - FLAN-T5-large (24 layers, requires more VRAM)
+* `training=dev` - Quick development run
+* `training=medium` - Balanced training (~2-3 hours on RTX 4070)
+* `training=full` - Full training run
 
 ### Training
 
@@ -116,6 +125,9 @@ poetry run python scripts/train.py training=medium
 
 # Override parameters
 poetry run python scripts/train.py training.optimizer.lr=5e-5
+
+# Resume from a checkpoint
+poetry run python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
 ```
 
 Experiments are automatically tracked with MLflow. View results with `mlflow ui`.
@@ -148,7 +160,7 @@ docker run -p 7860:7860 leximind
 
 ## Project Structure
 
-```
+```text
 ├── configs/              # Hydra configuration files
 │   ├── model/            # Model architectures (base, small, large)
 │   ├── training/         # Training configs (dev, medium, full)
@@ -169,22 +181,33 @@ docker run -p 7860:7860 leximind
 
 ## Code Quality
 
-* …
-* …
-* …
+* **Ruff:** Fast linting and formatting
+* **MyPy:** Static type checking
+* **Pytest:** Full test suite covering data, models, and training
+* **Pre-commit hooks:** Automated quality checks
 
 ```bash
+# Install hooks
 poetry run pre-commit install
+
+# Lint
+poetry run ruff check .
+
+# Type check
+poetry run mypy .
+
+# Tests
+poetry run pytest
 ```
 
 ## Performance Optimizations
 
-…
+* **torch.compile:** JIT compilation with Inductor backend
+* **Mixed Precision:** bfloat16 training on Ampere/Ada GPUs
+* **TF32:** Enabled for RTX 30xx/40xx series
+* **KV-Cache:** Efficient autoregressive decoding
+* **FlashAttention:** Memory-efficient attention via SDPA
 
 ## License
 
-MIT License - see [LICENSE](LICENSE) for details.
+MIT License - see [LICENSE](LICENSE) for details.
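The heads listed under Multi-Task Learning are structurally simple. A minimal sketch of a mean-pooled classification head with dropout — a generic formulation for illustration, not necessarily the repo's exact module:

```python
import torch
from torch import nn


class MeanPooledHead(nn.Module):
    """Classify a sequence by mean-pooling encoder states over non-pad tokens."""

    def __init__(self, d_model: int, num_classes: int, dropout: float = 0.1) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(d_model, num_classes)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); mask: (batch, seq), 1 for real tokens
        mask = mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        return self.proj(self.dropout(pooled))


head = MeanPooledHead(d_model=768, num_classes=6)  # six emotions per the README
logits = head(torch.randn(2, 16, 768), torch.ones(2, 16))
print(logits.shape)  # torch.Size([2, 6])
```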
configs/config.yaml
CHANGED
@@ -14,5 +14,6 @@ hydra:
 checkpoint_out: "checkpoints/best.pt"
 labels_out: "artifacts/labels.json"
 history_out: "outputs/training_history.json"
+resume_from: null
 device: "cuda"
 seed: 17
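The new `resume_from` key defaults to null and is overridden from the CLI (see the README's training examples). A sketch of how a training entrypoint might consume it — illustrative only, the actual handling lives in the +56-line change to `scripts/train.py`:

```python
import torch
from omegaconf import DictConfig, OmegaConf


def maybe_resume(model: torch.nn.Module, cfg: DictConfig) -> None:
    """Load checkpoint weights if cfg.resume_from is set (null by default)."""
    if cfg.get("resume_from"):
        state = torch.load(cfg.resume_from, map_location=cfg.get("device", "cpu"))
        model.load_state_dict(state)


cfg = OmegaConf.create({"resume_from": None, "device": "cpu"})
maybe_resume(torch.nn.Linear(8, 8), cfg)  # no-op: resume_from is null
```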
configs/model/base.yaml
CHANGED
@@ -1,8 +1,10 @@
 # FLAN-T5-base architecture
-# …
+# 12 encoder layers, 12 decoder layers, 768 hidden dim
 d_model: 768
-…
-…
+# Align vocab with FLAN-T5 padded size to avoid weight truncation
+vocab_size: 32128
+num_encoder_layers: 12  # T5-base has 12 layers
+num_decoder_layers: 12  # T5-base has 12 layers
 num_attention_heads: 12
 ffn_dim: 2048  # T5 uses d_ff = 2048 for base model
 dropout: 0.1
@@ -10,3 +12,5 @@ activation: gated-gelu  # T5/FLAN-T5 uses gated-gelu (GELU activation with gating)
 use_pretrained: true
 pretrained_model_name: google/flan-t5-base
 use_relative_position_bias: true  # T5 uses relative position bias instead of absolute embeddings
+gradient_checkpointing: false
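The padded vocab size matters when copying FLAN-T5's embedding matrix: the tokenizer reports 32,100 tokens, but the checkpoint's embedding has 32,128 rows. A quick check (assumes `transformers` is installed and can reach the Hub):

```python
from transformers import AutoConfig, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
cfg = AutoConfig.from_pretrained("google/flan-t5-base")
print(len(tok))        # 32100 tokens in the tokenizer
print(cfg.vocab_size)  # 32128 rows in the embedding matrix (padded)
assert cfg.vocab_size == 32128  # must match vocab_size in base.yaml or weights truncate
```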
configs/training/dev.yaml
CHANGED
@@ -1,35 +1,45 @@
 # Development/Testing Configuration for FLAN-T5-base
 # Fast iteration for debugging and testing changes
-# …
+# VRAM Usage: ~8-9GB peak (12GB available)
+# Training time: ~10-15 minutes on RTX 4070 12GB
 # Use: python scripts/train.py training=dev
 
 dataloader:
-  batch_size: …
+  batch_size: 5  # Conservative for 12GB VRAM
   shuffle: true
-  num_workers: …
+  num_workers: 4
   pin_memory: true
   persistent_workers: true
-  prefetch_factor: …
+  prefetch_factor: 2
 
 optimizer:
   name: adamw
-  lr: …
+  lr: 5.0e-5  # Higher LR for faster convergence in dev
   weight_decay: 0.01
-  eps: 1.0e-…
+  eps: 1.0e-8
+  betas: [0.9, 0.999]
 
 scheduler:
   name: cosine
-  warmup_steps: …
+  warmup_steps: 100  # ~2% of training steps for smoother start
 
 trainer:
-  max_epochs: …
+  max_epochs: 3
   gradient_clip_norm: 1.0
-  gradient_accumulation_steps: …
+  gradient_accumulation_steps: 12  # Effective batch: 60 (5*12)
   validation_max_length: 128
   label_smoothing: 0.1
   task_weights:
     summarization: 1.0
-    emotion: …
-    topic: …
-  max_train_samples: …
-  max_val_samples: …
+    emotion: 0.5
+    topic: 0.5
+  max_train_samples: 3000  # 3k samples for better validation
+  max_val_samples: 300
+  early_stopping_patience: 5  # Stop if no improvement
+  log_grad_norm_frequency: 100
+
+  # Disable compile for faster startup in dev
+  compile_encoder: false
+  compile_decoder: false
+
+tokenizer_max_length: 512
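The "Effective batch" comments in these configs are plain arithmetic worth re-checking whenever `batch_size` or `gradient_accumulation_steps` is edited; a tiny sanity sketch covering all three configs:

```python
# Effective batch = per-step batch size × gradient accumulation steps.
for name, batch_size, grad_accum in [("dev", 5, 12), ("medium", 6, 12), ("full", 6, 16)]:
    print(f"{name}: effective batch = {batch_size * grad_accum}")
# dev: 60, medium: 72, full: 96 — matching each config's inline comment
```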
configs/training/full.yaml
CHANGED
@@ -1,33 +1,44 @@
 # Full Training Configuration for FLAN-T5-base
-# Complete training run on all data
-# …
+# Complete training run on all available data
+# VRAM Usage: ~10-11GB peak (12GB available)
+# Training time: ~3-4 hours on RTX 4070 12GB with torch.compile
 # Use: python scripts/train.py training=full
 
 dataloader:
-  batch_size: …
+  batch_size: 6  # Conservative for 12GB VRAM with torch.compile overhead
   shuffle: true
-  num_workers: …
+  num_workers: 4
   pin_memory: true
   persistent_workers: true
-  prefetch_factor: …
+  prefetch_factor: 2
 
 optimizer:
   name: adamw
-  lr: …
+  lr: 3.0e-5  # Higher LR with larger effective batch
   weight_decay: 0.01
   eps: 1.0e-6
+  betas: [0.9, 0.999]
 
 scheduler:
   name: cosine
-  warmup_steps: 1000
+  warmup_steps: 1000  # ~1% warmup for stability
 
 trainer:
-  max_epochs: …
+  max_epochs: 8  # More epochs for full dataset
   gradient_clip_norm: 1.0
-  gradient_accumulation_steps: …
+  gradient_accumulation_steps: 16  # Effective batch: 96 (6*16)
   validation_max_length: 128
   label_smoothing: 0.1
   task_weights:
-    summarization: 1.…
+    summarization: 1.5  # Prioritize summarization quality
     emotion: 1.0
-    topic: …
+    topic: 0.8
+  # No max_samples - use full dataset
+  early_stopping_patience: 3  # Stop if plateaus
+  log_grad_norm_frequency: 100
+
+  # Enable torch.compile for maximum speed
+  compile_encoder: true
+  compile_decoder: true
+
+tokenizer_max_length: 512
|
configs/training/medium.yaml
CHANGED
|
@@ -1,35 +1,45 @@
|
|
| 1 |
# Medium Configuration for FLAN-T5-base
|
| 2 |
# Balanced approach - good results in reasonable time
|
| 3 |
-
#
|
|
|
|
| 4 |
# Use: python scripts/train.py training=medium
|
| 5 |
|
| 6 |
dataloader:
|
| 7 |
-
batch_size:
|
| 8 |
shuffle: true
|
| 9 |
-
num_workers:
|
| 10 |
pin_memory: true
|
| 11 |
persistent_workers: true
|
| 12 |
-
prefetch_factor:
|
| 13 |
|
| 14 |
optimizer:
|
| 15 |
name: adamw
|
| 16 |
-
lr: 3.0e-5
|
| 17 |
weight_decay: 0.01
|
| 18 |
eps: 1.0e-6
|
|
|
|
| 19 |
|
| 20 |
scheduler:
|
| 21 |
name: cosine
|
| 22 |
-
warmup_steps:
|
| 23 |
|
| 24 |
trainer:
|
| 25 |
-
max_epochs:
|
| 26 |
gradient_clip_norm: 1.0
|
| 27 |
-
gradient_accumulation_steps:
|
| 28 |
validation_max_length: 128
|
| 29 |
label_smoothing: 0.1
|
| 30 |
task_weights:
|
| 31 |
-
summarization: 1.
|
| 32 |
-
emotion:
|
| 33 |
-
topic:
|
| 34 |
-
max_train_samples:
|
| 35 |
-
max_val_samples:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Medium Configuration for FLAN-T5-base
|
| 2 |
# Balanced approach - good results in reasonable time
|
| 3 |
+
# VRAM Usage: ~9-10GB peak (12GB available)
|
| 4 |
+
# Training time: ~45-60 minutes on RTX 4070 12GB with torch.compile
|
| 5 |
# Use: python scripts/train.py training=medium
|
| 6 |
|
| 7 |
dataloader:
|
| 8 |
+
batch_size: 6 # Conservative for 12GB VRAM with torch.compile
|
| 9 |
shuffle: true
|
| 10 |
+
num_workers: 4
|
| 11 |
pin_memory: true
|
| 12 |
persistent_workers: true
|
| 13 |
+
prefetch_factor: 2
|
| 14 |
|
| 15 |
optimizer:
|
| 16 |
name: adamw
|
| 17 |
+
lr: 3.0e-5 # Balanced LR for quality
|
| 18 |
weight_decay: 0.01
|
| 19 |
eps: 1.0e-6
|
| 20 |
+
betas: [0.9, 0.999]
|
| 21 |
|
| 22 |
scheduler:
|
| 23 |
name: cosine
|
| 24 |
+
warmup_steps: 500 # ~2% warmup for 25k steps
|
| 25 |
|
| 26 |
trainer:
|
| 27 |
+
max_epochs: 5 # More epochs for better convergence
|
| 28 |
gradient_clip_norm: 1.0
|
| 29 |
+
gradient_accumulation_steps: 12 # Effective batch: 72 (6*12)
|
| 30 |
validation_max_length: 128
|
| 31 |
label_smoothing: 0.1
|
| 32 |
task_weights:
|
| 33 |
+
summarization: 1.2 # Slightly prioritize summarization
|
| 34 |
+
emotion: 0.8
|
| 35 |
+
topic: 0.8
|
| 36 |
+
max_train_samples: 25000 # 25k samples - good balance
|
| 37 |
+
max_val_samples: 2500
|
| 38 |
+
early_stopping_patience: 3
|
| 39 |
+
log_grad_norm_frequency: 100
|
| 40 |
+
|
| 41 |
+
# Enable torch.compile for 1.5-2x speedup
|
| 42 |
+
compile_encoder: true
|
| 43 |
+
compile_decoder: true
|
| 44 |
+
|
| 45 |
+
tokenizer_max_length: 512
|
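`log_grad_norm_frequency` pairs with the new `src/training/gradient_monitor.py` (+102 lines). A sketch of the core computation — the total gradient norm logged every N steps; the module's actual interface is an assumption:

```python
import torch


def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients, the quantity clipped and logged."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().norm(2).item() ** 2
    return total**0.5


model = torch.nn.Linear(4, 4)
model(torch.randn(2, 4)).sum().backward()
step, log_every = 100, 100
if step % log_every == 0:  # log_grad_norm_frequency: 100
    print(f"grad_norm={global_grad_norm(model):.3f}")
```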
docs/api.md
DELETED
@@ -1,79 +0,0 @@
-# API & CLI Documentation
-
-## FastAPI Service
-The FastAPI application is defined in `src/api/app.py` and wires routes from
-`src/api/routes.py`. All dependencies resolve through `src/api/dependencies.py`, which lazily constructs the shared inference pipeline.
-
-### POST `/summarize`
-- **Request Body** (`SummaryRequest`):
-```json
-{
-  "text": "Your input document"
-}
-```
-- **Response** (`SummaryResponse`):
-```json
-{
-  "summary": "Generated abstractive summary",
-  "emotion_labels": ["joy", "surprise"],
-  "emotion_scores": [0.91, 0.63],
-  "topic": "news",
-  "topic_confidence": 0.82
-}
-```
-- **Behaviour:**
-  1. Text is preprocessed through `TextPreprocessor` (with optional sklearn transformer if configured).
-  2. The multitask model generates a summary via greedy decoding.
-  3. Emotion and topic heads produce logits which are converted to probabilities and mapped to
-     human-readable labels using `artifacts/labels.json`.
-  4. Results are returned as structured JSON suitable for a future Gradio interface.
-
-### Error Handling
-- If the checkpoint or label metadata is missing, the dependency raises an HTTP 503 error with
-  an explanatory message.
-- Validation errors (missing `text`) are handled automatically by FastAPI/Pydantic.
-
-## Command-Line Interface
-`scripts/inference.py` provides a CLI that mirrors the API behaviour.
-
-### Usage
-```bash
-python scripts/inference.py "Document to analyse" \
-  --checkpoint checkpoints/best.pt \
-  --labels artifacts/labels.json \
-  --tokenizer artifacts/hf_tokenizer \
-  --model-config configs/model/base.yaml \
-  --device cpu
-```
-
-Options:
-- `text` – zero or more positional arguments. If omitted, use `--file` to point to a newline
-  delimited text file.
-- `--file` – optional path containing one text per line.
-- `--checkpoint` – path to the trained model weights.
-- `--labels` – JSON containing emotion/topic vocabularies (defaults to `artifacts/labels.json`).
-- `--tokenizer` – optional tokenizer directory; defaults to the exported artifact if present.
-- `--model-config` – YAML describing the architecture.
-- `--device` – `cpu` or `cuda`. Passing `cuda` attempts to run inference on GPU.
-- `--summary-max-length` – overrides the default maximum generation length.
-
-### Output
-The CLI prints a JSON array where each entry contains the original text, summary, emotion labels
-with scores, and topic prediction. This format is identical to the REST response, facilitating
-integration tests and future Gradio UI rendering.
-
-## Future Gradio UI
-- The planned UI will call the same inference pipeline and display results interactively.
-- Given the response schema, the UI can show:
-  - Generated summary text.
-  - Emotion chips with probability bars.
-  - Topic confidence gauges.
-  - Placeholder panel for attention heatmaps and explanations.
-- Once implemented, documentation updates will add a `docs/ui.md` section and screenshots under
-  `docs/images/`.
-
-## Testing
-- `tests/test_api/test_routes.py` stubs the pipeline to ensure response fields and dependency
-  overrides behave as expected.
-- `tests/test_inference/test_pipeline.py` validates pipeline methods end-to-end with dummy models,
-  guaranteeing API and CLI consumers receive consistent payload shapes.
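For reference, calling the `/summarize` endpoint the deleted doc describes was a short script; a sketch with the `requests` library, assuming a local dev server on port 8000:

```python
import requests

resp = requests.post(
    "http://localhost:8000/summarize",  # assumed local dev URL
    json={"text": "Your input document"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["summary"])  # fields per the SummaryResponse schema above
```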
docs/architecture.md
CHANGED
@@ -1,6 +1,7 @@
 # LexiMind Architecture
 
 ## Overview
+
 LexiMind couples a from-scratch Transformer implementation with a modern data and inference stack. The project consists of three major layers:
 
 1. **Data & Preprocessing** – lightweight text cleaning built on top of scikit-learn
@@ -15,6 +16,7 @@ LexiMind couples a from-scratch Transformer implementation with a modern data and inference stack.
 The custom Transformer is designed with **modern architectural choices** while maintaining compatibility with pre-trained weights from Google's **FLAN-T5**.
 
 ### Architecture Highlights
+
 - **Pre-Layer Normalization (Pre-LN):** RMSNorm applied *before* each sublayer for stable training
 - **RMSNorm:** More efficient than LayerNorm (no mean computation, no bias parameters)
 - **FlashAttention:** Via PyTorch 2.0's `F.scaled_dot_product_attention` for O(N) memory
@@ -22,7 +24,9 @@ The custom Transformer is designed with **modern architectural choices** while maintaining compatibility with pre-trained weights.
 - **Multi-Head Attention:** 12 heads with optional LoRA adapters and RoPE support
 
 ### Weight Loading from FLAN-T5
+
 The `factory.py` module loads weights from FLAN-T5-base, which uses a compatible Pre-LN architecture:
+
 - **Token embeddings:** Shared between encoder and decoder
 - **Attention projections:** Q, K, V, O weights (bias initialized to zero since T5 has no attention bias)
 - **FFN weights:** `wi_1` → `linear1`, `wo` → `linear2` (T5 uses gated FFN; we use the up/down projections)
@@ -32,6 +36,7 @@ The `factory.py` module loads weights from FLAN-T5-base, which uses a compatible Pre-LN architecture.
 **Note:** T5 uses *relative position bias* computed in attention, not absolute embeddings. Our learned positional embeddings are randomly initialized and train quickly during fine-tuning.
 
 ### File Structure
+
 - `src/models/encoder.py` – TransformerEncoder with Pre-LN RMSNorm blocks
 - `src/models/decoder.py` – TransformerDecoder with KV-cache for efficient generation
 - `src/models/attention.py` – Multi-Head Attention with FlashAttention, LoRA, and RoPE support
@@ -40,16 +45,19 @@ The `factory.py` module loads weights from FLAN-T5-base, which uses a compatible Pre-LN architecture.
 - `src/models/factory.py` – Builds models and loads FLAN-T5 weights
 
 ## Data, Tokenization, and Preprocessing
+
 - `src/data/tokenization.py` wraps `AutoTokenizer` (configured for FLAN-T5) to provide tensor-aware batching and helper utilities for decoder input shifting.
 - `src/data/preprocessing.py` introduces `TextPreprocessor`, layering a `BasicTextCleaner` with optional scikit-learn transformers.
 - `src/data/dataset.py` and `src/data/dataloader.py` define strongly typed dataset containers and collators.
 
 ### T5 Tokenizer Differences
+
 - **Vocab size:** 32,128 tokens (SentencePiece)
 - **Special tokens:** pad=0, eos=1 (no explicit BOS; decoder starts with pad token)
 - **Subword tokenization:** Unigram-based (vs BART's BPE)
 
 ## Training Pipeline
+
 - `src/training/trainer.py` coordinates multi-task optimization with:
   - Mixed precision training (bfloat16 on Ampere/Ada GPUs)
   - Gradient accumulation for larger effective batch sizes
@@ -58,12 +66,14 @@ The `factory.py` module loads weights from FLAN-T5-base, which uses a compatible Pre-LN architecture.
 - Metrics in `src/training/metrics.py` include accuracy, multi-label F1, and ROUGE-like overlap
 
 ## Inference & Serving
+
 - `src/inference/pipeline.py` exposes summarization, emotion, and topic predictions with shared pre-processing, generation, and thresholding logic.
 - `src/inference/factory.py` rebuilds the full pipeline using the exported tokenizer artifact
 - The CLI (`scripts/inference.py`) drives the pipeline from the command line
 - Gradio demo (`scripts/demo_gradio.py`) provides a web interface
 
 ## Key Decisions
+
 - **Custom Transformer + Pre-trained Weights:** Building from scratch demonstrates deep understanding while leveraging FLAN-T5's language knowledge
 - **Pre-LN RMSNorm:** Modern architecture used by LLaMA, T5 v1.1, and other 2023-2025 models
 - **Tokenizer Artifact Preference:** Inference favors `artifacts/hf_tokenizer` for reproducibility
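The commit also adds `src/models/t5_layer_norm.py` (+41 lines). T5-style RMSNorm is small enough to sketch in full; this is the standard formulation (scale only, no mean subtraction, no bias), not necessarily the repo's exact code:

```python
import torch
from torch import nn


class T5LayerNorm(nn.Module):
    """RMSNorm as used by T5/FLAN-T5: rescale by root-mean-square, learned gain only."""

    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute variance in float32 for numerical stability, as T5 does.
        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x.to(self.weight.dtype)


norm = T5LayerNorm(768)
print(norm(torch.randn(2, 4, 768)).shape)  # torch.Size([2, 4, 768])
```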
docs/training.md
DELETED
@@ -1,80 +0,0 @@
-# Training Procedure
-
-## Data Sources
-- **Summarization** – expects JSONL files with `source` and `summary` fields under
-  `data/processed/summarization`.
-- **Emotion Classification** – multi-label samples loaded from JSONL files with
-  `text` and `emotions` arrays. The dataset owns a `MultiLabelBinarizer` for consistent encoding.
-- **Topic Classification** – single-label categorical samples with `text` and `topic` fields, encoded via `LabelEncoder`.
-
-Paths and tokenizer defaults are configured in `configs/data/datasets.yaml`. The tokenizer section chooses the Hugging Face backbone (`google/flan-t5-base` by default) and maximum length. Gutenberg book downloads are controlled via the `downloads.books` list (each entry includes `name`, `url`, and `output`).
-
-## Dataloaders & Collators
-- `SummarizationCollator` encodes encoder/decoder inputs, prepares decoder input IDs via `Tokenizer.prepare_decoder_inputs`, and masks padding tokens with `-100` for loss computation. Note: FLAN-T5 uses `pad_token_id=0` and `decoder_start_token_id=0`.
-- `EmotionCollator` applies the dataset's `MultiLabelBinarizer`, returning dense float tensors suitable for `BCEWithLogitsLoss`.
-- `TopicCollator` emits integer class IDs via the dataset's `LabelEncoder` for `CrossEntropyLoss`.
-
-These collators keep all tokenization centralized, reducing duplication and making it easy to swap in additional sklearn transformations through `TextPreprocessor` should we wish to extend cleaning or normalization.
-
-## Model Assembly
-- `src/models/factory.build_multitask_model` rebuilds the encoder, decoder, and heads from the tokenizer metadata and YAML config. This factory is used both during training and inference to eliminate drift between environments.
-- Pretrained weights are loaded from FLAN-T5 using `_load_t5_weights()`, which transfers:
-  - Shared token embeddings (with proper scaling)
-  - Attention projections (q, k, v, o) for all encoder/decoder layers
-  - FFN weights (wi_0, wi_1 for gated activation, wo for output)
-  - Layer normalization parameters (mapped from T5's RMSNorm)
-- The model wraps:
-  - Transformer encoder/decoder stacks with **Pre-LN RMSNorm** architecture.
-  - LM head tied to decoder embeddings for summarization.
-  - Mean-pooled classification heads for emotion and topic tasks.
-
-## Optimisation Loop
-- `src/training/trainer.Trainer` orchestrates multi-task training.
-- Cross-entropy is used for summarization (seq2seq logits vs. shifted labels).
-- `BCEWithLogitsLoss` handles multi-label emotions.
-- `CrossEntropyLoss` handles topic classification.
-- Gradient clipping ensures stability, and per-task weights can be configured via
-  `TrainerConfig.task_weights` to balance gradients if needed.
-- Metrics tracked per task:
-  - **Summarization** – ROUGE-like overlap metric (`training.metrics.rouge_like`).
-  - **Emotion** – micro F1 score for multi-label predictions.
-  - **Topic** – categorical accuracy.
-
-## Checkpoints & Artifacts
-- `src/utils/io.save_state` stores model weights; checkpoints live under `checkpoints/`.
-- `artifacts/labels.json` captures the ordered emotion/topic vocabularies immediately after
-  training. This file is required for inference so class indices map back to human-readable labels.
-- The tokenizer is exported to `artifacts/hf_tokenizer/` for reproducible vocabularies using `scripts/export_tokenizer.py`.
-
-## Running Training
-1. Ensure processed datasets are available (see `data/processed/` structure).
-2. Export the FLAN-T5 tokenizer: `python scripts/export_tokenizer.py`
-3. Choose a configuration (e.g., `configs/training/dev.yaml`) for hyperparameters and data splits.
-4. Instantiate the tokenizer via `TokenizerConfig` and build datasets/dataloaders.
-5. Use `build_multitask_model` to construct the model with FLAN-T5 weights, create an optimizer, and run
-   `Trainer.fit(train_loaders, val_loaders)`.
-6. Save checkpoints and update `artifacts/labels.json` with the dataset label order.
-
-```bash
-# Quick start
-python scripts/export_tokenizer.py        # Export FLAN-T5 tokenizer
-python scripts/train.py training=dev      # Run dev training (2 epochs)
-python scripts/train.py training=medium   # Run medium training (5 epochs)
-python scripts/train.py training=full     # Run full training (10 epochs)
-```
-
-## Why FLAN-T5?
-LexiMind's custom Transformer uses **Pre-LN (normalization before sublayers)** with **RMSNorm**. This modern architecture choice provides:
-- Better gradient flow during training
-- Improved training stability
-- Faster convergence
-
-FLAN-T5 uses the same Pre-LN RMSNorm architecture, making weight transfer straightforward. The previously used BART (Post-LN LayerNorm) had a fundamental architectural mismatch that caused training issues.
-
-> **Note:** T5's relative position bias is NOT transferred. The model uses learned positional encodings which train from scratch. This is fine since positional information is task-specific.
-
-## Future Enhancements
-- Integrate curriculum scheduling or task-balanced sampling once empirical results dictate.
-- Capture attention maps during training to support visualization in the planned Gradio UI.
-- Leverage the optional `sklearn_transformer` hook in `TextPreprocessor` for lemmatization or domain-specific normalization when datasets require it.
-- Experiment with FLAN-T5-large for improved performance on longer sequences.
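The `-100` padding mask described in the deleted collator notes is the standard PyTorch convention (`nn.CrossEntropyLoss` ignores index `-100` by default); a sketch of how a collator applies it, with illustrative token ids:

```python
import torch

pad_token_id = 0  # FLAN-T5's pad token
labels = torch.tensor([[37, 19, 8, 0, 0]])  # padded summary token ids
labels = labels.masked_fill(labels == pad_token_id, -100)
print(labels)  # tensor([[  37,   19,    8, -100, -100]])
# Positions set to -100 contribute nothing to the cross-entropy loss.
```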
outputs/training_history.json
CHANGED
@@ -1,59 +1,59 @@
 {
-… (previous run's epoch entries; values truncated in this view)
+  "train_epoch_6": {
+    "summarization_loss": 3.2071112584752606,
+    "summarization_rouge_like": 0.41666206128984185,
+    "emotion_loss": 0.13381094067425187,
+    "emotion_f1": 0.1527181073975268,
+    "topic_loss": 0.6847172836312407,
+    "topic_accuracy": 0.7834830254758819,
+    "total_loss": 5.492251664781721,
+    "epoch": 6.0
   },
+  "val_epoch_6": {
+    "summarization_loss": 2.988837990901862,
+    "summarization_rouge_like": 0.4475286348323649,
+    "emotion_loss": 0.1262940275061054,
+    "emotion_f1": 0.19359053170564663,
+    "topic_loss": 0.7910004459155627,
+    "topic_accuracy": 0.754854122191724,
+    "epoch": 6.0
   },
+  "train_epoch_7": {
+    "summarization_loss": 3.184010818695097,
+    "summarization_rouge_like": 0.41903763419721,
+    "emotion_loss": 0.12498181367997213,
+    "emotion_f1": 0.2043521878681856,
+    "topic_loss": 0.6483695249464139,
+    "topic_accuracy": 0.796684177822936,
+    "total_loss": 5.419693668500609,
+    "epoch": 7.0
   },
+  "val_epoch_7": {
+    "summarization_loss": 2.985372142407835,
+    "summarization_rouge_like": 0.44758863369550994,
+    "emotion_loss": 0.1185748163268729,
+    "emotion_f1": 0.2514045691051182,
+    "topic_loss": 0.7817700606483663,
+    "topic_accuracy": 0.7554132357426027,
+    "epoch": 7.0
   },
+  "train_epoch_8": {
+    "summarization_loss": 3.171688149997974,
+    "summarization_rouge_like": 0.4206951155149097,
+    "emotion_loss": 0.12107599671589805,
+    "emotion_f1": 0.2286830931525678,
+    "topic_loss": 0.6216138880150013,
+    "topic_accuracy": 0.8049539626051729,
+    "total_loss": 5.375899340986727,
+    "epoch": 8.0
   },
+  "val_epoch_8": {
+    "summarization_loss": 2.984391659270994,
+    "summarization_rouge_like": 0.44770155741256373,
+    "emotion_loss": 0.11704520378562873,
+    "emotion_f1": 0.26809326239605075,
+    "topic_loss": 0.7841400383105634,
+    "topic_accuracy": 0.7546508081732227,
+    "epoch": 8.0
   }
 }
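The new `scripts/visualize_training.py` (+341 lines) presumably charts this history; a minimal sketch of reading the file and plotting one metric — keys come from the JSON above, the plotting choices are assumptions:

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

history = json.loads(Path("outputs/training_history.json").read_text())
epochs = sorted({int(v["epoch"]) for v in history.values()})
train = [history[f"train_epoch_{e}"]["summarization_loss"] for e in epochs]
val = [history[f"val_epoch_{e}"]["summarization_loss"] for e in epochs]

plt.plot(epochs, train, label="train")
plt.plot(epochs, val, label="val")
plt.xlabel("epoch")
plt.ylabel("summarization loss")
plt.legend()
plt.savefig("outputs/summarization_loss.png")
```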
scripts/demo_gradio.py
CHANGED
|
@@ -4,10 +4,10 @@ Gradio demo for LexiMind multi-task NLP model.
|
|
| 4 |
Showcases the model's capabilities across three tasks:
|
| 5 |
- Summarization: Generates concise summaries of input text
|
| 6 |
- Emotion Detection: Multi-label emotion classification
|
| 7 |
-
- Topic Classification: Categorizes text into
|
| 8 |
|
| 9 |
Author: Oliver Perrin
|
| 10 |
-
Date: 2025-12-
|
| 11 |
"""
|
| 12 |
|
| 13 |
from __future__ import annotations
|
|
@@ -38,24 +38,12 @@ logger = get_logger(__name__)
|
|
| 38 |
|
| 39 |
OUTPUTS_DIR = PROJECT_ROOT / "outputs"
|
| 40 |
EVAL_REPORT_PATH = OUTPUTS_DIR / "evaluation_report.json"
|
|
|
|
| 41 |
|
| 42 |
SAMPLE_TEXTS = [
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
"patterns with unprecedented accuracy. From healthcare to finance, AI is "
|
| 47 |
-
"revolutionizing industries worldwide."
|
| 48 |
-
),
|
| 49 |
-
(
|
| 50 |
-
"The team's incredible comeback in the final quarter left fans in tears of joy. "
|
| 51 |
-
"After trailing by 20 points, they scored three consecutive touchdowns to secure "
|
| 52 |
-
"their first championship victory in over a decade."
|
| 53 |
-
),
|
| 54 |
-
(
|
| 55 |
-
"Global markets tumbled today as investors reacted to rising inflation concerns. "
|
| 56 |
-
"The Federal Reserve hinted at potential interest rate hikes, sending shockwaves "
|
| 57 |
-
"through technology and banking sectors."
|
| 58 |
-
),
|
| 59 |
]
|
| 60 |
|
| 61 |
# --------------- Pipeline Management ---------------
|
|
@@ -94,27 +82,62 @@ def get_pipeline():
|
|
| 94 |
def analyze(text: str) -> tuple[str, str, str]:
|
| 95 |
"""Run all three tasks and return formatted results."""
|
| 96 |
if not text or not text.strip():
|
| 97 |
-
return "
|
| 98 |
|
| 99 |
try:
|
| 100 |
pipe = get_pipeline()
|
| 101 |
|
| 102 |
# Run tasks
|
| 103 |
-
summary = pipe.summarize([text], max_length=128)[0].strip()
|
| 104 |
-
|
|
|
|
|
|
|
|
|
|
| 105 |
topic = pipe.predict_topics([text])[0]
|
| 106 |
|
| 107 |
-
# Format emotions
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 108 |
if emotions.labels:
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
|
|
|
| 113 |
else:
|
| 114 |
-
emotion_str = "No strong emotions detected"
|
| 115 |
|
| 116 |
# Format topic
|
| 117 |
-
topic_str = f"**{topic.label}
|
| 118 |
|
| 119 |
return summary, emotion_str, topic_str
|
| 120 |
|
|
@@ -125,75 +148,138 @@ def analyze(text: str) -> tuple[str, str, str]:
|
|
| 125 |
|
| 126 |
def load_metrics() -> str:
|
| 127 |
"""Load evaluation metrics and format as markdown."""
|
| 128 |
-
|
| 129 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 130 |
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 134 |
|
| 135 |
-
|
| 136 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 137 |
|
| 138 |
-
|
| 139 |
-
|------|--------|-------|
|
| 140 |
-
| **Emotion** | F1 Macro | **{r["emotion"]["f1_macro"]:.1%}** |
|
| 141 |
-
| **Topic** | Accuracy | **{r["topic"]["accuracy"]:.1%}** |
|
| 142 |
-
| **Summarization** | ROUGE-Like | {r["summarization"]["rouge_like"]:.1%} |
|
| 143 |
-
| **Summarization** | BLEU | {r["summarization"]["bleu"]:.1%} |
|
| 144 |
|
| 145 |
-
### Topic Classification (per-class)
|
| 146 |
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
| Sports | {r["topic"]["classification_report"]["Sports"]["precision"]:.1%} | {r["topic"]["classification_report"]["Sports"]["recall"]:.1%} | {r["topic"]["classification_report"]["Sports"]["f1-score"]:.1%} |
|
| 152 |
-
| World | {r["topic"]["classification_report"]["World"]["precision"]:.1%} | {r["topic"]["classification_report"]["World"]["recall"]:.1%} | {r["topic"]["classification_report"]["World"]["f1-score"]:.1%} |
|
| 153 |
-
"""
|
| 154 |
-
except Exception as e:
|
| 155 |
-
return f"Error loading metrics: {e}"
|
| 156 |
|
| 157 |
|
| 158 |
# --------------- Gradio Interface ---------------
|
| 159 |
|
| 160 |
with gr.Blocks(
|
| 161 |
-
title="LexiMind
|
| 162 |
theme=gr.themes.Soft(),
|
| 163 |
-
css=".output-box { min-height: 80px; }",
|
| 164 |
) as demo:
|
| 165 |
gr.Markdown(
|
| 166 |
"""
|
| 167 |
# 🧠 LexiMind
|
| 168 |
### Multi-Task Transformer for Document Analysis
|
| 169 |
|
| 170 |
-
A custom encoder-decoder Transformer trained on summarization
|
| 171 |
-
and topic classification. Built from scratch with PyTorch.
|
|
|
|
|
|
|
| 172 |
"""
|
| 173 |
)
|
| 174 |
|
| 175 |
# --------------- Try It Tab ---------------
|
| 176 |
with gr.Tab("🚀 Try It"):
|
| 177 |
with gr.Row():
|
| 178 |
-
with gr.Column(scale=
|
| 179 |
text_input = gr.Textbox(
|
| 180 |
-
label="Input Text",
|
| 181 |
-
lines=
|
| 182 |
-
placeholder="Enter text to analyze...",
|
| 183 |
value=SAMPLE_TEXTS[0],
|
| 184 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 185 |
with gr.Row():
|
| 186 |
-
|
| 187 |
-
gr.
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
|
|
|
| 192 |
|
| 193 |
with gr.Column(scale=2):
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 197 |
|
| 198 |
analyze_btn.click(
|
| 199 |
fn=analyze,
|
|
@@ -203,9 +289,35 @@ with gr.Blocks(
|
|
| 203 |
|
| 204 |
# --------------- Metrics Tab ---------------
|
| 205 |
with gr.Tab("📊 Metrics"):
|
| 206 |
-
gr.
|
| 207 |
-
|
| 208 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
|
| 210 |
# --------------- Architecture Tab ---------------
|
| 211 |
with gr.Tab("🔧 Architecture"):
|
|
@@ -213,28 +325,34 @@ with gr.Blocks(
|
|
| 213 |
"""
|
| 214 |
### Model Architecture
|
| 215 |
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
|
| 220 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 221 |
|
| 222 |
-
###
|
| 223 |
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
|
|
|
|
|
|
| 227 |
"""
|
| 228 |
)
|
| 229 |
-
with gr.Row():
|
| 230 |
-
gr.Image(
|
| 231 |
-
str(OUTPUTS_DIR / "attention_visualization.png"),
|
| 232 |
-
label="Self-Attention Pattern",
|
| 233 |
-
)
|
| 234 |
-
gr.Image(
|
| 235 |
-
str(OUTPUTS_DIR / "positional_encoding_heatmap.png"),
|
| 236 |
-
label="Positional Encodings",
|
| 237 |
-
)
|
| 238 |
|
| 239 |
# --------------- About Tab ---------------
|
| 240 |
with gr.Tab("ℹ️ About"):
|
|
@@ -242,22 +360,28 @@ with gr.Blocks(
|
|
| 242 |
"""
|
| 243 |
### About LexiMind
|
| 244 |
|
| 245 |
-
LexiMind is a
|
| 246 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 247 |
|
| 248 |
-
- **
|
| 249 |
-
- **
|
| 250 |
-
- **
|
| 251 |
-
- **Comprehensive evaluation** with multiple metrics
|
| 252 |
|
| 253 |
### Links
|
| 254 |
|
| 255 |
- 🔗 [GitHub Repository](https://github.com/OliverPerrin/LexiMind)
|
| 256 |
-
- 🤗 [HuggingFace
|
| 257 |
|
| 258 |
-
|
| 259 |
|
| 260 |
-
**Oliver Perrin**
|
| 261 |
"""
|
| 262 |
)
|
| 263 |
|
|
|
|
| 4 |
Showcases the model's capabilities across three tasks:
|
| 5 |
- Summarization: Generates concise summaries of input text
|
| 6 |
- Emotion Detection: Multi-label emotion classification
|
| 7 |
+
- Topic Classification: Categorizes text into topics
|
| 8 |
|
| 9 |
Author: Oliver Perrin
|
| 10 |
+
Date: 2025-12-05
|
| 11 |
"""
|
| 12 |
|
| 13 |
from __future__ import annotations
|
|
|
|
| 38 |
|
| 39 |
OUTPUTS_DIR = PROJECT_ROOT / "outputs"
|
| 40 |
EVAL_REPORT_PATH = OUTPUTS_DIR / "evaluation_report.json"
|
| 41 |
+
TRAINING_HISTORY_PATH = OUTPUTS_DIR / "training_history.json"
|
| 42 |
|
| 43 |
SAMPLE_TEXTS = [
|
| 44 |
+
"Global markets tumbled today as investors reacted to rising inflation concerns. The Federal Reserve hinted at potential interest rate hikes, sending shockwaves through technology and banking sectors. Analysts predict continued volatility as economic uncertainty persists.",
|
| 45 |
+
"Scientists at MIT have developed a breakthrough quantum computing chip that operates at room temperature. This advancement could revolutionize drug discovery, cryptography, and artificial intelligence. The research team published their findings in Nature.",
|
| 46 |
+
"The championship game ended in dramatic fashion as the underdog team scored in the final seconds to secure victory. Fans rushed the field in celebration, marking the team's first title in 25 years.",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
]
|
| 48 |
|
| 49 |
# --------------- Pipeline Management ---------------
|
|
|
|
| 82 |
def analyze(text: str) -> tuple[str, str, str]:
|
| 83 |
"""Run all three tasks and return formatted results."""
|
| 84 |
if not text or not text.strip():
|
| 85 |
+
return "Please enter text above to analyze.", "", ""
|
| 86 |
|
| 87 |
try:
|
| 88 |
pipe = get_pipeline()
|
| 89 |
|
| 90 |
# Run tasks
|
| 91 |
+
summary = pipe.summarize([text], max_length=128)[0].strip()
|
| 92 |
+
if not summary:
|
| 93 |
+
summary = "(Unable to generate summary)"
|
| 94 |
+
|
| 95 |
+
emotions = pipe.predict_emotions([text], threshold=0.3)[0] # Lower threshold
|
| 96 |
topic = pipe.predict_topics([text])[0]
|
| 97 |
|
| 98 |
+
# Format emotions with emoji
|
| 99 |
+
emotion_emoji = {
|
| 100 |
+
"joy": "😊",
|
| 101 |
+
"love": "❤️",
|
| 102 |
+
"anger": "😠",
|
| 103 |
+
"fear": "😨",
|
| 104 |
+
"sadness": "😢",
|
| 105 |
+
"surprise": "😲",
|
| 106 |
+
"neutral": "😐",
|
| 107 |
+
"admiration": "🤩",
|
| 108 |
+
"amusement": "😄",
|
| 109 |
+
"annoyance": "😤",
|
| 110 |
+
"approval": "👍",
|
| 111 |
+
"caring": "🤗",
|
| 112 |
+
"confusion": "😕",
|
| 113 |
+
"curiosity": "🤔",
|
| 114 |
+
"desire": "😍",
|
| 115 |
+
"disappointment": "😞",
|
| 116 |
+
"disapproval": "👎",
|
| 117 |
+
"disgust": "🤢",
|
| 118 |
+
"embarrassment": "😳",
|
| 119 |
+
"excitement": "🎉",
|
| 120 |
+
"gratitude": "🙏",
|
| 121 |
+
"grief": "😭",
|
| 122 |
+
"nervousness": "��",
|
| 123 |
+
"optimism": "🌟",
|
| 124 |
+
"pride": "🦁",
|
| 125 |
+
"realization": "💡",
|
| 126 |
+
"relief": "😌",
|
| 127 |
+
"remorse": "😔",
|
| 128 |
+
}
|
| 129 |
+
|
| 130 |
if emotions.labels:
|
| 131 |
+
emotion_parts = []
|
| 132 |
+
for lbl, score in zip(emotions.labels[:5], emotions.scores[:5], strict=False):
|
| 133 |
+
emoji = emotion_emoji.get(lbl.lower(), "•")
|
| 134 |
+
emotion_parts.append(f"{emoji} **{lbl.title()}** ({score:.0%})")
|
| 135 |
+
emotion_str = "\n".join(emotion_parts)
|
| 136 |
else:
|
| 137 |
+
emotion_str = "😐 No strong emotions detected"
|
| 138 |
|
| 139 |
# Format topic
|
| 140 |
+
topic_str = f"**{topic.label}**\n\nConfidence: {topic.confidence:.0%}"
|
| 141 |
|
| 142 |
return summary, emotion_str, topic_str

...

def load_metrics() -> str:
    """Load evaluation metrics and format as markdown."""
+    # Load evaluation report
+    eval_metrics = {}
+    if EVAL_REPORT_PATH.exists():
+        try:
+            with open(EVAL_REPORT_PATH) as f:
+                eval_metrics = json.load(f)
+        except Exception:
+            pass
+
+    # Load training history
+    train_metrics = {}
+    if TRAINING_HISTORY_PATH.exists():
+        try:
+            with open(TRAINING_HISTORY_PATH) as f:
+                train_metrics = json.load(f)
+        except Exception:
+            pass
+
+    # Get final validation metrics
+    val_final = train_metrics.get("val_epoch_3", {})
+
+    md = """
+## 📈 Model Performance
+
+### Training Results (3 Epochs)
+
+| Task | Metric | Final Score |
+|------|--------|-------------|
+| **Topic Classification** | Accuracy | **{topic_acc:.1%}** |
+| **Emotion Detection** | F1 (training) | {emo_f1:.1%} |
+| **Summarization** | ROUGE-like | {rouge:.1%} |
+
+### Evaluation Results
+
+| Metric | Value |
+|--------|-------|
+| Topic Accuracy | **{eval_topic:.1%}** |
+| Emotion F1 (macro) | {eval_emo:.1%} |
+| ROUGE-like | {eval_rouge:.1%} |
+| BLEU | {eval_bleu:.3f} |
+
+---
+
+### Topic Classification Details

+| Category | Precision | Recall | F1 |
+|----------|-----------|--------|-----|
+""".format(
+        topic_acc=val_final.get("topic_accuracy", 0),
+        emo_f1=val_final.get("emotion_f1", 0),
+        rouge=val_final.get("summarization_rouge_like", 0),
+        eval_topic=eval_metrics.get("topic", {}).get("accuracy", 0),
+        eval_emo=eval_metrics.get("emotion", {}).get("f1_macro", 0),
+        eval_rouge=eval_metrics.get("summarization", {}).get("rouge_like", 0),
+        eval_bleu=eval_metrics.get("summarization", {}).get("bleu", 0),
+    )

+    # Add per-class metrics
+    topic_report = eval_metrics.get("topic", {}).get("classification_report", {})
+    for cat, metrics in topic_report.items():
+        if cat in ["macro avg", "weighted avg", "micro avg"]:
+            continue
+        if isinstance(metrics, dict):
+            md += f"| {cat} | {metrics.get('precision', 0):.1%} | {metrics.get('recall', 0):.1%} | {metrics.get('f1-score', 0):.1%} |\n"

+    return md


+def get_viz_path(filename: str) -> str | None:
+    """Get visualization path if file exists."""
+    path = OUTPUTS_DIR / filename
+    return str(path) if path.exists() else None

# --------------- Gradio Interface ---------------

with gr.Blocks(
+    title="LexiMind - Multi-Task NLP",
    theme=gr.themes.Soft(),
) as demo:
    gr.Markdown(
        """
        # 🧠 LexiMind
        ### Multi-Task Transformer for Document Analysis

+        A custom encoder-decoder Transformer trained on **summarization**, **emotion detection** (28 classes),
+        and **topic classification** (10 categories). Built from scratch with PyTorch.
+
+        > ⚠️ **Note**: Summarization is experimental - the model works best on news-style articles.
        """
    )

    # --------------- Try It Tab ---------------
    with gr.Tab("🚀 Try It"):
        with gr.Row():
+            with gr.Column(scale=3):
                text_input = gr.Textbox(
+                    label="📝 Input Text",
+                    lines=6,
+                    placeholder="Enter or paste text to analyze (works best with news articles)...",
                    value=SAMPLE_TEXTS[0],
                )
+                analyze_btn = gr.Button(
+                    "🔍 Analyze",
+                    variant="primary",
+                    size="sm",
+                )
+
+                gr.Markdown("**Sample Texts** (click to use):")
                with gr.Row():
+                    sample1_btn = gr.Button("📰 Markets", size="sm", variant="secondary")
+                    sample2_btn = gr.Button("🔬 Science", size="sm", variant="secondary")
+                    sample3_btn = gr.Button("🏆 Sports", size="sm", variant="secondary")
+
+                sample1_btn.click(fn=lambda: SAMPLE_TEXTS[0], outputs=text_input)
+                sample2_btn.click(fn=lambda: SAMPLE_TEXTS[1], outputs=text_input)
+                sample3_btn.click(fn=lambda: SAMPLE_TEXTS[2], outputs=text_input)

            with gr.Column(scale=2):
+                gr.Markdown("### Results")
+                summary_out = gr.Textbox(
+                    label="📝 Summary",
+                    lines=3,
+                    interactive=False,
+                )
+                with gr.Row():
+                    with gr.Column():
+                        gr.Markdown("**😊 Emotions**")
+                        emotion_out = gr.Markdown(value="*Run analysis*")
+                    with gr.Column():
+                        gr.Markdown("**📂 Topic**")
+                        topic_out = gr.Markdown(value="*Run analysis*")

        analyze_btn.click(
            fn=analyze,
            ...
        )

    # --------------- Metrics Tab ---------------
    with gr.Tab("📊 Metrics"):
+        with gr.Row():
+            with gr.Column(scale=2):
+                gr.Markdown(load_metrics())
+            with gr.Column(scale=1):
+                confusion_path = get_viz_path("topic_confusion_matrix.png")
+                if confusion_path:
+                    gr.Image(confusion_path, label="Confusion Matrix", show_label=True)
+
+    # --------------- Visualizations Tab ---------------
+    with gr.Tab("🎨 Visualizations"):
+        gr.Markdown("### Model Internals")
+
+        with gr.Row():
+            attn_path = get_viz_path("attention_visualization.png")
+            if attn_path:
+                gr.Image(attn_path, label="Self-Attention Pattern")
+
+            pos_path = get_viz_path("positional_encoding_heatmap.png")
+            if pos_path:
+                gr.Image(pos_path, label="Positional Encodings")
+
+        with gr.Row():
+            multi_path = get_viz_path("multihead_attention_visualization.png")
+            if multi_path:
+                gr.Image(multi_path, label="Multi-Head Attention")
+
+            single_path = get_viz_path("single_vs_multihead.png")
+            if single_path:
+                gr.Image(single_path, label="Single vs Multi-Head Comparison")

    # --------------- Architecture Tab ---------------
    with gr.Tab("🔧 Architecture"):
        ...
        """
        ### Model Architecture

+        | Component | Configuration |
+        |-----------|---------------|
+        | **Base** | Custom Transformer (encoder-decoder) |
+        | **Initialization** | FLAN-T5-base weights |
+        | **Encoder** | 6 layers, 768 hidden dim, 12 heads |
+        | **Decoder** | 6 layers with cross-attention |
+        | **Activation** | Gated-GELU |
+        | **Position** | Relative position bias |
+
+        ### Training Configuration
+
+        | Setting | Value |
+        |---------|-------|
+        | **Optimizer** | AdamW (lr=2e-5, wd=0.01) |
+        | **Scheduler** | Cosine with 1000 warmup steps |
+        | **Batch Size** | 14 × 3 accumulation = 42 effective |
+        | **Precision** | TF32 (Ampere GPU) |
+        | **Compilation** | torch.compile (inductor) |

+        ### Datasets

+        | Task | Dataset | Size |
+        |------|---------|------|
+        | **Summarization** | CNN/DailyMail + BookSum | ~110K |
+        | **Emotion** | GoEmotions | ~43K (28 labels) |
+        | **Topic** | Yahoo Answers | ~200K (10 classes) |
        """
        )

    # --------------- About Tab ---------------
    with gr.Tab("ℹ️ About"):
        ...
        """
        ### About LexiMind

+        LexiMind is a **portfolio project** demonstrating end-to-end machine learning engineering:
+
+        ✅ Custom Transformer implementation from scratch
+        ✅ Multi-task learning with shared encoder
+        ✅ Production-ready inference pipeline
+        ✅ Comprehensive evaluation and visualization
+        ✅ CI/CD with GitHub Actions
+
+        ### Known Limitations

+        - **Summarization** quality is limited (needs more training epochs)
+        - **Emotion detection** has low F1 due to class imbalance in GoEmotions
+        - Best results on **news-style text** (training domain)

        ### Links

        - 🔗 [GitHub Repository](https://github.com/OliverPerrin/LexiMind)
+        - 🤗 [Model on HuggingFace](https://huggingface.co/OliverPerrin/LexiMind-Model)

+        ---

+        **Built by Oliver Perrin** | December 2025
        """
        )

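The `analyze_btn.click(...)` call above elides its `inputs`/`outputs` keyword arguments in this view. For reference, a minimal sketch of how this wiring conventionally looks in Gradio; the exact argument list is an assumption, since it is not shown in the diff, but `analyze` returns `(summary, emotion_str, topic_str)` and the three output components are defined above:

```python
# Hypothetical wiring sketch - the real inputs/outputs kwargs are elided in the diff above.
analyze_btn.click(
    fn=analyze,                                     # returns (summary, emotion_str, topic_str)
    inputs=text_input,                              # the single Textbox component
    outputs=[summary_out, emotion_out, topic_out],  # one component per returned value
)
```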
scripts/download_data.py
CHANGED
@@ -85,6 +85,59 @@ TOPIC_LABELS = [
 # --------------- Utility Functions ---------------


+def _normalize_label(label: object, label_names: list[str]) -> str:
+    """Convert a label index or raw value into a string name.
+
+    - Valid integer indices are mapped to label_names.
+    - Everything else is stringified for robustness.
+    """
+
+    if isinstance(label, int) and 0 <= label < len(label_names):
+        return label_names[label]
+    return str(label)
+
+
+def _emotion_records(dataset_split: Any, label_names: list[str]) -> list[dict[str, object]]:
+    """Yield emotion records with resilient label handling."""
+
+    records: list[dict[str, object]] = []
+    for row in dataset_split:
+        text = str(getattr(row, "text", None) or row.get("text", ""))
+        raw_labels = getattr(row, "label", None) or row.get("label") or row.get("labels", [])
+
+        # Normalize to list
+        if isinstance(raw_labels, list):
+            label_values = raw_labels
+        elif raw_labels is None:
+            label_values = []
+        else:
+            label_values = [raw_labels]
+
+        emotions = [_normalize_label(lbl, label_names) for lbl in label_values]
+        if text:
+            records.append({"text": text, "emotions": emotions})
+    return records
+
+
+def _topic_records(dataset_split: Any, label_names: list[str]) -> list[dict[str, object]]:
+    """Yield topic records with resilient label handling."""
+
+    records: list[dict[str, object]] = []
+    for row in dataset_split:
+        text = str(getattr(row, "text", None) or row.get("text", ""))
+        raw_label = getattr(row, "label", None) or row.get("label") or row.get("topic")
+
+        if isinstance(raw_label, list):
+            label_value = raw_label[0] if raw_label else ""
+        else:
+            label_value = raw_label
+
+        topic = _normalize_label(label_value, label_names) if label_value is not None else ""
+        if text:
+            records.append({"text": text, "topic": topic})
+    return records
+
+
 def _write_jsonl(records: list[dict], destination: Path, desc: str = "Writing") -> None:
     """Write records to JSONL file with progress bar."""
     destination.parent.mkdir(parents=True, exist_ok=True)
scripts/process_books.py
ADDED
@@ -0,0 +1,231 @@
+"""
+Process book collection with LexiMind model.
+
+Analyzes each book to generate:
+- Overall topic classification
+- Dominant emotions
+- Concise summary
+
+Results are saved to data/processed/books/library.json for future use.
+
+Author: Oliver Perrin
+Date: December 2025
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from pathlib import Path
+
+PROJECT_ROOT = Path(__file__).resolve().parents[1]
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+
+from src.inference.factory import create_inference_pipeline
+from src.utils.logging import configure_logging, get_logger
+
+configure_logging()
+logger = get_logger(__name__)
+
+# --------------- Configuration ---------------
+
+BOOKS_DIR = PROJECT_ROOT / "data" / "raw" / "books"
+OUTPUT_PATH = PROJECT_ROOT / "data" / "processed" / "books" / "library.json"
+
+# Chunk books into manageable sections for analysis
+MAX_CHUNK_LENGTH = 1000  # characters per chunk
+MAX_CHUNKS = 5  # analyze first N chunks to get representative sample
+
+
+# --------------- Book Processing ---------------
+
+
+def clean_text(text: str) -> str:
+    """Clean and normalize book text."""
+    # Remove Project Gutenberg headers/footers (common patterns)
+    lines = text.split("\n")
+    start_idx = 0
+    end_idx = len(lines)
+
+    for i, line in enumerate(lines):
+        if "START OF" in line.upper() and "PROJECT GUTENBERG" in line.upper():
+            start_idx = i + 1
+            break
+
+    for i in range(len(lines) - 1, -1, -1):
+        if "END OF" in lines[i].upper() and "PROJECT GUTENBERG" in lines[i].upper():
+            end_idx = i
+            break
+
+    text = "\n".join(lines[start_idx:end_idx])
+
+    # Basic cleanup
+    text = text.strip()
+    text = " ".join(text.split())  # normalize whitespace
+
+    return text
+
+
+def chunk_text(text: str, chunk_size: int = MAX_CHUNK_LENGTH) -> list[str]:
+    """Split text into chunks for analysis."""
+    words = text.split()
+    chunks = []
+    current_chunk = []
+    current_length = 0
+
+    for word in words:
+        current_chunk.append(word)
+        current_length += len(word) + 1  # +1 for space
+
+        if current_length >= chunk_size:
+            chunks.append(" ".join(current_chunk))
+            current_chunk = []
+            current_length = 0
+
+    if current_chunk:
+        chunks.append(" ".join(current_chunk))
+
+    return chunks
+
+
+def process_book(book_path: Path, pipeline) -> dict:
+    """Analyze a single book and return metadata."""
+    logger.info(f"Processing {book_path.name}...")
+
+    # Read and clean
+    try:
+        text = book_path.read_text(encoding="utf-8", errors="ignore")
+    except Exception as exc:
+        logger.error(f"Failed to read {book_path.name}: {exc}")
+        return {}
+
+    text = clean_text(text)
+
+    if not text or len(text) < 100:
+        logger.warning(f"Skipping {book_path.name} - insufficient content")
+        return {}
+
+    # Chunk and sample
+    chunks = chunk_text(text)
+    sample_chunks = chunks[: min(MAX_CHUNKS, len(chunks))]
+
+    logger.info(f"  Analyzing {len(sample_chunks)} chunks (of {len(chunks)} total)...")
+
+    # Run inference on chunks
+    try:
+        topics = pipeline.predict_topics(sample_chunks)
+        emotions = pipeline.predict_emotions(sample_chunks, threshold=0.3)
+        summaries = pipeline.summarize(sample_chunks, max_length=64)
+
+        # Aggregate results
+        # Topic: most common prediction
+        topic_counts: dict[str, int] = {}
+        for t in topics:
+            topic_counts[t.label] = topic_counts.get(t.label, 0) + 1
+        dominant_topic = max(topic_counts.items(), key=lambda x: x[1])[0]
+
+        # Emotion: aggregate top emotions
+        all_emotions: dict[str, list[float]] = {}
+        for emotion in emotions:
+            for label, score in zip(emotion.labels, emotion.scores, strict=False):
+                if label not in all_emotions:
+                    all_emotions[label] = []
+                all_emotions[label].append(score)
+
+        # Average scores and take top 3
+        emotion_scores = {
+            label: sum(scores) / len(scores) for label, scores in all_emotions.items()
+        }
+        top_emotions = sorted(emotion_scores.items(), key=lambda x: x[1], reverse=True)[:3]
+
+        # Summary: combine first few chunk summaries
+        combined_summary = " ".join(summaries[:3])
+
+        result: dict[str, object] = {
+            "title": book_path.stem.replace("_", " ").title(),
+            "filename": book_path.name,
+            "topic": dominant_topic,
+            "emotions": [{"label": label, "score": float(score)} for label, score in top_emotions],
+            "summary": combined_summary,
+            "word_count": len(text.split()),
+            "chunks_analyzed": len(sample_chunks),
+        }
+
+        logger.info(
+            f"  ✓ {result['title']}: {result['topic']} | "
+            f"{', '.join(str(e['label']) for e in result['emotions'][:2] if isinstance(e, dict))}"  # type: ignore[index]
+        )
+
+        return result
+
+    except Exception as exc:
+        logger.error(f"Analysis failed for {book_path.name}: {exc}", exc_info=True)
+        return {}
+
+
+# --------------- Main ---------------
+
+
+def main():
+    """Process all books and save library."""
+    logger.info("Loading inference pipeline...")
+
+    pipeline, label_metadata = create_inference_pipeline(
+        tokenizer_dir="artifacts/hf_tokenizer/",
+        checkpoint_path="checkpoints/best.pt",
+        labels_path="artifacts/labels.json",
+    )
+
+    logger.info("Finding books...")
+    book_files = sorted(BOOKS_DIR.glob("*.txt"))
+
+    if not book_files:
+        logger.error(f"No books found in {BOOKS_DIR}")
+        return
+
+    logger.info(f"Found {len(book_files)} books")
+
+    # Process each book
+    library = []
+    for book_path in book_files:
+        result = process_book(book_path, pipeline)
+        if result:
+            library.append(result)
+
+    # Save results
+    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    with open(OUTPUT_PATH, "w") as f:
+        json.dump(
+            {
+                "books": library,
+                "metadata": {
+                    "total_books": len(library),
+                    "chunk_size": MAX_CHUNK_LENGTH,
+                    "chunks_per_book": MAX_CHUNKS,
+                },
+            },
+            f,
+            indent=2,
+        )
+
+    logger.info(f"\n✓ Library saved to {OUTPUT_PATH}")
+    logger.info(f"  Processed {len(library)} books")
+
+    # Print summary
+    print("\n" + "=" * 60)
+    print("BOOK LIBRARY SUMMARY")
+    print("=" * 60)
+
+    for book in library:
+        print(f"\n📚 {book['title']}")
+        print(f"  Topic: {book['topic']}")
+        emotions_str = ", ".join(f"{e['label']} ({e['score']:.0%})" for e in book["emotions"])
+        print(f"  Emotions: {emotions_str}")
+        print(f"  Summary: {book['summary'][:100]}...")
+
+    print("\n" + "=" * 60)
+
+
+if __name__ == "__main__":
+    main()
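Note that `chunk_text` budgets by characters over whitespace-split words, so chunk boundaries never split a word. Assuming the artifacts referenced in `main()` (`checkpoints/best.pt`, `artifacts/hf_tokenizer/`, `artifacts/labels.json`) exist locally, the script runs like the other project entry points:

```bash
poetry run python scripts/process_books.py
```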
scripts/train.py
CHANGED
@@ -13,6 +13,7 @@ from __future__ import annotations
 import json
 import logging
 import os
+import re
 import sys
 import time
 import warnings

@@ -51,7 +52,7 @@ from src.data.tokenization import Tokenizer, TokenizerConfig
 from src.models.factory import ModelConfig, build_multitask_model
 from src.training.trainer import Trainer, TrainerConfig
 from src.training.utils import set_seed
-from src.utils.io import save_state
+from src.utils.io import load_state, save_state
 from src.utils.labels import LabelMetadata, save_label_metadata

 # --------------- Data Loading ---------------

@@ -93,12 +94,13 @@ def limit_samples(splits: Dict[str, list], cfg: DictConfig) -> None:


 def compile_model(model: torch.nn.Module) -> torch.nn.Module:
-    """Compile model with inductor backend (...
+    """Compile model with inductor backend (optimized for speed)."""
+    print(f" -> Enabling torch.compile for {model.__class__.__name__}...")
     from src.training.safe_compile import apply_safe_config, compile_model_safe

     # Apply safe configuration first
     apply_safe_config()
-    # Compile with default mode (inductor...
+    # Compile with default mode (inductor) - most stable
     return compile_model_safe(model, mode="default")


@@ -148,10 +150,12 @@ def main(cfg: DictConfig) -> None:
     # --------------- Tokenizer & Datasets ---------------

     tok_cfg = data_cfg.get("tokenizer", {})
+    # Allow training overrides for max_length to run shorter dev sweeps
+    override_max_len = cfg.training.get("tokenizer_max_length")
     tokenizer = Tokenizer(
         TokenizerConfig(
             pretrained_model_name=tok_cfg.get("pretrained_model_name", "google/flan-t5-base"),
-            max_length=int(tok_cfg.get("max_length", 512)),
+            max_length=int(override_max_len or tok_cfg.get("max_length", 512)),
             lower=bool(tok_cfg.get("lower", False)),
         )
     )

@@ -238,6 +242,7 @@ def main(cfg: DictConfig) -> None:
     device = torch.device(cfg.device)
     model_cfg = ModelConfig(
         d_model=cfg.model.d_model,
+        vocab_size=getattr(cfg.model, "vocab_size", None),  # Override tokenizer vocab if specified
         num_encoder_layers=cfg.model.num_encoder_layers,
         num_decoder_layers=cfg.model.num_decoder_layers,
         num_attention_heads=cfg.model.num_attention_heads,

@@ -255,12 +260,41 @@ def main(cfg: DictConfig) -> None:
         config=model_cfg,
     ).to(device)

+    # If training crashes: resume from checkpoint if provided (load before compile to avoid key mismatches)
+    start_epoch = 1
+    resume_path = cfg.get("resume_from")
+    if resume_path:
+        ckpt_path = Path(resume_path)
+        if ckpt_path.exists():
+            print(f"\n↩ Resuming from checkpoint: {ckpt_path}")
+            load_state(model, str(ckpt_path))
+            # Parse epoch number robustly from filename (e.g., epoch_5.pt)
+            epoch_num = None
+            try:
+                # Prefer stem (no suffix); fallback to any digit sequence in name
+                digits = re.findall(r"\d+", ckpt_path.stem)
+                if digits:
+                    epoch_num = int(digits[-1])
+            except Exception:
+                epoch_num = None
+
+            if epoch_num is not None:
+                start_epoch = epoch_num + 1
+                print(f" -> Starting from epoch {start_epoch}")
+            else:
+                print(" -> Could not parse epoch number; starting from epoch 1")
+                start_epoch = 1
+        else:
+            print(f"⚠ Resume checkpoint not found: {ckpt_path}. Starting from scratch.")
+
     # Compile encoder/decoder for faster training (skip heads - small overhead)
-    ...
+    compile_encoder = bool(cfg.training.get("compile_encoder", True))
+    compile_decoder = bool(cfg.training.get("compile_decoder", True))
+    if compile_encoder and model.encoder is not None:
         from src.models.encoder import TransformerEncoder

         model.encoder = cast(TransformerEncoder, compile_model(model.encoder))
-    if model.decoder is not None:
+    if compile_decoder and model.decoder is not None:
         from src.models.decoder import TransformerDecoder

         model.decoder = cast(TransformerDecoder, compile_model(model.decoder))

@@ -268,21 +302,30 @@ def main(cfg: DictConfig) -> None:
     # --------------- Optimizer & Trainer ---------------

     opt_cfg = cfg.training.get("optimizer", {})
+    sched_cfg = cfg.training.get("scheduler", {})
     optimizer = torch.optim.AdamW(
         model.parameters(),
         lr=float(opt_cfg.get("lr", 3e-5)),
        weight_decay=float(opt_cfg.get("weight_decay", 0.01)),
     )

+    # Clamp start_epoch to max_epochs to avoid an empty training loop
+    max_epochs = int(trainer_cfg.get("max_epochs", 1))
+    if start_epoch > max_epochs:
+        print(f"⚠ resume_from points past max_epochs ({max_epochs}); nothing to train. Setting start_epoch to {max_epochs}")
+        start_epoch = max_epochs
+
     trainer = Trainer(
         model=model,
         optimizer=optimizer,
         config=TrainerConfig(
-            max_epochs=...
+            max_epochs=max_epochs,
             gradient_clip_norm=float(trainer_cfg.get("gradient_clip_norm", 1.0)),
             task_weights=trainer_cfg.get("task_weights"),
             label_smoothing=float(trainer_cfg.get("label_smoothing", 0.0)),
             gradient_accumulation_steps=int(trainer_cfg.get("gradient_accumulation_steps", 1)),
+            scheduler_type=str(sched_cfg.get("name", "constant")),
+            warmup_steps=int(sched_cfg.get("warmup_steps", 0)),
         ),
         device=device,
         tokenizer=tokenizer,

@@ -298,7 +341,12 @@ def main(cfg: DictConfig) -> None:
         save_state(model, str(path))

     print("\nStarting training...")
-    history = trainer.fit(...
+    history = trainer.fit(
+        train_loaders,
+        val_loaders,
+        checkpoint_callback=save_checkpoint,
+        start_epoch=start_epoch,
+    )

     # --------------- Save Outputs ---------------

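Because `resume_from` is read with `cfg.get(...)`, it can be supplied as an ordinary Hydra override, consistent with the other training overrides. A sketch (the checkpoint filename is illustrative; the epoch number is parsed from it):

```bash
# Resume after a crash; the epoch is parsed from the filename (epoch_2.pt -> start at epoch 3)
poetry run python scripts/train.py resume_from=checkpoints/epoch_2.pt
```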
scripts/visualize_training.py
ADDED
@@ -0,0 +1,341 @@
+"""
+Visualize training metrics from MLflow runs.
+
+Generates plots showing:
+- Loss curves (training/validation)
+- Task-specific metrics over time
+- Learning rate schedule
+- Training speed analysis
+
+Author: Oliver Perrin
+Date: December 2025
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import mlflow
+import mlflow.tracking
+import seaborn as sns
+
+PROJECT_ROOT = Path(__file__).resolve().parents[1]
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+
+from src.utils.logging import configure_logging, get_logger
+
+configure_logging()
+logger = get_logger(__name__)
+
+# Configure plotting style
+sns.set_style("whitegrid")
+plt.rcParams["figure.figsize"] = (12, 8)
+plt.rcParams["figure.dpi"] = 100
+
+OUTPUTS_DIR = PROJECT_ROOT / "outputs"
+MLRUNS_DIR = PROJECT_ROOT / "mlruns"
+
+
+def load_training_history() -> dict[str, object] | None:
+    """Load training history from JSON if available."""
+    history_path = OUTPUTS_DIR / "training_history.json"
+    if history_path.exists():
+        with open(history_path) as f:
+            data: dict[str, object] = json.load(f)
+        return data
+    return None
+
+
+def get_latest_run():
+    """Get the most recent MLflow run."""
+    mlflow.set_tracking_uri(f"file://{MLRUNS_DIR}")
+    client = mlflow.tracking.MlflowClient()
+
+    # Get the experiment (LexiMind)
+    experiment = client.get_experiment_by_name("LexiMind")
+    if not experiment:
+        logger.error("No 'LexiMind' experiment found")
+        return None
+
+    # Get all runs, sorted by start time
+    runs = client.search_runs(
+        experiment_ids=[experiment.experiment_id],
+        order_by=["start_time DESC"],
+        max_results=1,
+    )
+
+    if not runs:
+        logger.error("No runs found in experiment")
+        return None
+
+    return runs[0]
+
+
+def plot_loss_curves(run):
+    """Plot training and validation loss over time."""
+    client = mlflow.tracking.MlflowClient()
+
+    # Get metrics
+    train_loss = client.get_metric_history(run.info.run_id, "train_total_loss")
+    val_loss = client.get_metric_history(run.info.run_id, "val_total_loss")
+
+    fig, ax = plt.subplots(figsize=(12, 6))
+
+    if not train_loss:
+        # Create placeholder plot
+        ax.text(
+            0.5,
+            0.5,
+            "No training data yet\n\nWaiting for first epoch to complete...",
+            ha="center",
+            va="center",
+            fontsize=14,
+            color="gray",
+        )
+        ax.set_xlim(0, 1)
+        ax.set_ylim(0, 1)
+    else:
+        # Extract steps and values
+        train_steps = [m.step for m in train_loss]
+        train_values = [m.value for m in train_loss]
+
+        ax.plot(train_steps, train_values, label="Training Loss", linewidth=2, alpha=0.8)
+
+        if val_loss:
+            val_steps = [m.step for m in val_loss]
+            val_values = [m.value for m in val_loss]
+            ax.plot(val_steps, val_values, label="Validation Loss", linewidth=2, alpha=0.8)
+
+        ax.legend(fontsize=11)
+
+    ax.set_xlabel("Epoch", fontsize=12)
+    ax.set_ylabel("Loss", fontsize=12)
+    ax.set_title("Training Progress: Total Loss", fontsize=14, fontweight="bold")
+    ax.grid(True, alpha=0.3)
+
+    plt.tight_layout()
+    output_path = OUTPUTS_DIR / "training_loss_curve.png"
+    plt.savefig(output_path, dpi=150, bbox_inches="tight")
+    logger.info(f"✓ Saved loss curve to {output_path}")
+    plt.close()
+
+
+def plot_task_metrics(run):
+    """Plot metrics for each task."""
+    client = mlflow.tracking.MlflowClient()
+
+    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
+    fig.suptitle("Task-Specific Training Metrics", fontsize=16, fontweight="bold")
+
+    # Summarization
+    ax = axes[0, 0]
+    train_sum = client.get_metric_history(run.info.run_id, "train_summarization_loss")
+    val_sum = client.get_metric_history(run.info.run_id, "val_summarization_loss")
+
+    if train_sum:
+        ax.plot(
+            [m.step for m in train_sum], [m.value for m in train_sum], label="Train", linewidth=2
+        )
+    if val_sum:
+        ax.plot([m.step for m in val_sum], [m.value for m in val_sum], label="Val", linewidth=2)
+    ax.set_title("Summarization Loss", fontweight="bold")
+    ax.set_xlabel("Epoch")
+    ax.set_ylabel("Loss")
+    ax.legend()
+    ax.grid(True, alpha=0.3)
+
+    # Emotion
+    ax = axes[0, 1]
+    train_emo = client.get_metric_history(run.info.run_id, "train_emotion_loss")
+    val_emo = client.get_metric_history(run.info.run_id, "val_emotion_loss")
+    train_f1 = client.get_metric_history(run.info.run_id, "train_emotion_f1")
+    val_f1 = client.get_metric_history(run.info.run_id, "val_emotion_f1")
+
+    if train_emo:
+        ax.plot(
+            [m.step for m in train_emo],
+            [m.value for m in train_emo],
+            label="Train Loss",
+            linewidth=2,
+        )
+    if val_emo:
+        ax.plot(
+            [m.step for m in val_emo], [m.value for m in val_emo], label="Val Loss", linewidth=2
+        )
+
+    ax2 = ax.twinx()
+    if train_f1:
+        ax2.plot(
+            [m.step for m in train_f1],
+            [m.value for m in train_f1],
+            label="Train F1",
+            linewidth=2,
+            linestyle="--",
+            alpha=0.7,
+        )
+    if val_f1:
+        ax2.plot(
+            [m.step for m in val_f1],
+            [m.value for m in val_f1],
+            label="Val F1",
+            linewidth=2,
+            linestyle="--",
+            alpha=0.7,
+        )
+
+    ax.set_title("Emotion Detection", fontweight="bold")
+    ax.set_xlabel("Epoch")
+    ax.set_ylabel("Loss")
+    ax2.set_ylabel("F1 Score")
+    ax.legend(loc="upper left")
+    ax2.legend(loc="upper right")
+    ax.grid(True, alpha=0.3)
+
+    # Topic
+    ax = axes[1, 0]
+    train_topic = client.get_metric_history(run.info.run_id, "train_topic_loss")
+    val_topic = client.get_metric_history(run.info.run_id, "val_topic_loss")
+    train_acc = client.get_metric_history(run.info.run_id, "train_topic_accuracy")
+    val_acc = client.get_metric_history(run.info.run_id, "val_topic_accuracy")
+
+    if train_topic:
+        ax.plot(
+            [m.step for m in train_topic],
+            [m.value for m in train_topic],
+            label="Train Loss",
+            linewidth=2,
+        )
+    if val_topic:
+        ax.plot(
+            [m.step for m in val_topic], [m.value for m in val_topic], label="Val Loss", linewidth=2
+        )
+
+    ax2 = ax.twinx()
+    if train_acc:
+        ax2.plot(
+            [m.step for m in train_acc],
+            [m.value for m in train_acc],
+            label="Train Acc",
+            linewidth=2,
+            linestyle="--",
+            alpha=0.7,
+        )
+    if val_acc:
+        ax2.plot(
+            [m.step for m in val_acc],
+            [m.value for m in val_acc],
+            label="Val Acc",
+            linewidth=2,
+            linestyle="--",
+            alpha=0.7,
+        )
+
+    ax.set_title("Topic Classification", fontweight="bold")
+    ax.set_xlabel("Epoch")
+    ax.set_ylabel("Loss")
+    ax2.set_ylabel("Accuracy")
+    ax.legend(loc="upper left")
+    ax2.legend(loc="upper right")
+    ax.grid(True, alpha=0.3)
+
+    # Summary statistics
+    ax = axes[1, 1]
+    ax.axis("off")
+
+    # Get final metrics
+    summary_text = "Final Metrics (Last Epoch)\n" + "=" * 35 + "\n\n"
+
+    if val_topic and val_acc:
+        summary_text += f"Topic Accuracy: {val_acc[-1].value:.1%}\n"
+    if val_emo and val_f1:
+        summary_text += f"Emotion F1: {val_f1[-1].value:.1%}\n"
+    if val_sum:
+        summary_text += f"Summarization Loss: {val_sum[-1].value:.3f}\n"
+
+    ax.text(0.1, 0.5, summary_text, fontsize=12, family="monospace", verticalalignment="center")
+
+    plt.tight_layout()
+    output_path = OUTPUTS_DIR / "task_metrics.png"
+    plt.savefig(output_path, dpi=150, bbox_inches="tight")
+    logger.info(f"✓ Saved task metrics to {output_path}")
+    plt.close()
+
+
+def plot_learning_rate(run):
+    """Plot learning rate schedule if available."""
+    client = mlflow.tracking.MlflowClient()
+    lr_metrics = client.get_metric_history(run.info.run_id, "learning_rate")
+
+    fig, ax = plt.subplots(figsize=(12, 5))
+
+    if not lr_metrics:
+        # Create placeholder
+        ax.text(
+            0.5,
+            0.5,
+            "No learning rate data yet\n\n(Will be logged in future training runs)",
+            ha="center",
+            va="center",
+            fontsize=14,
+            color="gray",
+        )
+        ax.set_xlim(0, 1)
+        ax.set_ylim(0, 1)
+    else:
+        steps = [m.step for m in lr_metrics]
+        values = [m.value for m in lr_metrics]
+
+        ax.plot(steps, values, linewidth=2, color="darkblue")
+
+        # Mark warmup region
+        warmup_steps = 1000  # From config
+        if warmup_steps < max(steps):
+            ax.axvline(warmup_steps, color="red", linestyle="--", alpha=0.5, label="Warmup End")
+            ax.legend()
+
+    ax.set_xlabel("Step", fontsize=12)
+    ax.set_ylabel("Learning Rate", fontsize=12)
+    ax.set_title("Learning Rate Schedule (Cosine with Warmup)", fontsize=14, fontweight="bold")
+    ax.grid(True, alpha=0.3)
+
+    plt.tight_layout()
+    output_path = OUTPUTS_DIR / "learning_rate_schedule.png"
+    plt.savefig(output_path, dpi=150, bbox_inches="tight")
+    logger.info(f"✓ Saved LR schedule to {output_path}")
+    plt.close()
+
+
+def main():
+    """Generate all training visualizations."""
+    logger.info("Loading MLflow data...")
+
+    run = get_latest_run()
+    if not run:
+        logger.error("No training run found. Make sure training has started.")
+        return
+
+    logger.info(f"Analyzing run: {run.info.run_id}")
+
+    OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
+
+    logger.info("Generating visualizations...")
+
+    plot_loss_curves(run)
+    plot_task_metrics(run)
+    plot_learning_rate(run)
+
+    logger.info("\n" + "=" * 60)
+    logger.info("✓ All visualizations saved to outputs/")
+    logger.info("=" * 60)
+    logger.info("  - training_loss_curve.png")
+    logger.info("  - task_metrics.png")
+    logger.info("  - learning_rate_schedule.png")
+    logger.info("=" * 60)
+
+
+if __name__ == "__main__":
+    main()
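The script reads the local `mlruns/` tracking store (the same one `mlflow ui` serves) and writes three PNGs to `outputs/`; it can be run any time after the first epoch has been logged:

```bash
poetry run python scripts/visualize_training.py
```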
src/data/dataloader.py
CHANGED
@@ -48,13 +48,16 @@ class SummarizationCollator:
         src_enc = self.tokenizer.batch_encode(sources, max_length=self.max_source_length)
         tgt_enc = self.tokenizer.batch_encode(targets, max_length=self.max_target_length)

-        # Shift targets: tgt_ids = [BOS, A, B], labels = [A, B, EOS]
         ids = tgt_enc["input_ids"]
         mask = tgt_enc["attention_mask"]

-        labels = ids
-        labels[mask...
+        # Create labels for loss: mask padding with -100
+        labels = ids.clone()
+        labels[mask == 0] = -100
+
+        # Create decoder inputs from original ids (no -100)
+        # prepare_decoder_inputs shifts right and adds BOS
+        tgt_ids = self.tokenizer.prepare_decoder_inputs(ids)

         return {
             "src_ids": src_enc["input_ids"],
src/inference/pipeline.py
CHANGED
@@ -69,6 +69,7 @@ class InferenceConfig:

     summary_max_length: int = 128
     summary_repetition_penalty: float = 1.2  # Penalize repeated tokens
+    summary_formatting: bool = True  # Apply text cleanup/formatting to generated summaries
     emotion_threshold: float = 0.5
     device: str | None = None

@@ -164,6 +165,8 @@ class InferencePipeline:

         # Decode and format summaries
         raw_summaries = self.tokenizer.decode_batch(generated.tolist())
+        if not self.config.summary_formatting:
+            return raw_summaries
         return [_format_summary(s) for s in raw_summaries]

     # --------------- Emotion ---------------
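With the new flag, callers that want the raw decoder output (for example, scoring ROUGE against unformatted references) can opt out of the cleanup step. A sketch, assuming `InferenceConfig` can be constructed directly with keyword overrides as its field defaults suggest:

```python
from src.inference.pipeline import InferenceConfig

# Hypothetical construction - other fields keep their defaults
config = InferenceConfig(summary_max_length=64, summary_formatting=False)
```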
src/inference/postprocessing.py
DELETED
@@ -1,14 +0,0 @@
-"""
-Output postprocessing utilities for LexiMind.
-
-Provides text cleaning helpers for model outputs.
-
-Author: Oliver Perrin
-Date: December 2025
-"""
-
-from typing import List
-
-
-def strip_whitespace(texts: List[str]) -> List[str]:
-    return [text.strip() for text in texts]
src/models/decoder.py
CHANGED
|
@@ -18,10 +18,12 @@ from typing import Any, Dict, List, Literal, Optional, Tuple, Union, cast
|
|
| 18 |
|
| 19 |
import torch
|
| 20 |
import torch.nn as nn
|
|
|
|
| 21 |
|
| 22 |
from .attention import MultiHeadAttention, T5RelativePositionBias
|
| 23 |
from .feedforward import FeedForward
|
| 24 |
from .positional_encoding import LearnedPositionalEncoding, PositionalEncoding
|
|
|
|
| 25 |
|
| 26 |
|
| 27 |
def create_causal_mask(seq_len: int, device: Optional[torch.device] = None) -> torch.Tensor:
|
|
@@ -77,9 +79,9 @@ class TransformerDecoderLayer(nn.Module):
|
|
| 77 |
quantization=quantization,
|
| 78 |
)
|
| 79 |
|
| 80 |
-
self.norm1 =
|
| 81 |
-
self.norm2 =
|
| 82 |
-
self.norm3 =
|
| 83 |
|
| 84 |
self.dropout1 = nn.Dropout(dropout)
|
| 85 |
self.dropout2 = nn.Dropout(dropout)
|
|
@@ -189,6 +191,7 @@ class TransformerDecoder(nn.Module):
|
|
| 189 |
use_learned_pos_enc: bool = False,
|
| 190 |
activation: Literal["gelu", "relu", "swiglu", "gated-gelu"] = "gated-gelu",
|
| 191 |
use_relative_position_bias: bool = False, # T5-style relative position bias
|
|
|
|
| 192 |
):
|
| 193 |
super().__init__()
|
| 194 |
self.vocab_size = vocab_size
|
|
@@ -196,8 +199,10 @@ class TransformerDecoder(nn.Module):
|
|
| 196 |
self.pad_token_id = pad_token_id
|
| 197 |
self.num_heads = num_heads
|
| 198 |
self.use_relative_position_bias = use_relative_position_bias
|
|
|
|
| 199 |
|
| 200 |
self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_token_id)
|
|
|
|
| 201 |
|
| 202 |
# Positional encoding (disabled when using relative position bias for T5)
|
| 203 |
self.self_relative_position_bias: Optional[T5RelativePositionBias] = None
|
|
@@ -238,8 +243,8 @@ class TransformerDecoder(nn.Module):
|
|
| 238 |
]
|
| 239 |
)
|
| 240 |
|
| 241 |
-
self.final_norm =
|
| 242 |
-
self.output_projection = nn.Linear(d_model, vocab_size)
|
| 243 |
self.input_dropout = nn.Dropout(dropout)
|
| 244 |
|
| 245 |
def _build_padding_mask_from_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
|
|
@@ -252,6 +257,18 @@ class TransformerDecoder(nn.Module):
|
|
| 252 |
"""
|
| 253 |
assert self.pad_token_id is not None, "pad_token_id must be set to build mask from ids"
|
| 254 |
pad_mask = input_ids != self.pad_token_id # (B, T)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 255 |
attn_mask = pad_mask.unsqueeze(1) & pad_mask.unsqueeze(2) # (B, T, T)
|
| 256 |
return attn_mask
|
| 257 |
|
|
@@ -263,7 +280,7 @@ class TransformerDecoder(nn.Module):
|
|
| 263 |
memory_mask: Optional[torch.Tensor] = None,
|
| 264 |
collect_attn: bool = False,
|
| 265 |
skip_padding_mask: bool = False, # Set True during generation to avoid masking start token
|
| 266 |
-
) -> Union[torch.Tensor, Tuple[torch.Tensor, List[Dict[str, torch.Tensor]]]]:
|
| 267 |
"""
|
| 268 |
Args:
|
| 269 |
inputs: (B, T) token ids or (B, T, d_model) embeddings
|
|
@@ -304,6 +321,12 @@ class TransformerDecoder(nn.Module):
|
|
| 304 |
else:
|
| 305 |
# Ensure boolean and device alignment; accept (B, T, T) or (B,1,T,T) or (1,1,T,T)
|
| 306 |
tgt_mask = tgt_mask.to(dtype=torch.bool, device=x.device)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 307 |
|
| 308 |
# Normalize memory_mask dtype/device and expand simple shapes
|
| 309 |
if memory_mask is not None:
|
|
@@ -313,7 +336,7 @@ class TransformerDecoder(nn.Module):
|
|
| 313 |
elif memory_mask.dim() == 3: # (B, T, S) -> (B, 1, T, S)
|
| 314 |
memory_mask = memory_mask.unsqueeze(1)
|
| 315 |
|
| 316 |
-
attn_list: List[Dict[str, torch.Tensor]] = []
|
| 317 |
|
| 318 |
# Compute relative position biases (T5-style)
|
| 319 |
# Note: T5 uses relative position bias for self-attention but NOT for cross-attention
|
|
@@ -328,19 +351,37 @@ class TransformerDecoder(nn.Module):
|
|
| 328 |
|
| 329 |
# Pass through decoder layers
|
| 330 |
for layer in self.layers:
|
| 331 |
-
|
| 332 |
-
|
| 333 |
-
|
| 334 |
-
|
| 335 |
-
|
| 336 |
-
|
| 337 |
-
|
| 338 |
-
|
| 339 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 340 |
if collect_attn:
|
| 341 |
attn_list.append(attn)
|
| 342 |
|
| 343 |
x = self.final_norm(x)
|
|
|
|
| 344 |
logits = self.output_projection(x) # (B, T, vocab)
|
| 345 |
|
| 346 |
if collect_attn:
|
|
|
|
| 18 |
|
| 19 |
import torch
|
| 20 |
import torch.nn as nn
|
| 21 |
+
from torch.utils.checkpoint import checkpoint
|
| 22 |
|
| 23 |
from .attention import MultiHeadAttention, T5RelativePositionBias
|
| 24 |
from .feedforward import FeedForward
|
| 25 |
from .positional_encoding import LearnedPositionalEncoding, PositionalEncoding
|
| 26 |
+
from .t5_layer_norm import T5LayerNorm
|
| 27 |
|
| 28 |
|
| 29 |
def create_causal_mask(seq_len: int, device: Optional[torch.device] = None) -> torch.Tensor:
|
|
|
|
| 79 |
quantization=quantization,
|
| 80 |
)
|
| 81 |
|
| 82 |
+
self.norm1 = T5LayerNorm(d_model)
|
| 83 |
+
self.norm2 = T5LayerNorm(d_model)
|
| 84 |
+
self.norm3 = T5LayerNorm(d_model)
|
| 85 |
|
| 86 |
self.dropout1 = nn.Dropout(dropout)
|
| 87 |
self.dropout2 = nn.Dropout(dropout)
|
|
|
|
| 191 |
use_learned_pos_enc: bool = False,
|
| 192 |
activation: Literal["gelu", "relu", "swiglu", "gated-gelu"] = "gated-gelu",
|
| 193 |
use_relative_position_bias: bool = False, # T5-style relative position bias
|
| 194 |
+
gradient_checkpointing: bool = False,
|
| 195 |
):
|
| 196 |
super().__init__()
|
| 197 |
self.vocab_size = vocab_size
|
|
|
|
| 199 |
self.pad_token_id = pad_token_id
|
| 200 |
self.num_heads = num_heads
|
| 201 |
self.use_relative_position_bias = use_relative_position_bias
|
| 202 |
+
self.gradient_checkpointing = gradient_checkpointing
|
| 203 |
|
| 204 |
self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_token_id)
|
| 205 |
+
# Note: T5 does NOT scale logits (scaling factor removed)
|
| 206 |
|
| 207 |
# Positional encoding (disabled when using relative position bias for T5)
|
| 208 |
self.self_relative_position_bias: Optional[T5RelativePositionBias] = None
|
|
|
|
| 243 |
]
|
| 244 |
)
|
| 245 |
|
| 246 |
+
self.final_norm = T5LayerNorm(d_model)
|
| 247 |
+
self.output_projection = nn.Linear(d_model, vocab_size, bias=False) # T5 has no bias
|
| 248 |
self.input_dropout = nn.Dropout(dropout)
|
| 249 |
|
| 250 |
def _build_padding_mask_from_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
|
|
|
|
        """
        assert self.pad_token_id is not None, "pad_token_id must be set to build mask from ids"
        pad_mask = input_ids != self.pad_token_id  # (B, T)
+
+        # Always allow attending to the first token (BOS), even if it is pad_token_id
+        # Avoid in-place mutation for better torch.compile compatibility
+        if pad_mask.size(1) > 0:
+            # Create a mask for the first column (B, 1)
+            first_col_mask = torch.zeros_like(pad_mask[:, :1], dtype=torch.bool)
+            first_col_mask[:] = True
+            # Combine: pad_mask OR (column == 0)
+            # We can do this by creating a column index tensor
+            col_indices = torch.arange(pad_mask.size(1), device=pad_mask.device).unsqueeze(0)
+            pad_mask = pad_mask | (col_indices == 0)
+
        attn_mask = pad_mask.unsqueeze(1) & pad_mask.unsqueeze(2)  # (B, T, T)
        return attn_mask

...

        memory_mask: Optional[torch.Tensor] = None,
        collect_attn: bool = False,
        skip_padding_mask: bool = False,  # Set True during generation to avoid masking start token
+    ) -> Union[torch.Tensor, Tuple[torch.Tensor, List[Dict[str, Optional[torch.Tensor]]]]]:
        """
        Args:
            inputs: (B, T) token ids or (B, T, d_model) embeddings

...

        else:
            # Ensure boolean and device alignment; accept (B, T, T) or (B,1,T,T) or (1,1,T,T)
            tgt_mask = tgt_mask.to(dtype=torch.bool, device=x.device)
+            # If tgt_mask is just causal (T, T), expand it
+            if tgt_mask.dim() == 2:
+                tgt_mask = tgt_mask.unsqueeze(0).unsqueeze(0)
+            elif tgt_mask.dim() == 3:
+                tgt_mask = tgt_mask.unsqueeze(1)
+

        # Normalize memory_mask dtype/device and expand simple shapes
        if memory_mask is not None:

...

            elif memory_mask.dim() == 3:  # (B, T, S) -> (B, 1, T, S)
                memory_mask = memory_mask.unsqueeze(1)

+        attn_list: List[Dict[str, Optional[torch.Tensor]]] = []

        # Compute relative position biases (T5-style)
        # Note: T5 uses relative position bias for self-attention but NOT for cross-attention

...

        # Pass through decoder layers
        for layer in self.layers:
+            if self.gradient_checkpointing and self.training:
+                # Gradient checkpointing requires the inputs to require grad
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        return module(*inputs, tgt_mask=tgt_mask, memory_mask=memory_mask, collect_attn=collect_attn, self_attn_position_bias=self_position_bias, cross_attn_position_bias=cross_position_bias)
+                    return custom_forward
+
+                x, attn = cast(
+                    Tuple[torch.Tensor, Dict[str, Optional[torch.Tensor]]],
+                    checkpoint(
+                        create_custom_forward(layer),
+                        x,
+                        memory,
+                        use_reentrant=False,
+                    ),
+                )
+            else:
+                x, attn = layer(
+                    x,
+                    memory,
+                    tgt_mask=tgt_mask,
+                    memory_mask=memory_mask,
+                    collect_attn=collect_attn,
+                    self_attn_position_bias=self_position_bias,
+                    cross_attn_position_bias=cross_position_bias,
+                )
            if collect_attn:
                attn_list.append(attn)

        x = self.final_norm(x)
+        # T5 does NOT scale logits - direct projection to vocabulary
        logits = self.output_projection(x)  # (B, T, vocab)

        if collect_attn:
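A note on the padding-mask change above: T5 uses the pad token as the decoder start token, so a naive `input_ids != pad_token_id` mask would hide position 0. A minimal standalone sketch of the same trick (the token ids here are made up, and this is not the model's own code path):

```python
import torch

pad_token_id = 0
input_ids = torch.tensor([[0, 42, 7, 0, 0]])  # pad used as BOS, then real padding

pad_mask = input_ids != pad_token_id             # (B, T) -> position 0 is False
col_indices = torch.arange(input_ids.size(1)).unsqueeze(0)
pad_mask = pad_mask | (col_indices == 0)         # re-enable column 0 without in-place ops

attn_mask = pad_mask.unsqueeze(1) & pad_mask.unsqueeze(2)  # (B, T, T)
print(pad_mask)         # tensor([[ True,  True,  True, False, False]])
print(attn_mask.shape)  # torch.Size([1, 5, 5])
```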
src/models/encoder.py
CHANGED
@@ -13,15 +13,17 @@ Author: Oliver Perrin
 Date: 2025-10-23
 """

-from typing import List, Literal, Optional, Tuple, Union
+from typing import List, Literal, Optional, Tuple, Union, cast

 import torch
 import torch.nn as nn
+from torch.utils.checkpoint import checkpoint

 # Encoder implementation
 from .attention import MultiHeadAttention, T5RelativePositionBias
 from .feedforward import FeedForward
 from .positional_encoding import LearnedPositionalEncoding, PositionalEncoding
+from .t5_layer_norm import T5LayerNorm


 class TransformerEncoderLayer(nn.Module):

@@ -65,8 +67,8 @@ class TransformerEncoderLayer(nn.Module):
             quantization=quantization,
         )

-        self.norm1 = ...
-        self.norm2 = ...
+        self.norm1 = T5LayerNorm(d_model)
+        self.norm2 = T5LayerNorm(d_model)

         self.dropout1 = nn.Dropout(dropout)
         self.dropout2 = nn.Dropout(dropout)

@@ -153,12 +155,14 @@ class TransformerEncoder(nn.Module):
         use_learned_pos_enc: bool = False,
         activation: Literal["gelu", "relu", "swiglu", "gated-gelu"] = "gated-gelu",
         use_relative_position_bias: bool = False,  # T5-style relative position bias
+        gradient_checkpointing: bool = False,
     ):
         super().__init__()
         self.vocab_size = vocab_size
         self.d_model = d_model
         self.pad_token_id = pad_token_id
         self.use_relative_position_bias = use_relative_position_bias
+        self.gradient_checkpointing = gradient_checkpointing

         # Token embedding (only used if forward receives token ids)
         self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_token_id)

@@ -201,8 +205,8 @@
             ]
         )

-        # Final ...
-        self.final_norm = ...
+        # Final T5LayerNorm for Pre-LN stacks
+        self.final_norm = T5LayerNorm(d_model)

         # Dropout applied after embedding + positional encoding (paper uses this)
         self.input_dropout = nn.Dropout(dropout)

@@ -282,7 +286,25 @@
         # Pass through each encoder layer (optionally collect attn)
         for layer in self.layers:
-            ...
+            if self.gradient_checkpointing and self.training:
+                # Gradient checkpointing requires the inputs to require grad
+                # We use a lambda to pass keyword arguments
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        return module(*inputs, mask=mask, collect_attn=collect_attn, position_bias=position_bias)
+                    return custom_forward
+
+                x, attn = cast(
+                    Tuple[torch.Tensor, Optional[torch.Tensor]],
+                    checkpoint(
+                        create_custom_forward(layer),
+                        x,
+                        use_reentrant=False,
+                    ),
+                )
+            else:
+                x, attn = layer(x, mask=mask, collect_attn=collect_attn, position_bias=position_bias)
+
             if collect_attn:
                 attn_weights_per_layer.append(attn)
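The checkpointing pattern in both encoder and decoder relies on a closure to carry keyword arguments past `torch.utils.checkpoint.checkpoint`, which only forwards positional inputs. A self-contained sketch of the same pattern with a stand-in layer (the `scale` kwarg is hypothetical, just to show a captured argument):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(16, 16)

def create_custom_forward(module, scale: float):
    # kwargs are baked into the closure; checkpoint() only passes tensors through
    def custom_forward(*inputs):
        return module(*inputs) * scale
    return custom_forward

x = torch.randn(4, 16, requires_grad=True)  # inputs must require grad for checkpointing
out = checkpoint(create_custom_forward(layer, scale=2.0), x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([4, 16])
```

`use_reentrant=False` is the non-reentrant implementation, which tolerates unused inputs and keyword-free calling better than the legacy reentrant one.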
src/models/factory.py
CHANGED
@@ -14,15 +14,15 @@ from __future__ import annotations

 from dataclasses import dataclass
 from pathlib import Path
-from typing import Literal, Optional, cast
+from typing import Any, Literal, Optional, cast

 import torch
 from transformers import T5ForConditionalGeneration

 from ..data.tokenization import Tokenizer
 from ..utils.config import load_yaml
-from .decoder import TransformerDecoder
-from .encoder import TransformerEncoder
+from .decoder import TransformerDecoder, TransformerDecoderLayer
+from .encoder import TransformerEncoder, TransformerEncoderLayer
 from .heads import ClassificationHead, LMHead
 from .multitask import MultiTaskModel

@@ -35,6 +35,7 @@ class ModelConfig:
     """Configuration describing the transformer architecture."""

     d_model: int = 768
+    vocab_size: Optional[int] = None  # Override tokenizer vocab size (e.g., 32128 for FLAN-T5)
     num_encoder_layers: int = 12
     num_decoder_layers: int = 12
     num_attention_heads: int = 12

@@ -50,6 +51,7 @@
     use_relative_position_bias: bool = (
         False  # T5-style relative position bias (use True for T5/FLAN-T5)
     )
+    gradient_checkpointing: bool = False

     def __post_init__(self):
         if self.d_model % self.num_attention_heads != 0:

@@ -77,6 +79,7 @@ def load_model_config(path: Optional[str | Path]) -> ModelConfig:
     data = load_yaml(str(path)).data
     return ModelConfig(
         d_model=int(data.get("d_model", 512)),
+        vocab_size=data.get("vocab_size", None),  # Optional vocab size override
         num_encoder_layers=int(data.get("num_encoder_layers", 6)),
         num_decoder_layers=int(data.get("num_decoder_layers", 6)),
         num_attention_heads=int(data.get("num_attention_heads", 8)),

@@ -88,6 +91,7 @@
         use_learned_pos_enc=bool(data.get("use_learned_pos_enc", True)),
         activation=str(data.get("activation", "gelu")),
         use_relative_position_bias=bool(data.get("use_relative_position_bias", False)),
+        gradient_checkpointing=bool(data.get("gradient_checkpointing", False)),
     )

@@ -107,11 +111,10 @@
     -> We zero-initialize the bias terms
     """
     print(f"Loading pretrained weights from {model_name}...")
-    t5 = T5ForConditionalGeneration.from_pretrained(model_name)
+    t5 = T5ForConditionalGeneration.from_pretrained(model_name)  # type: ignore[attr-defined]

     # Load shared embeddings (T5 uses shared embeddings for encoder and decoder)
     # Note: T5's vocab is padded to multiple of 128 for efficiency (32100 -> 32128)
-    # Our model uses the tokenizer's actual vocab size, so we only copy the valid tokens
     print("Transferring shared token embeddings...")
     shared_embeddings = t5.shared.weight.data
     our_vocab_size = encoder.embedding.weight.size(0)

@@ -124,6 +127,19 @@
         print(f"  Copying first {min_vocab} token embeddings...")
         encoder.embedding.weight.data[:min_vocab].copy_(shared_embeddings[:min_vocab])
         decoder.embedding.weight.data[:min_vocab].copy_(shared_embeddings[:min_vocab])
+
+        # Initialize any extra tokens (e.g., tokens 32100-32127) with small random values
+        if our_vocab_size > t5_vocab_size:
+            print(
+                f"  Initializing {our_vocab_size - t5_vocab_size} extra padding tokens with small values..."
+            )
+            # Use small random initialization for stability (mean of existing embeddings ± small noise)
+            mean_emb = shared_embeddings.mean(dim=0, keepdim=True)
+            encoder.embedding.weight.data[t5_vocab_size:].normal_(mean=0.0, std=0.02)
+            encoder.embedding.weight.data[t5_vocab_size:] += mean_emb
+            decoder.embedding.weight.data[t5_vocab_size:].copy_(
+                encoder.embedding.weight.data[t5_vocab_size:]
+            )
     else:
         encoder.embedding.weight.data.copy_(shared_embeddings)
         decoder.embedding.weight.data.copy_(shared_embeddings)

@@ -136,11 +152,13 @@
     print("Transferring encoder weights...")
     t5_encoder = t5.encoder

-    for ...
-    ...
+    for custom_layer_untyped, t5_layer in zip(encoder.layers, t5_encoder.block, strict=False):
+        custom_layer = cast(TransformerEncoderLayer, custom_layer_untyped)
+        t5_block = cast(Any, t5_layer)
+        t5_self_attn = t5_block.layer[0].SelfAttention
+        t5_ffn = t5_block.layer[1].DenseReluDense
+        t5_norm1 = t5_block.layer[0].layer_norm
+        t5_norm2 = t5_block.layer[1].layer_norm

         # Self-attention (T5 has no bias in attention projections)
         custom_layer.self_attn.W_Q.weight.data.copy_(t5_self_attn.q.weight.data)

@@ -190,7 +208,7 @@
     if hasattr(encoder, "relative_position_bias") and encoder.relative_position_bias is not None:
         print("Transferring encoder relative position bias...")
         t5_enc_rel_bias = (
-            t5_encoder.block[0].layer[0].SelfAttention.relative_attention_bias.weight.data
+            cast(Any, t5_encoder.block[0]).layer[0].SelfAttention.relative_attention_bias.weight.data
         )
         encoder.relative_position_bias.relative_attention_bias.weight.data.copy_(t5_enc_rel_bias)

@@ -198,13 +216,15 @@
     print("Transferring decoder weights...")
     t5_decoder = t5.decoder

-    for ...
-    ...
+    for custom_layer_untyped, t5_layer in zip(decoder.layers, t5_decoder.block, strict=False):
+        custom_layer = cast(TransformerDecoderLayer, custom_layer_untyped)
+        t5_block = cast(Any, t5_layer)
+        t5_self_attn = t5_block.layer[0].SelfAttention
+        t5_cross_attn = t5_block.layer[1].EncDecAttention
+        t5_ffn = t5_block.layer[2].DenseReluDense
+        t5_norm1 = t5_block.layer[0].layer_norm
+        t5_norm2 = t5_block.layer[1].layer_norm
+        t5_norm3 = t5_block.layer[2].layer_norm

         # Self-attention
         custom_layer.self_attn.W_Q.weight.data.copy_(t5_self_attn.q.weight.data)

@@ -265,7 +285,7 @@
     ):
         print("Transferring decoder self-attention relative position bias...")
         t5_dec_self_rel_bias = (
-            t5_decoder.block[0].layer[0].SelfAttention.relative_attention_bias.weight.data
+            cast(Any, t5_decoder.block[0]).layer[0].SelfAttention.relative_attention_bias.weight.data
         )
         decoder.self_relative_position_bias.relative_attention_bias.weight.data.copy_(
             t5_dec_self_rel_bias

@@ -278,7 +298,7 @@
         print("Transferring decoder cross-attention relative position bias...")
         # Cross-attention relative position bias is in EncDecAttention of first block
         t5_dec_cross_rel_bias = (
-            t5_decoder.block[0].layer[1].EncDecAttention.relative_attention_bias.weight.data
+            cast(Any, t5_decoder.block[0]).layer[1].EncDecAttention.relative_attention_bias.weight.data
        )
        decoder.cross_relative_position_bias.relative_attention_bias.weight.data.copy_(
            t5_dec_cross_rel_bias

@@ -367,9 +387,9 @@ def _load_llama_weights(
     num_layers = min(len(encoder.layers), len(llama.model.layers))

     for i in range(num_layers):
-        llama_layer = llama.model.layers[i]
-        enc_layer = encoder.layers[i]
-        dec_layer = decoder.layers[i]
+        llama_layer = cast(Any, llama.model.layers[i])
+        enc_layer = cast(TransformerEncoderLayer, encoder.layers[i])
+        dec_layer = cast(TransformerDecoderLayer, decoder.layers[i])

         # --- Self-Attention ---
         # Llama: q_proj, k_proj, v_proj, o_proj

@@ -460,15 +480,19 @@ def build_multitask_model(
     if hasattr(tokenizer, "config") and hasattr(tokenizer.config, "max_length"):
         max_len = tokenizer.config.max_length
     elif hasattr(tokenizer, "model_max_length"):
-        max_len = tokenizer.model_max_length
+        max_len = cast(Any, tokenizer).model_max_length
     else:
         max_len = 512  # Default fallback

     # Cast activation to the literal type for mypy
     activation = cast(ActivationType, cfg.activation)

+    # Use cfg.vocab_size (32128) instead of tokenizer.vocab_size (32100)
+    # to match FLAN-T5's padded vocabulary
+    vocab_size = cfg.vocab_size if cfg.vocab_size is not None else tokenizer.vocab_size
+
     encoder = TransformerEncoder(
-        vocab_size=...
+        vocab_size=vocab_size,
         d_model=cfg.d_model,
         num_layers=cfg.num_encoder_layers,
         num_heads=cfg.num_attention_heads,

@@ -480,9 +504,10 @@
         use_learned_pos_enc=cfg.use_learned_pos_enc,
         activation=activation,
         use_relative_position_bias=cfg.use_relative_position_bias,
+        gradient_checkpointing=cfg.gradient_checkpointing,
     )
     decoder = TransformerDecoder(
-        vocab_size=...
+        vocab_size=vocab_size,
         d_model=cfg.d_model,
         num_layers=cfg.num_decoder_layers,
         num_heads=cfg.num_attention_heads,

@@ -494,6 +519,7 @@
         use_learned_pos_enc=cfg.use_learned_pos_enc,
         activation=activation,
         use_relative_position_bias=cfg.use_relative_position_bias,
+        gradient_checkpointing=cfg.gradient_checkpointing,
     )

     # Load pretrained weights if requested (but allow override for inference)

@@ -513,12 +539,14 @@
         )
         _load_pretrained_weights(encoder, decoder, cfg.pretrained_model_name)

+    # T5 uses separate embeddings and lm_head (tie_word_embeddings=False)
+    # Both are initialized from pretrained weights if use_pretrained=True
+    # We do NOT tie them here - they remain independent for better flexibility
+
     model = MultiTaskModel(encoder=encoder, decoder=decoder, decoder_outputs_logits=True)
     model.add_head(
         "summarization",
-        LMHead(
-            d_model=cfg.d_model, vocab_size=tokenizer.vocab_size, tie_embedding=decoder.embedding
-        ),
+        LMHead(d_model=cfg.d_model, vocab_size=vocab_size, tie_embedding=decoder.embedding),
     )
     model.add_head(
         "emotion",
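The embedding transfer above copies the 32100 real FLAN-T5 rows and fills the padded tail (32100 → 32128) with mean-centered noise. A toy version of the slice-wise copy, using made-up sizes in place of the real 32100/32128/768:

```python
import torch

t5_vocab, our_vocab, d_model = 100, 128, 8   # stand-ins for 32100 / 32128 / 768
shared = torch.randn(t5_vocab, d_model)      # pretend pretrained embedding table
ours = torch.empty(our_vocab, d_model)

min_vocab = min(t5_vocab, our_vocab)
ours[:min_vocab] = shared[:min_vocab]                 # copy the real tokens
ours[t5_vocab:].normal_(mean=0.0, std=0.02)           # small noise for the padded tail
ours[t5_vocab:] += shared.mean(dim=0, keepdim=True)   # centered on the embedding mean

print(torch.allclose(ours[:t5_vocab], shared))  # True
```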
src/models/t5_layer_norm.py
ADDED
@@ -0,0 +1,41 @@
+"""T5-style Layer Normalization (RMSNorm without mean centering).
+
+T5 uses a variant of RMSNorm that does NOT subtract the mean.
+This is critical for matching T5's behavior.
+"""
+
+import torch
+import torch.nn as nn
+
+
+class T5LayerNorm(nn.Module):
+    """
+    T5-style layer normalization without mean centering.
+
+    This is similar to RMSNorm but does NOT subtract the mean from x.
+    Formula: output = x / sqrt(mean(x^2) + eps) * weight
+
+    Args:
+        normalized_shape: Input shape (typically d_model)
+        eps: Small constant for numerical stability
+    """
+
+    def __init__(self, normalized_shape: int, eps: float = 1e-6):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(normalized_shape))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            hidden_states: (*, normalized_shape)
+
+        Returns:
+            Normalized tensor of same shape
+        """
+        # T5 uses variance = mean(x^2), does NOT subtract mean
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+
+        # Scale by learned weight (no bias in T5)
+        return self.weight * hidden_states
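A quick check that the new module implements the stated formula and genuinely differs from `nn.LayerNorm`, which mean-centers and carries a bias:

```python
import torch

from src.models.t5_layer_norm import T5LayerNorm  # path as added in this commit

x = torch.randn(2, 5, 16)
norm = T5LayerNorm(16)

# weight starts at 1, so the module output should equal the raw formula
manual = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
print(torch.allclose(norm(x), manual))                      # True
print(torch.allclose(norm(x), torch.nn.LayerNorm(16)(x)))   # False: LayerNorm subtracts the mean
```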
src/training/early_stopping.py
ADDED
@@ -0,0 +1,60 @@
+"""Early stopping implementation for training.
+
+Author: Oliver Perrin
+Date: December 2025
+"""
+
+
+class EarlyStopping:
+    """Stop training when validation loss stops improving.
+
+    Args:
+        patience: Number of epochs to wait before stopping
+        min_delta: Minimum change to qualify as improvement
+        mode: 'min' for loss (lower is better), 'max' for accuracy
+    """
+
+    def __init__(
+        self,
+        patience: int = 3,
+        min_delta: float = 0.001,
+        mode: str = "min"
+    ):
+        self.patience = patience
+        self.min_delta = min_delta
+        self.mode = mode
+        self.counter = 0
+        self.best_value = float('inf') if mode == 'min' else float('-inf')
+        self.early_stop = False
+
+    def __call__(self, metric_value: float) -> bool:
+        """Check if training should stop.
+
+        Args:
+            metric_value: Current metric value (e.g., validation loss)
+
+        Returns:
+            True if training should stop, False otherwise
+        """
+        if self.mode == 'min':
+            improved = metric_value < (self.best_value - self.min_delta)
+        else:
+            improved = metric_value > (self.best_value + self.min_delta)
+
+        if improved:
+            self.best_value = metric_value
+            self.counter = 0
+            return False
+
+        self.counter += 1
+        if self.counter >= self.patience:
+            self.early_stop = True
+            return True
+
+        return False
+
+    def reset(self):
+        """Reset early stopping state."""
+        self.counter = 0
+        self.best_value = float('inf') if self.mode == 'min' else float('-inf')
+        self.early_stop = False
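Usage sketch for the new helper, with fabricated validation losses to show how `patience` and `min_delta` interact:

```python
from src.training.early_stopping import EarlyStopping

stopper = EarlyStopping(patience=2, min_delta=0.01, mode="min")

# 0.79 and 0.795 are within min_delta of the best (0.80), so they don't count
for epoch, val_loss in enumerate([1.00, 0.80, 0.79, 0.795, 0.801], start=1):
    if stopper(val_loss):
        print(f"stop at epoch {epoch}, best={stopper.best_value}")  # stop at epoch 4, best=0.8
        break
```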
src/training/gradient_monitor.py
ADDED
@@ -0,0 +1,102 @@
+"""Gradient monitoring utilities.
+
+Author: Oliver Perrin
+Date: December 2025
+"""
+
+from typing import Dict, Optional
+
+import torch
+import torch.nn as nn
+
+
+class GradientMonitor:
+    """Monitor gradient statistics during training.
+
+    Tracks gradient norms, helps detect gradient issues like vanishing/exploding.
+    """
+
+    def __init__(self, model: nn.Module, log_frequency: int = 100):
+        """Initialize gradient monitor.
+
+        Args:
+            model: Model to monitor
+            log_frequency: Log gradients every N steps
+        """
+        self.model = model
+        self.log_frequency = log_frequency
+        self.step_count = 0
+
+    def compute_grad_norm(self) -> Dict[str, float]:
+        """Compute gradient norm statistics.
+
+        Returns:
+            Dictionary with gradient statistics
+        """
+        total_norm = 0.0
+        max_norm = 0.0
+        num_params = 0
+
+        for p in self.model.parameters():
+            if p.grad is not None:
+                param_norm = p.grad.data.norm(2).item()
+                total_norm += param_norm ** 2
+                max_norm = max(max_norm, param_norm)
+                num_params += 1
+
+        total_norm = total_norm ** 0.5
+
+        return {
+            "grad_norm": total_norm,
+            "grad_norm_max": max_norm,
+            "num_params_with_grad": num_params,
+        }
+
+    def check_gradients(self) -> Dict[str, int]:
+        """Check for gradient issues (NaN, Inf, zero).
+
+        Returns:
+            Dictionary with counts of gradient issues
+        """
+        nan_count = 0
+        inf_count = 0
+        zero_count = 0
+
+        for p in self.model.parameters():
+            if p.grad is not None:
+                if torch.isnan(p.grad).any():
+                    nan_count += 1
+                if torch.isinf(p.grad).any():
+                    inf_count += 1
+                if (p.grad == 0).all():
+                    zero_count += 1
+
+        return {
+            "nan_grads": nan_count,
+            "inf_grads": inf_count,
+            "zero_grads": zero_count,
+        }
+
+    def log_gradients(self, step: Optional[int] = None) -> Optional[Dict[str, float]]:
+        """Log gradient statistics if it's time.
+
+        Args:
+            step: Current training step (uses internal counter if None)
+
+        Returns:
+            Gradient statistics if logged, None otherwise
+        """
+        if step is None:
+            step = self.step_count
+            self.step_count += 1
+
+        if step % self.log_frequency == 0:
+            stats = self.compute_grad_norm()
+            issues = self.check_gradients()
+
+            # Combine stats
+            all_stats = {**stats, **issues}
+
+            return all_stats
+
+        return None
src/training/safe_compile.py
CHANGED
@@ -1,86 +1,52 @@
-"""
-Safe torch.compile configuration that prevents NaN issues.
-...
+"""Safe defaults for `torch.compile` to reduce instability in tests and training."""

+from __future__ import annotations
+
+from typing import Any
+
 import torch

+
+def _set_attr(obj: object, name: str, value: Any) -> None:
+    """Set attribute on dynamic objects only if it exists (keeps static checkers quiet)."""
+
+    target = getattr(obj, name, None)
+    if target is not None:
+        setattr(obj, name, value)
+
+
 def compile_model_safe(
     model: torch.nn.Module,
     mode: str = "default",
+    dynamic: bool | None = None,
 ) -> torch.nn.Module:
-    """
-    Compile model with inductor backend and safety guardrails.
-
-    CUDA graphs (reduce-overhead mode) don't work with dynamic shapes or
-    shared embeddings like in T5.
-
-    Args:
-        model: Model to compile
-        mode: Compilation mode ("default" recommended, avoid "reduce-overhead")
-    ...
-        # Explicitly disable CUDA graphs
-        if hasattr(cfg, "triton"):
-            if hasattr(cfg.triton, "cudagraphs"):
-                cfg.triton.cudagraphs = False
-            if hasattr(cfg.triton, "max_autotune_gemm"):
-                cfg.triton.max_autotune_gemm = False
-
-        # Compile with inductor (no CUDA graphs)
-        compiled = torch.compile(model, mode=mode, fullgraph=False, dynamic=True)
-        print(f"✓ Compiled with inductor ({mode} mode)")
-        return compiled
-
-    except Exception as e:
-        print(f"⚠ Inductor compilation failed: {e}")
-        print("  Falling back to aot_eager")
-        try:
-            return torch.compile(model, backend="aot_eager")
-        except Exception:
-            print("  Using uncompiled model")
-            return model
-
-
-def apply_safe_config():
-    """Apply safe configuration to torch._inductor before any compilation."""
-    if hasattr(torch, "_inductor"):
-        cfg = torch._inductor.config
-        if hasattr(cfg, "epilogue_fusion"):
-            cfg.epilogue_fusion = False
-        if hasattr(cfg, "coordinate_descent_tuning"):
-            cfg.coordinate_descent_tuning = False
-        if hasattr(cfg, "triton"):
-            if hasattr(cfg.triton, "cudagraphs"):
-                cfg.triton.cudagraphs = False
-            if hasattr(cfg.triton, "max_autotune_gemm"):
-                cfg.triton.max_autotune_gemm = False
-
-    # Dynamo config for stability
-    torch._dynamo.config.suppress_errors = True
-    torch._dynamo.config.cache_size_limit = 64
+    """Safely compile model with inductor backend.
+
+    Parameters mirror `torch.compile` but default to conservative settings.
+    """
+
+    return torch.compile(model, backend="inductor", mode=mode, dynamic=dynamic)
+
+
+def apply_safe_config() -> None:
+    """Apply conservative torch._inductor and torch._dynamo settings if present."""
+
+    inductor = getattr(torch, "_inductor", None)
+    cfg = getattr(inductor, "config", None) if inductor is not None else None
+
+    if cfg is not None:
+        _set_attr(cfg, "epilogue_fusion", False)
+        _set_attr(cfg, "coordinate_descent_tuning", False)
+        triton_cfg = getattr(cfg, "triton", None)
+        if triton_cfg is not None:
+            _set_attr(triton_cfg, "cudagraphs", False)
+            _set_attr(triton_cfg, "max_autotune_gemm", False)
+
+    dynamo_cfg = getattr(torch, "_dynamo", None)
+    if dynamo_cfg is not None:
+        dyn_config = getattr(dynamo_cfg, "config", None)
+        if dyn_config is not None:
+            _set_attr(dyn_config, "suppress_errors", True)
+            _set_attr(dyn_config, "cache_size_limit", 64)
+
     print("✓ Applied safe inductor configuration")
src/training/trainer.py
CHANGED
@@ -2,7 +2,7 @@
 Multi-task Trainer for LexiMind.

 Handles training across summarization, emotion, and topic heads with mixed-precision,
-gradient accumulation, and MLflow logging.
+gradient accumulation, gradient monitoring, early stopping, and MLflow logging.

 Author: Oliver Perrin
 Date: December 2025

@@ -10,6 +10,7 @@ Date: December 2025

 from __future__ import annotations

+import math
 import sys
 import time
 from collections import defaultdict

@@ -19,13 +20,36 @@ from typing import Any, Callable, Dict, List
 import mlflow
 import torch
 import torch.nn.functional as F
+from torch.optim.lr_scheduler import LambdaLR
 from torch.utils.data import DataLoader
 from tqdm import tqdm

 from ..data.tokenization import Tokenizer
+from .early_stopping import EarlyStopping
+from .gradient_monitor import GradientMonitor
 from .metrics import accuracy, multilabel_f1, rouge_like
 from .nan_debugger import NaNDetector

+
+def _get_cosine_schedule_with_warmup(
+    optimizer: torch.optim.Optimizer,
+    num_warmup_steps: int,
+    num_training_steps: int,
+    min_lr_ratio: float = 0.1,
+) -> LambdaLR:
+    """Create cosine LR schedule with linear warmup."""
+
+    def lr_lambda(current_step: int) -> float:
+        if current_step < num_warmup_steps:
+            return float(current_step) / float(max(1, num_warmup_steps))
+        progress = float(current_step - num_warmup_steps) / float(
+            max(1, num_training_steps - num_warmup_steps)
+        )
+        return max(min_lr_ratio, 0.5 * (1.0 + math.cos(math.pi * progress)))
+
+    return LambdaLR(optimizer, lr_lambda)
+
+
 # --------------- Configuration ---------------

@@ -42,6 +66,15 @@ class TrainerConfig:
     experiment_name: str = "LexiMind"
     run_name: str | None = None
     gradient_accumulation_steps: int = 1
+    # Learning rate scheduler
+    scheduler_type: str = "cosine"  # "cosine", "linear", or "constant"
+    warmup_steps: int = 0
+    num_training_steps: int = 0  # Set automatically if 0
+    # Early stopping
+    early_stopping_patience: int | None = None  # None = disabled
+    early_stopping_min_delta: float = 0.001
+    # Gradient monitoring
+    log_grad_norm_frequency: int = 100  # Log gradient norms every N steps


 # --------------- Trainer ---------------

@@ -61,6 +94,8 @@ class Trainer:
         self.config = config
         self.device = device
         self.tokenizer = tokenizer
+        self.scheduler: LambdaLR | None = None  # Set in fit()
+        self.global_step = 0  # Track global step for scheduler

         # Task losses
         self.emotion_loss = torch.nn.BCEWithLogitsLoss()

@@ -76,6 +111,18 @@
         self.nan_skip_count = 0
         self.max_nan_skips = 50

+        # Gradient monitoring
+        self.grad_monitor = GradientMonitor(model, log_frequency=config.log_grad_norm_frequency)
+
+        # Early stopping
+        self.early_stopping: EarlyStopping | None = None
+        if config.early_stopping_patience is not None:
+            self.early_stopping = EarlyStopping(
+                patience=config.early_stopping_patience,
+                min_delta=config.early_stopping_min_delta,
+                mode="min"  # Lower loss is better
+            )
+
         # Track current step for debugging
         self._current_step = 0

@@ -87,6 +134,46 @@
         torch.backends.cuda.enable_flash_sdp(True)
         torch.backends.cuda.enable_mem_efficient_sdp(True)

+    def _setup_scheduler(self, train_loaders: Dict[str, DataLoader], start_epoch: int = 1) -> None:
+        """Initialize learning rate scheduler based on config."""
+        # Calculate steps per epoch once
+        max_batches = max(len(loader) for loader in train_loaders.values())
+        self.steps_per_epoch = max_batches // max(1, self.config.gradient_accumulation_steps)
+
+        if self.config.scheduler_type == "constant":
+            return  # No scheduler needed
+
+        # Some tests pass a MagicMock optimizer without param_groups; skip scheduler gracefully
+        try:
+            _ = self.optimizer.param_groups  # type: ignore[attr-defined]
+        except AttributeError:
+            self.scheduler = None
+            return
+
+        # Calculate total training steps
+        epochs_remaining = max(0, self.config.max_epochs - (start_epoch - 1))
+        num_training_steps = self.config.num_training_steps or (
+            self.steps_per_epoch * epochs_remaining
+        )
+
+        warmup_steps = self.config.warmup_steps
+        print(
+            f"✓ LR Scheduler: {self.config.scheduler_type} with {warmup_steps} warmup steps, {num_training_steps} total steps"
+        )
+
+        if self.config.scheduler_type == "cosine":
+            self.scheduler = _get_cosine_schedule_with_warmup(
+                self.optimizer, warmup_steps, num_training_steps
+            )
+        elif self.config.scheduler_type == "linear":
+
+            def linear_decay(step: int) -> float:
+                if step < warmup_steps:
+                    return float(step) / float(max(1, warmup_steps))
+                return max(0.0, 1.0 - (step - warmup_steps) / (num_training_steps - warmup_steps))
+
+            self.scheduler = LambdaLR(self.optimizer, linear_decay)
+
     # --------------- Training Loop ---------------

     def fit(

@@ -94,17 +181,24 @@
         train_loaders: Dict[str, DataLoader],
         val_loaders: Dict[str, DataLoader] | None = None,
         checkpoint_callback: Callable | None = None,
+        start_epoch: int = 1,
     ) -> Dict[str, Dict[str, float]]:
         """Train model across all tasks with progress tracking."""
         history: Dict[str, Dict[str, float]] = {}
         total_start = time.perf_counter()

+        # Setup LR scheduler
+        self._setup_scheduler(train_loaders, start_epoch=start_epoch)
+        # Initialize global_step to reflect completed epochs when resuming
+        if hasattr(self, "steps_per_epoch"):
+            self.global_step = max(0, (start_epoch - 1) * self.steps_per_epoch)
+
         with mlflow.start_run(run_name=self.config.run_name):
             self._log_config()

             # Epoch progress bar
             epoch_pbar = tqdm(
-                range(...
+                range(start_epoch, self.config.max_epochs + 1),
                 desc="Training",
                 unit="epoch",
                 position=0,

@@ -129,6 +223,15 @@
                 if "summarization" in val_loaders:
                     self._validate_generation(val_loaders["summarization"], epoch)

+                # Early stopping check
+                if self.early_stopping is not None:
+                    val_loss = val_metrics.get("total_loss", val_metrics.get("summarization_loss", float('inf')))
+                    if self.early_stopping(val_loss):
+                        tqdm.write(f"\n⚠ Early stopping triggered at epoch {epoch}")
+                        tqdm.write(f"  Best validation loss: {self.early_stopping.best_value:.4f}")
+                        tqdm.write(f"  Patience exhausted ({self.early_stopping.patience} epochs)")
+                        break
+
                 # Checkpoint
                 if checkpoint_callback:
                     checkpoint_callback(epoch, self.model, history)

@@ -256,7 +359,19 @@
         return averaged

     def _optimizer_step(self) -> None:
-        """...
+        """Perform optimizer step with gradient clipping."""
+        # Log gradient norms before clipping
+        grad_stats = self.grad_monitor.log_gradients(self.global_step)
+        if grad_stats is not None:
+            tqdm.write(
+                f"  [Step {self.global_step}] "
+                f"Grad norm: {grad_stats['grad_norm']:.4f}, "
+                f"Max: {grad_stats['grad_norm_max']:.4f}"
+            )
+            # Log to MLflow
+            for key, val in grad_stats.items():
+                mlflow.log_metric(f"grad_{key}", val, step=self.global_step)
+
         # Check gradients for NaN/Inf BEFORE clipping
         nan_grad = self.nan_detector.check_gradients(self._current_step)
         if nan_grad is not None:

@@ -280,6 +395,14 @@
         self.optimizer.zero_grad()

+        # Step the learning rate scheduler
+        if self.scheduler is not None:
+            self.scheduler.step()
+            self.global_step += 1
+            # Log learning rate
+            current_lr = self.scheduler.get_last_lr()[0]
+            mlflow.log_metric("learning_rate", current_lr, step=self.global_step)
+
         # Check parameters for NaN AFTER update
         nan_param = self.nan_detector.check_parameters(self._current_step)
         if nan_param is not None:

@@ -287,6 +410,31 @@
             f"NaN in parameter {nan_param} after optimizer step at step {self._current_step}!"
         )

+    def _clip_embedding_gradients(self, max_norm: float = 5.0) -> None:
+        """Clip embedding gradients only if they exceed threshold.
+
+        Less aggressive clipping to allow learning while preventing
+        overflow with inductor backend + gradient accumulation.
+        """
+        for name, param in self.model.named_parameters():
+            if param.grad is not None and "embedding" in name.lower():
+                grad = param.grad
+                # Only fix actual NaN/Inf, don't preemptively clip
+                if torch.isnan(grad).any() or torch.isinf(grad).any():
+                    # Count NaNs for monitoring
+                    nan_count = torch.isnan(grad).sum().item()
+                    inf_count = torch.isinf(grad).sum().item()
+                    if nan_count > 0 or inf_count > 0:
+                        # Replace with zeros only where invalid
+                        param.grad = torch.where(
+                            torch.isnan(grad) | torch.isinf(grad), torch.zeros_like(grad), grad
+                        )
+                else:
+                    # Normal gradient - only clip if extremely large
+                    grad_norm = param.grad.norm()
+                    if grad_norm > max_norm:
+                        param.grad = param.grad * (max_norm / (grad_norm + 1e-6))
+
     def _get_batch(
         self, iterators: Dict, loader: DataLoader, task: str
     ) -> Dict[str, torch.Tensor] | None:

@@ -341,6 +489,8 @@
             inputs["src_mask"] = batch["src_mask"]

         logits = self.model.forward("summarization", inputs)
+
+        # Compute loss with proper masking
         loss = F.cross_entropy(
             logits.view(-1, logits.size(-1)),
             batch["labels"].view(-1),

@@ -348,6 +498,11 @@
             label_smoothing=self.config.label_smoothing,
         )

+        # Sanity check logits
+        if self.global_step % 100 == 0:
+            with torch.no_grad():
+                tqdm.write(f"  [Step {self.global_step}] Summarization logits: mean={logits.mean().item():.2f}, std={logits.std().item():.2f}, loss={loss.item():.4f}")
+
         # Quick ROUGE estimate
         preds = self.tokenizer.decode_batch(logits.argmax(dim=-1).tolist())
         refs = self._decode_labels(batch["labels"])
tests/test_inference/test_pipeline.py
CHANGED
@@ -2,11 +2,18 @@

 from __future__ import annotations

+import sys
+import warnings
 from pathlib import Path
 from typing import cast

+import pytest
 import torch

+PROJECT_ROOT = Path(__file__).resolve().parents[2]
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+
 from src.data.tokenization import Tokenizer, TokenizerConfig
 from src.inference.pipeline import (
     EmotionPrediction,

@@ -16,6 +23,21 @@
 )
 from src.utils.labels import LabelMetadata

+# Silence noisy DeprecationWarnings from underlying tokenizer bindings used in tests
+warnings.filterwarnings("ignore", category=DeprecationWarning)
+warnings.filterwarnings(
+    "ignore",
+    message=r"builtin type SwigPy.*has no __module__ attribute",
+    category=DeprecationWarning,
+)
+warnings.filterwarnings(
+    "ignore",
+    category=DeprecationWarning,
+    module=r"importlib\\._bootstrap",
+)
+
+pytestmark = pytest.mark.filterwarnings("ignore::DeprecationWarning")
+

 def _local_tokenizer_config() -> TokenizerConfig:
     root = Path(__file__).resolve().parents[2]

@@ -48,7 +70,7 @@ class DummyDecoder(torch.nn.Module):
         device: torch.device,
         **kwargs: object,
     ) -> torch.Tensor:
-        seq = self.sequence.to(device)
+        seq = cast(torch.Tensor, self.sequence).to(device)
         if seq.numel() > max_len:
             seq = seq[:max_len]
         batch = memory.size(0)

@@ -70,9 +92,9 @@ class DummyModel(torch.nn.Module):
     ) -> torch.Tensor:  # pragma: no cover - simple dispatch
         batch = inputs["input_ids"].size(0)
         if task == "emotion":
-            return self._emotion_logits.unsqueeze(0).repeat(batch, 1)
+            return cast(torch.Tensor, self._emotion_logits).unsqueeze(0).repeat(batch, 1)
         if task == "topic":
-            return self._topic_logits.unsqueeze(0).repeat(batch, 1)
+            return cast(torch.Tensor, self._topic_logits).unsqueeze(0).repeat(batch, 1)
         raise KeyError(task)

@@ -85,7 +107,7 @@ def _build_pipeline() -> InferencePipeline:
         tokenizer=tokenizer,
         emotion_labels=metadata.emotion,
         topic_labels=metadata.topic,
-        config=InferenceConfig(summary_max_length=12),
+        config=InferenceConfig(summary_max_length=12, summary_formatting=False),
     )
tests/test_models/test_visualizations.py
CHANGED
@@ -34,7 +34,7 @@ def test_attention_visualization():
     V = torch.eye(seq_len, d_k).unsqueeze(0)  # Identity-like

     # Compute attention
-    ...
+    _output, weights = attention(Q, K, V, return_attn_weights=True)

     # Plot attention weights
     plt.figure(figsize=(8, 6))
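For the restored test call above, a minimal standalone check of the same contract, assuming a scaled dot-product attention that returns `(output, weights)` as the test's `attention` module does: each row of the attention weights should sum to 1.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 5, 8
Q = K = torch.randn(1, seq_len, d_k)
V = torch.eye(seq_len, d_k).unsqueeze(0)

# plain scaled dot-product attention, returning the weights like the test expects
scores = Q @ K.transpose(-2, -1) / d_k**0.5
weights = F.softmax(scores, dim=-1)
output = weights @ V

print(torch.allclose(weights.sum(dim=-1), torch.ones(1, seq_len)))  # True
```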