
---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
- FacebookAI/roberta-base
tags:
- comics
- composition
- comic
- comic-analysis
- page
- fusion
---

The model code and documentation repository is at https://github.com/RichardScottOZ/comic-analysis

This model uses transformer-based multimodal fusion of image and text to produce embeddings for querying comics by visual similarity or by text.

For more detail, see the repository above.

---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- vit
- roberta
license: mit
---

# ClosureLiteSimple (Version 1 - Comic Panel Encoder)

ClosureLiteSimple is the Version 1 precursor to the Stage 3 panel encoder within the [Comic Analysis Framework](https://github.com/RichardScottOZ/Comic-Analysis).

It is a multimodal neural network that fuses image crops, textual dialogue, and compositional metadata into a unified **384-dimensional** embedding per comic panel. It can also aggregate the panel embeddings of a page into a single page-level embedding using an attention mechanism.

*(Note: This model is considered deprecated in favor of the newer `comic-panel-encoder-v1` which utilizes SigLIP, ResNet, and an improved Adaptive Fusion Gate).*

## Model Architecture

The `ClosureLiteSimple` model consists of the `PanelAtomizerLite` and a `SimpleAttention` mechanism:

1. **Vision Encoder (`google/vit-base-patch16-224`):**
   - Extracts features from $224 \times 224$ panel image crops.
   - Outputs projected to $384$-d.
2. **Text Encoder (`roberta-base`):**
   - Encodes panel dialogue, narration, or OCR text.
   - Outputs projected to $384$-d.
3. **Compositional Encoder (MLP):**
   - Takes a 7-dimensional vector representing the bounding box geometry (e.g., aspect ratio, relative area, normalized center coordinates).
   - Projects through hidden layers to $384$-d.
4. **Gated Fusion (`GatedFusion`):**
   - Concatenates the three modality outputs and computes a learned softmax gate.
   - Outputs a weighted sum of the Vision, Text, and Composition features, resulting in the final $384$-d **Panel Embedding**.
5. **Page Aggregation (`SimpleAttention`):**
   - Uses multi-head attention to pool the variable number of Panel Embeddings on a single page into a unified $384$-d **Page Embedding**.
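
The fusion and pooling steps above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's PyTorch code: the function names, the gate parameters `W_gate` and `b_gate`, and the pooling query `page_query` are hypothetical stand-ins, random values replace trained weights, and the pooling is simplified to a single head with one query vector rather than the multi-head attention the model uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(vision, text, comp, W_gate, b_gate):
    """Fuse three (N, d) modality features into one (N, d) panel embedding.

    W_gate (3d, 3) and b_gate (3,) are hypothetical gate parameters.
    """
    concat = np.concatenate([vision, text, comp], axis=-1)  # (N, 3d)
    gate = softmax(concat @ W_gate + b_gate)                # (N, 3), rows sum to 1
    stacked = np.stack([vision, text, comp], axis=1)        # (N, 3, d)
    fused = (gate[:, :, None] * stacked).sum(axis=1)        # weighted sum -> (N, d)
    return fused, gate

def attention_pool(panels, mask, query):
    """Pool (N, d) panel embeddings into one (d,) page embedding.

    Single-head simplification; assumes at least one panel is valid.
    """
    d = panels.shape[-1]
    scores = panels @ query / np.sqrt(d)       # (N,) scaled dot-product scores
    scores = np.where(mask, scores, -np.inf)   # padded panels get zero weight
    weights = softmax(scores)
    return weights @ panels

rng = np.random.default_rng(0)
d, n = 384, 3
vision, text, comp = (rng.standard_normal((n, d)) for _ in range(3))
W_gate = rng.standard_normal((3 * d, 3)) * 0.01
b_gate = np.zeros(3)
page_query = rng.standard_normal(d)

panel_emb, gate = gated_fusion(vision, text, comp, W_gate, b_gate)
mask = np.array([True, True, False])  # third slot is padding
page_emb = attention_pool(panel_emb, mask, page_query)
print(panel_emb.shape, page_emb.shape)
```

Because the gate is a softmax over the three modalities, each panel embedding is a convex combination of its vision, text, and composition features, and the mask keeps padded panel slots out of the page embedding.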

## Usage

The codebase for this model resides in the `src/version1/` directory of the repository.

### Example: Loading and Inference

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

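# Requires cloning the GitHub repo; the model code lives in src/version1/,
# so that directory needs to be importable (e.g. on PYTHONPATH)
from closure_lite_simple_framework import ClosureLiteSimple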

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = ClosureLiteSimple(d=384, num_heads=4, temperature=0.1).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/closure-lite-simple/resolve/main/best_model.pt",
    map_location=device
)
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs (Example: A page with 2 panels)
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Dummy Image Crops (B=1 page, N=2 panels, C=3, H=224, W=224)
images = torch.stack([
    transform(Image.new('RGB', (224, 224))),
    transform(Image.new('RGB', (224, 224)))
]).unsqueeze(0).to(device)

# Dummy Text
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_enc = tokenizer(["Panel 1 text", "Panel 2 text"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attention_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Dummy Composition (B=1, N=2, F=7)
comp_feats = torch.zeros((1, 2, 7)).to(device)

# Valid Panel Mask (B=1, N=2)
panel_mask = torch.tensor([[True, True]]).to(device)

# 3. Generate Embeddings
with torch.no_grad():
    panel_embeddings, page_embedding = model(
        images, input_ids, attention_mask, comp_feats, panel_mask
    )

print(f"Panel Embeddings Shape: {panel_embeddings.shape}") # (1, 2, 384)
print(f"Page Embedding Shape: {page_embedding.shape}")     # (1, 384)
```
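
Once embeddings exist, a similarity query reduces to a nearest-neighbour search. The following is a minimal sketch using cosine similarity in NumPy; the helper `cosine_top_k` and the random `page_index` array are hypothetical and only stand in for stored page embeddings (for example, tensors converted with `page_embedding.cpu().numpy()`). It is not the retrieval code used in the repository.

```python
import numpy as np

def cosine_top_k(query, index, k=3):
    """Return indices and cosine scores of the k rows of `index` closest to `query`.

    query: (d,) embedding; index: (M, d) matrix of stored embeddings.
    """
    q = query / (np.linalg.norm(query) + 1e-12)
    rows = index / (np.linalg.norm(index, axis=1, keepdims=True) + 1e-12)
    scores = rows @ q                  # (M,) cosine similarities
    top = np.argsort(-scores)[:k]      # highest similarity first
    return top, scores[top]

rng = np.random.default_rng(0)
page_index = rng.standard_normal((100, 384))  # stand-in for 100 stored page embeddings
# Query with a lightly perturbed copy of row 42, so row 42 should rank first.
query = page_index[42] + 0.01 * rng.standard_normal(384)

top, scores = cosine_top_k(query, page_index, k=3)
print(top, scores)
```

For larger collections, the same normalised vectors can be handed to an approximate nearest-neighbour index instead of the brute-force matrix product shown here.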

## Intended Use & Limitations
- **Intended Use:** Originally designed for exploring multimodal embedding spaces and building basic visual/textual retrieval prototypes (like CoMiX v1).
- **Limitations:**
  - **Modality Dominance:** Analysis of this model revealed that when one modality (e.g., text) is missing or uninformative at inference time, the `GatedFusion` mechanism struggles to fall back gracefully on the visual features, often producing collapsed or non-discriminative embeddings for single-modality queries.
  - **Deprecated:** This architecture has been superseded by Stage 3 (`comic-panel-encoder-v1`), which utilizes independent modality projection and a masked Adaptive Fusion gate to solve the dominance issues.
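
To illustrate what masking a fusion gate can mean, here is a minimal NumPy sketch. It shows one common approach, setting the gate logits of absent modalities to negative infinity before the softmax, and is an assumption for illustration only: it is not taken from `comic-panel-encoder-v1`, and `masked_gate` is a hypothetical name.

```python
import numpy as np

def masked_gate(logits, present):
    """Softmax over modality logits, ignoring modalities flagged as absent.

    logits: (N, 3) raw gate scores for (vision, text, composition).
    present: (N, 3) bool, True where the modality is available.
    Assumes every row has at least one available modality.
    """
    masked = np.where(present, logits, -np.inf)         # absent -> zero weight
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# The text logit would dominate an unmasked softmax, but text is missing here.
logits = np.array([[0.2, 2.0, 0.1]])
present = np.array([[True, False, True]])

weights = masked_gate(logits, present)
print(weights)  # text weight is 0; vision and composition share the rest
```

With the mask applied, a missing modality cannot absorb gate weight, so the fused embedding is built only from the features that are actually present.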

## Citation
Please reference the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) when utilizing this architecture.