---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
- FacebookAI/roberta-base
tags:
- comics
- composition
- comic
- comic-analysis
- page
- fusion
---

The model code and documentation repository is at https://github.com/RichardScottOZ/comic-analysis

This model uses transformer-based multimodal fusion of image and text to produce embeddings, so that comics can be queried by similarity or by text.

For more detail, see the repo above.

---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- vit
- roberta
license: mit
---

# ClosureLiteSimple (Version 1 - Comic Panel Encoder)

ClosureLiteSimple is the Version 1 precursor to the Stage 3 panel encoder within the [Comic Analysis Framework](https://github.com/RichardScottOZ/Comic-Analysis).

It is a multimodal neural network that fuses image crops, dialogue text, and compositional metadata into a single **384-dimensional** embedding per comic panel. It can also aggregate the panel embeddings on a page into one page-level embedding using an attention mechanism.

*(Note: This model is deprecated in favor of the newer `comic-panel-encoder-v1`, which uses SigLIP, ResNet, and an improved Adaptive Fusion Gate.)*

## Model Architecture

The `ClosureLiteSimple` model consists of the `PanelAtomizerLite` and a `SimpleAttention` mechanism:

1. **Vision Encoder (`google/vit-base-patch16-224`):**
   - Extracts features from $224 \times 224$ panel image crops.
   - Outputs projected to $384$-d.
2. **Text Encoder (`roberta-base`):**
   - Encodes panel dialogue, narration, or OCR text.
   - Outputs projected to $384$-d.
3. **Compositional Encoder (MLP):**
   - Takes a 7-dimensional vector representing the bounding box geometry (e.g., aspect ratio, relative area, normalized center coordinates).
   - Projects through hidden layers to $384$-d.
4. **Gated Fusion (`GatedFusion`):**
   - Concatenates the three modality outputs and computes a learned softmax gate.
   - Outputs a weighted sum of the Vision, Text, and Composition features, resulting in the final $384$-d **Panel Embedding**.
5. **Page Aggregation (`SimpleAttention`):**
   - Uses multi-head attention to pool the variable number of Panel Embeddings on a single page into a unified $384$-d **Page Embedding**.
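
The fusion and pooling steps above can be illustrated with a small, dependency-free sketch. It is written in plain Python so it runs without PyTorch, and it is not the code from the repository: the gate layer (`gate_weights`, `gate_bias`), the single-head pooling with a learned `query`, and the toy dimension `d = 4` are all assumptions made for illustration. The actual `GatedFusion` and `SimpleAttention` modules may differ in detail (for example, `SimpleAttention` is described as multi-head).

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a non-empty list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(vision, text, comp, gate_weights, gate_bias):
    """Fuse three equal-length vectors with a softmax gate.

    gate_weights (3 rows of length 3*d) and gate_bias (3 floats) stand in
    for a learned linear layer; they are hypothetical, not the repo's weights.
    """
    concat = vision + text + comp  # list concatenation, length 3*d
    logits = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(gate_weights, gate_bias)
    ]
    gate = softmax(logits)  # one weight per modality, summing to 1
    fused = [
        gate[0] * v + gate[1] * t + gate[2] * c
        for v, t, c in zip(vision, text, comp)
    ]
    return fused, gate


def attention_pool(panels, mask, query):
    """Pool a variable number of panel vectors into one page vector.

    Single-head simplification with a hypothetical learned `query`.
    Panels whose mask entry is False (padding) are ignored; at least one
    panel must be valid.
    """
    d = len(query)
    valid = [p for p, keep in zip(panels, mask) if keep]
    scores = [sum(q * x for q, x in zip(query, p)) / math.sqrt(d) for p in valid]
    weights = softmax(scores)
    return [sum(w * p[i] for w, p in zip(weights, valid)) for i in range(d)]


# Toy dimensions: d = 4 instead of 384, random values instead of encoder outputs.
d = 4
rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


gate_weights = [rand_vec(3 * d) for _ in range(3)]
gate_bias = [0.0, 0.0, 0.0]

# Two real panels plus one padded slot on the same page.
panels = []
gates = []
for _ in range(2):
    fused, gate = gated_fusion(rand_vec(d), rand_vec(d), rand_vec(d), gate_weights, gate_bias)
    panels.append(fused)
    gates.append(gate)
panels.append([0.0] * d)  # padding slot, excluded by the mask
mask = [True, True, False]

page = attention_pool(panels, mask, query=rand_vec(d))
print("gate for panel 0:", [round(g, 3) for g in gates[0]])
print("page embedding length:", len(page))
```

Because the gate is a softmax, each fused panel vector is a convex combination of the three modality vectors, and the page vector is a convex combination of the valid panel vectors.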

## Usage

The codebase for this model resides in the `src/version1/` directory of the repository.

### Example: Loading and Inference

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

# Requires cloning the GitHub repo
from closure_lite_simple_framework import ClosureLiteSimple

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = ClosureLiteSimple(d=384, num_heads=4, temperature=0.1).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/closure-lite-simple/resolve/main/best_model.pt",
    map_location=device
)
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs (Example: A page with 2 panels)
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Dummy Image Crops (B=1 page, N=2 panels, C=3, H=224, W=224)
images = torch.stack([
    transform(Image.new('RGB', (224, 224))),
    transform(Image.new('RGB', (224, 224)))
]).unsqueeze(0).to(device)

# Dummy Text
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_enc = tokenizer(["Panel 1 text", "Panel 2 text"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attention_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Dummy Composition (B=1, N=2, F=7)
comp_feats = torch.zeros((1, 2, 7)).to(device)

# Valid Panel Mask (B=1, N=2)
panel_mask = torch.tensor([[True, True]]).to(device)

# 3. Generate Embeddings
with torch.no_grad():
    panel_embeddings, page_embedding = model(
        images, input_ids, attention_mask, comp_feats, panel_mask
    )

print(f"Panel Embeddings Shape: {panel_embeddings.shape}") # (1, 2, 384)
print(f"Page Embedding Shape: {page_embedding.shape}")     # (1, 384)
```
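
Once embeddings exist, a similarity query reduces to ranking stored vectors against a query vector. The sketch below shows one common way to do this with cosine similarity. The names `cosine_similarity`, `rank_pages`, and the toy `index` are hypothetical and not part of the repository. The 3-dimensional vectors stand in for real 384-d embeddings, which could be obtained from the example above with something like `page_embedding[0].tolist()`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query, page_embeddings, top_k=3):
    """Return the top_k (page_id, score) pairs, most similar first."""
    scored = [
        (page_id, cosine_similarity(query, emb))
        for page_id, emb in page_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy index; real entries would be 384-d page embeddings.
index = {
    "page_001": [1.0, 0.0, 0.0],
    "page_002": [0.7, 0.7, 0.0],
    "page_003": [0.0, 0.0, 1.0],
}

results = rank_pages([1.0, 0.1, 0.0], index, top_k=2)
for page_id, score in results:
    print(f"{page_id}: {score:.3f}")
```

For larger collections, the same ranking is usually done with a vectorized matrix product or an approximate nearest-neighbor index rather than a Python loop.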

## Intended Use & Limitations
- **Intended Use:** Originally designed for exploring multimodal embedding spaces and building basic visual/textual retrieval prototypes (like CoMiX v1).
- **Limitations:**
  - **Modality Dominance:** Analysis of this model showed that when one modality (e.g., text) was missing or uninformative at inference time, the `GatedFusion` mechanism did not fall back gracefully to the visual features. Single-modality queries often produced collapsed or non-discriminative embeddings.
  - **Deprecated:** This architecture has been superseded by Stage 3 (`comic-panel-encoder-v1`), which utilizes independent modality projection and a masked Adaptive Fusion gate to solve the dominance issues.
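
To make the dominance issue concrete, the sketch below shows one generic way a softmax gate can be made robust to a missing modality: exclude the absent modality's logit before normalizing, so the remaining weights still sum to 1. This is an illustration of the general idea only. Whether `comic-panel-encoder-v1` implements its masked gate this way is an assumption, and the function name `masked_softmax` and the logit values are hypothetical.

```python
import math


def masked_softmax(logits, present):
    """Softmax over the entries marked present; absent entries get weight 0."""
    kept = [x for x, p in zip(logits, present) if p]
    if not kept:
        raise ValueError("at least one modality must be present")
    peak = max(kept)
    exps = [math.exp(x - peak) if p else 0.0 for x, p in zip(logits, present)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical gate logits in the order: vision, text, composition.
logits = [0.2, 2.5, -0.3]

all_present = masked_softmax(logits, [True, True, True])
text_missing = masked_softmax(logits, [True, False, True])

print("all modalities:", [round(w, 3) for w in all_present])
print("text missing:  ", [round(w, 3) for w in text_missing])
```

With an unmasked gate, a large text logit would keep most of the weight on an empty text feature. With the mask, that weight is redistributed to vision and composition.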

## Citation
Please reference the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) when utilizing this architecture.