---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
- FacebookAI/roberta-base
tags:
- comics
- composition
- comic
- comic-analysis
- page
- fusion
---
The model code and documentation repository is at https://github.com/RichardScottOZ/comic-analysis
It uses transformer-based multimodal fusion of image and text to produce embeddings for querying comics by similarity or by text.
For more detail, see the repo above.
---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- vit
- roberta
license: mit
---
# ClosureLiteSimple (Version 1 - Comic Panel Encoder)
ClosureLiteSimple is the Version 1 precursor to the Stage 3 panel encoder within the [Comic Analysis Framework](https://github.com/RichardScottOZ/Comic-Analysis).
It is a multimodal neural network designed to fuse image crops, textual dialogue, and compositional metadata into a unified **384-dimensional** embedding per comic panel, and can also aggregate these panels into a single Page-level embedding using an attention mechanism.
*(Note: This model is considered deprecated in favor of the newer `comic-panel-encoder-v1` which utilizes SigLIP, ResNet, and an improved Adaptive Fusion Gate).*
## Model Architecture
The `ClosureLiteSimple` model consists of the `PanelAtomizerLite` and a `SimpleAttention` mechanism:
1. **Vision Encoder (`google/vit-base-patch16-224`):**
   - Extracts features from $224 \times 224$ panel image crops.
   - Outputs projected to $384$-d.
2. **Text Encoder (`roberta-base`):**
   - Encodes panel dialogue, narration, or OCR text.
   - Outputs projected to $384$-d.
3. **Compositional Encoder (MLP):**
   - Takes a 7-dimensional vector representing the bounding box geometry (e.g., aspect ratio, relative area, normalized center coordinates).
   - Projects through hidden layers to $384$-d.
4. **Gated Fusion (`GatedFusion`):**
   - Concatenates the three modality outputs and computes a learned softmax gate.
   - Outputs a weighted sum of the Vision, Text, and Composition features, resulting in the final $384$-d **Panel Embedding**.
5. **Page Aggregation (`SimpleAttention`):**
   - Uses multi-head attention to pool the variable number of Panel Embeddings on a single page into a unified $384$-d **Page Embedding**.
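The fusion and pooling steps (items 4 and 5) can be illustrated with a framework-free sketch. This is not the repository's implementation: the gate logits and attention scores are passed in as plain numbers (in the model they would come from learned layers), the pooling is single-head rather than multi-head, and the function names `softmax`, `gated_fusion`, and `masked_attention_pool` are hypothetical. The vectors are 4-d instead of 384-d to keep the arithmetic readable.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(vision, text, comp, gate_logits):
    """Weighted sum of three same-sized modality vectors.

    gate_logits stands in for the output of a learned layer applied to the
    concatenated modalities (an assumption; here the caller supplies it).
    """
    w_v, w_t, w_c = softmax(gate_logits)
    return [w_v * v + w_t * t + w_c * c for v, t, c in zip(vision, text, comp)]


def masked_attention_pool(panels, scores, mask):
    """Pool panel vectors into one page vector, ignoring padded panels.

    scores: one attention logit per panel (learned in a real model).
    mask:   True for real panels, False for padding. At least one entry
            must be True, otherwise the softmax is undefined.
    """
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    weights = softmax(masked)  # padded panels get weight 0
    dim = len(panels[0])
    return [sum(w * p[i] for w, p in zip(weights, panels)) for i in range(dim)]


# Toy 4-d modality features (one-hot so the gate weights are easy to read off).
vision = [1.0, 0.0, 0.0, 0.0]
text = [0.0, 1.0, 0.0, 0.0]
comp = [0.0, 0.0, 1.0, 0.0]

panel_a = gated_fusion(vision, text, comp, gate_logits=[0.0, 0.0, 0.0])  # equal weights
panel_b = gated_fusion(vision, text, comp, gate_logits=[2.0, 0.0, 0.0])  # vision-heavy
padding = [0.0, 0.0, 0.0, 0.0]

page = masked_attention_pool(
    [panel_a, panel_b, padding],
    scores=[0.0, 0.0, 0.0],
    mask=[True, True, False],
)
print(panel_a)
print(page)
```

Because the gate weights sum to 1, a modality can only be down-weighted relative to the others, which is one way to read the modality-dominance limitation described later in this card.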
## Usage
The codebase for this model resides in the `src/version1/` directory of the repository.
### Example: Loading and Inference
```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

# Requires cloning the GitHub repo (the module lives in src/version1/)
from closure_lite_simple_framework import ClosureLiteSimple

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = ClosureLiteSimple(d=384, num_heads=4, temperature=0.1).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/closure-lite-simple/resolve/main/best_model.pt",
    map_location=device
)
# Some checkpoints wrap the weights in a 'model_state_dict' key
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs (Example: A page with 2 panels)
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Dummy Image Crops (B=1 page, N=2 panels, C=3, H=224, W=224)
images = torch.stack([
    transform(Image.new('RGB', (224, 224))),
    transform(Image.new('RGB', (224, 224)))
]).unsqueeze(0).to(device)

# Dummy Text
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_enc = tokenizer(["Panel 1 text", "Panel 2 text"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attention_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Dummy Composition (B=1, N=2, F=7)
comp_feats = torch.zeros((1, 2, 7)).to(device)

# Valid Panel Mask (B=1, N=2)
panel_mask = torch.tensor([[True, True]]).to(device)

# 3. Generate Embeddings
with torch.no_grad():
    panel_embeddings, page_embedding = model(
        images, input_ids, attention_mask, comp_feats, panel_mask
    )

print(f"Panel Embeddings Shape: {panel_embeddings.shape}")  # (1, 2, 384)
print(f"Page Embedding Shape: {page_embedding.shape}")  # (1, 384)
```
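The example above passes zeros for the composition features. This card only names aspect ratio, relative area, and normalized center coordinates as examples of the 7 values, so the layout below is hypothetical and meant purely as an illustration; the actual feature order is whatever the repository's code defines. The helper name `composition_features` and the last three entries (normalized width, normalized height, reading-order position) are assumptions.

```python
def composition_features(box, page_w, page_h, index, n_panels):
    """Build a 7-value geometry vector for one panel.

    box is (x0, y0, x1, y1) in page pixels. The feature layout is an
    assumption for illustration, not the repository's definition.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [
        w / h,                         # aspect ratio
        (w * h) / (page_w * page_h),   # relative area
        (x0 + x1) / 2 / page_w,        # normalized center x
        (y0 + y1) / 2 / page_h,        # normalized center y
        w / page_w,                    # normalized width (assumed feature)
        h / page_h,                    # normalized height (assumed feature)
        index / max(n_panels - 1, 1),  # position in reading order (assumed feature)
    ]


feats = composition_features(
    (0, 0, 500, 400), page_w=1000, page_h=1600, index=0, n_panels=2
)
print(len(feats), feats)
```

A nested list of such vectors, one per panel per page, can be passed to `torch.tensor(...)` to obtain the `(B, N, 7)` tensor used in place of `comp_feats`.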
## Intended Use & Limitations
- **Intended Use:** Originally designed for exploring multimodal embedding spaces and building basic visual/textual retrieval prototypes (like CoMiX v1).
- **Limitations:**
  - **Modality Dominance:** Analysis of this model revealed that if one modality (e.g., text) was missing or uninformative during inference, the `GatedFusion` mechanism struggled to fall back gracefully to the visual features, often resulting in collapsed or non-discriminative embeddings for single-modality queries.
  - **Deprecated:** This architecture has been superseded by Stage 3 (`comic-panel-encoder-v1`), which utilizes independent modality projection and a masked Adaptive Fusion gate to solve the dominance issues.
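For the retrieval prototypes mentioned under Intended Use, a common approach is to rank stored embeddings by cosine similarity to a query embedding. The sketch below shows that ranking in plain Python with toy 3-d vectors. Whether the repository's tools use cosine similarity is an assumption, and `cosine`, `rank_by_similarity`, and the page ids are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_by_similarity(query, index):
    """index maps item id -> embedding; returns ids sorted best match first."""
    scored = [(cosine(query, emb), item_id) for item_id, emb in index.items()]
    return [item_id for _, item_id in sorted(scored, reverse=True)]


# Toy stand-ins for stored page embeddings.
index = {
    "page_001": [1.0, 0.0, 0.0],
    "page_002": [0.7, 0.7, 0.0],
    "page_003": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([0.9, 0.1, 0.0], index)
print(ranking)  # ['page_001', 'page_002', 'page_003']
```

If embeddings collapse toward a single point, as described under Modality Dominance, all similarity scores become nearly equal and a ranking like this stops being informative.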
## Citation
Please reference the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) when utilizing this architecture.