---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
- FacebookAI/roberta-base
tags:
- comics
- composition
- comic
- comic-analysis
- page
- fusion
- vision
- text
- multimodal
- contrastive-learning
- vit
- roberta
---

The model code and documentation are at https://github.com/RichardScottOZ/comic-analysis

This model uses transformer-based multimodal fusion of image and text to produce embeddings for querying comics by similarity or by text. For more detail, see the repository above.

# ClosureLiteSimple (Version 1 - Comic Panel Encoder)

ClosureLiteSimple is the Version 1 precursor to the Stage 3 panel encoder within the [Comic Analysis Framework](https://github.com/RichardScottOZ/Comic-Analysis). It is a multimodal neural network that fuses image crops, textual dialogue, and compositional metadata into a unified **384-dimensional** embedding per comic panel. It can also aggregate the panel embeddings of a page into a single page-level embedding using an attention mechanism.

*(Note: This model is considered deprecated in favor of the newer `comic-panel-encoder-v1`, which utilizes SigLIP, ResNet, and an improved Adaptive Fusion Gate.)*

## Model Architecture

The `ClosureLiteSimple` model consists of the `PanelAtomizerLite` and a `SimpleAttention` mechanism:

1. **Vision Encoder (`google/vit-base-patch16-224`):**
   - Extracts features from $224 \times 224$ panel image crops.
   - Outputs projected to $384$-d.
2. **Text Encoder (`roberta-base`):**
   - Encodes panel dialogue, narration, or OCR text.
   - Outputs projected to $384$-d.
3. **Compositional Encoder (MLP):**
   - Takes a 7-dimensional vector representing the bounding box geometry (e.g., aspect ratio, relative area, normalized center coordinates).
   - Projects through hidden layers to $384$-d.
4. **Gated Fusion (`GatedFusion`):**
   - Concatenates the three modality outputs and computes a learned softmax gate.
   - Outputs a weighted sum of the Vision, Text, and Composition features, resulting in the final $384$-d **Panel Embedding**.
5. **Page Aggregation (`SimpleAttention`):**
   - Uses multi-head attention to pool the variable number of Panel Embeddings on a single page into a unified $384$-d **Page Embedding**.

## Usage

The codebase for this model resides in the `src/version1/` directory of the repository.

### Example: Loading and Inference

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

# Requires cloning the GitHub repo
from closure_lite_simple_framework import ClosureLiteSimple

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = ClosureLiteSimple(d=384, num_heads=4, temperature=0.1).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/closure-lite-simple/resolve/main/best_model.pt",
    map_location=device
)
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs (Example: A page with 2 panels)
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Dummy Image Crops (B=1 page, N=2 panels, C=3, H=224, W=224)
images = torch.stack([
    transform(Image.new('RGB', (224, 224))),
    transform(Image.new('RGB', (224, 224)))
]).unsqueeze(0).to(device)

# Dummy Text
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_enc = tokenizer(["Panel 1 text", "Panel 2 text"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attention_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Dummy Composition (B=1, N=2, F=7)
comp_feats = torch.zeros((1, 2, 7)).to(device)

# Valid Panel Mask (B=1, N=2)
panel_mask = torch.tensor([[True, True]]).to(device)

# 3. Generate Embeddings
with torch.no_grad():
    panel_embeddings, page_embedding = model(
        images, input_ids, attention_mask, comp_feats, panel_mask
    )

print(f"Panel Embeddings Shape: {panel_embeddings.shape}")  # (1, 2, 384)
print(f"Page Embedding Shape: {page_embedding.shape}")  # (1, 384)
```

## Intended Use & Limitations

- **Intended Use:** Originally designed for exploring multimodal embedding spaces and building basic visual/textual retrieval prototypes (like CoMiX v1).
- **Limitations:**
  - **Modality Dominance:** Analysis of this model revealed that if one modality (e.g., text) was missing or uninformative during inference, the `GatedFusion` mechanism struggled to fall back gracefully to the visual features, often resulting in collapsed or non-discriminative embeddings for single-modality queries.
  - **Deprecated:** This architecture has been superseded by Stage 3 (`comic-panel-encoder-v1`), which utilizes independent modality projection and a masked Adaptive Fusion gate to solve the dominance issues.

## Citation

Please reference the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) when utilizing this architecture.
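
## Illustrative Sketch: Gated Fusion

The Gated Fusion step listed under Model Architecture (concatenate the three modality outputs, compute a softmax gate, return a weighted sum) can be sketched roughly as follows. This is a minimal illustration, not the repository's `GatedFusion` implementation: the class name `GatedFusionSketch`, the single linear gate layer, and the choice of one scalar weight per modality are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class GatedFusionSketch(nn.Module):
    """Hypothetical softmax-gated fusion of three modality vectors."""

    def __init__(self, d: int = 384):
        super().__init__()
        # Assumption: one scalar gate logit per modality,
        # computed from the concatenated features.
        self.gate = nn.Linear(3 * d, 3)

    def forward(self, vision, text, comp):
        # Each input has shape (..., d).
        concat = torch.cat([vision, text, comp], dim=-1)       # (..., 3d)
        weights = torch.softmax(self.gate(concat), dim=-1)     # (..., 3), sums to 1
        stacked = torch.stack([vision, text, comp], dim=-2)    # (..., 3, d)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=-2)  # (..., d)
        return fused, weights


torch.manual_seed(0)
fusion = GatedFusionSketch(d=384)
# Random stand-ins for projected features (B=1 page, N=2 panels, d=384)
v, t, c = (torch.randn(1, 2, 384) for _ in range(3))
fused, weights = fusion(v, t, c)
print(fused.shape)  # torch.Size([1, 2, 384])
```

Returning the gate weights alongside the fused vector makes it possible to inspect how much each modality contributes per panel, which may help when investigating the modality dominance behavior noted under Limitations.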