---
title: DOAB Metadata Extraction Evaluation
emoji: πŸ“š
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# VLM vs Text: Extracting Metadata from Book Covers

**Can Vision-Language Models extract metadata from book covers better than text extraction?**

## TL;DR

**Yes, significantly.** On title extraction, VLMs average ~97% accuracy versus ~75% for the text pipeline; on full metadata extraction, ~80% versus ~71%.

## The Task

Extracting metadata from digitized book covers is a common challenge for galleries, libraries, archives, and museums (the GLAM sector) and for digital humanities projects. We compared two approaches:

1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
2. **Text Extraction**: Extract text from the image first, then send it to an LLM (both approaches are sketched below)
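
For intuition, here is a minimal sketch of the two pipelines against an OpenAI-compatible endpoint. The helper names, prompt, and model identifiers are illustrative assumptions, not the actual evaluation code (which is built on Inspect AI; see Technical Details below).

```python
import base64
from openai import OpenAI

# Assumes an OpenAI-compatible endpoint is configured; model identifiers are illustrative.
client = OpenAI()

PROMPT = "Extract the book title from this cover. Return only the title."

def extract_via_vlm(image_path: str, model: str = "Qwen/Qwen3-VL-8B-Instruct") -> str:
    """Approach 1: send the cover image directly to a vision-language model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def extract_via_text(ocr_text: str, model: str = "Qwen/Qwen3-4B-Instruct-2507") -> str:
    """Approach 2: OCR the cover first, then send only the plain text to an LLM."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\nOCR text:\n{ocr_text}"}],
    )
    return response.choices[0].message.content
```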

## Results

### Title Extraction (simpler task)

| Approach | Average Accuracy |
|----------|-----------------|
| **VLM** | **97%** |
| Text | 75% |

### Full Metadata (title, subtitle, publisher, year, ISBN)

| Approach | Average Accuracy |
|----------|-----------------|
| **VLM** | **80%** |
| Text | 71% |

VLMs consistently outperform text extraction across both tasks.

### Why VLMs Win

Book covers are **visually structured**:
- Titles appear in specific locations (usually top/center)
- Typography indicates importance (larger = more likely title)
- Layout provides context that pure text loses

Text extraction flattens this structure, losing valuable spatial information.

## Models Evaluated

**VLM Models**:
- Qwen3-VL-8B-Instruct (8B params)
- Qwen3-VL-30B-A3B-Thinking (30B params)
- GLM-4.6V-Flash (9B params)

**Text Models**:
- gpt-oss-20b (20B params)
- Qwen3-4B-Instruct-2507 (4B params)
- Olmo-3-7B-Instruct (7B params)

**Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.

## Interactive Features

- **Task selector**: Switch between Title Extraction and Full Metadata results
- **Model size vs accuracy plot**: Interactive scatter plot showing efficiency
- **Leaderboard**: Filter by VLM or Text approach

## Technical Details

- **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
- **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
- **Scoring**:
  - Title: Flexible matching (handles case, subtitles, punctuation; see the sketch below this list)
  - Full Metadata: LLM-as-judge with partial credit
- **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
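
For illustration, here is a rough sketch of the kind of flexible title matching described above; the actual scorer's normalization rules may differ.

```python
import re
import unicodedata

def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, drop subtitles, collapse whitespace."""
    title = title.split(":")[0]                     # ignore subtitles ("Title: Subtitle")
    title = unicodedata.normalize("NFKD", title)    # decompose accented characters
    title = "".join(c for c in title if not unicodedata.combining(c))
    title = re.sub(r"[^\w\s]", " ", title.lower())  # drop punctuation, lowercase
    return re.sub(r"\s+", " ", title).strip()       # collapse whitespace

def titles_match(predicted: str, expected: str) -> bool:
    """Accept a prediction that matches the reference after normalization."""
    return normalize_title(predicted) == normalize_title(expected)

assert titles_match("The Digital Archive: A History", "the digital archive!")
```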

## Replicate This

The evaluation logs are stored on the Hugging Face Hub and can be loaded directly:

```python
from inspect_ai.analysis import evals_df

df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
```
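
From there, standard pandas operations apply for comparing models and tasks. The exact columns come from the Inspect log schema, so it's worth inspecting them first:

```python
# The returned object is a regular pandas DataFrame; column names depend on
# the Inspect log schema, so check them before filtering or grouping.
print(df.columns.tolist())
print(df.head())
```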

## Why This Matters for GLAM

Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:

- **Catalog enhancement**: Fill gaps in existing records
- **Discovery**: Make collections more searchable
- **Quality assessment**: Validate existing metadata

This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.

---

*Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*