OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
Abstract
OceanPile is a large-scale multimodal corpus for ocean science, combining diverse data types with a knowledge graph-guided instruction dataset to advance marine AI applications.
The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.
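The abstract's knowledge graph-guided instruction synthesis can be illustrated with a minimal sketch. To be clear, this is not the authors' pipeline: the toy concept hierarchy, the `leaf_paths` traversal, and the prompt template below are all hypothetical stand-ins for the paper's hierarchical Ocean Concept Knowledge Graph and its synthesizer.

```python
# Hypothetical sketch of knowledge-graph-guided instruction synthesis.
# The concept hierarchy and template are illustrative stand-ins, not
# the actual OceanPile pipeline.

# A toy hierarchical concept graph: parent concept -> child concepts.
OCEAN_KG = {
    "ocean science": ["acoustics", "biology"],
    "acoustics": ["sonar imaging"],
    "biology": ["coral reefs", "plankton"],
}

def leaf_paths(graph, root):
    """Enumerate root-to-leaf concept paths by depth-first traversal."""
    children = graph.get(root, [])
    if not children:
        return [[root]]
    paths = []
    for child in children:
        for path in leaf_paths(graph, child):
            paths.append([root] + path)
    return paths

def synthesize_instructions(graph, root, template):
    """Generate one instruction per leaf concept, conditioned on its
    ancestor path so fine-grained questions stay anchored to high-level
    domain context (the role the paper attributes to the KG guidance)."""
    return [
        template.format(concept=path[-1], context=" > ".join(path[:-1]))
        for path in leaf_paths(graph, root)
    ]

instructions = synthesize_instructions(
    OCEAN_KG, "ocean science",
    "Within {context}, explain what {concept} reveals about the ocean.",
)
for ins in instructions:
    print(ins)
```

In a real pipeline the template slot would be filled by an LLM generator and paired with corpus samples per concept; the point of the sketch is only that conditioning on the ancestor path keeps each synthesized instruction tied to the taxonomy.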
Community
OceanPile is a large multimodal ocean dataset that brings together sonar, images, and scientific knowledge to help AI better understand our ocean.
The part that hooked me most is the Ocean Concept Knowledge Graph guiding the synthesis of OceanInstruction. I'd love to see an ablation removing the knowledge-graph guidance and letting the model generate instructions purely from the data, to quantify how much the graph actually contributes versus noisy signals. My worry is cross-modal alignment between sonar time series and underwater imagery: if the graph encodes high-level concepts, how do you ensure fine-grained semantics transfer across such heterogeneous modalities under weak labeling? Btw, the arxivlens breakdown helped me parse the method details and gives a nice walk-through of the pipeline: https://arxivlens.com/PaperView/Details/oceanpile-a-large-scale-multimodal-ocean-corpus-for-foundation-models-1877-a2c738c4. Overall, a promising direction, but the real test will be how this scales to truly noisy field data and whether instruction tuning improves downstream generalization beyond these benchmarks.
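One cheap way to frame the ablation proposed above is to compare how evenly each synthesis strategy covers the concept taxonomy. The sketch below is entirely hypothetical: `kg_guided_sample`, `unguided_sample`, the concept list, and the coverage metric are illustrative toys, not anything from the paper.

```python
# Toy ablation sketch: does KG guidance improve concept coverage of the
# synthesized instructions? All names and data here are hypothetical.
CONCEPTS = ["sonar imaging", "coral reefs", "plankton", "ocean currents"]

def kg_guided_sample(n):
    """KG-guided: cycle through the taxonomy, touching every concept."""
    return [CONCEPTS[i % len(CONCEPTS)] for i in range(n)]

def unguided_sample(n):
    """Unguided: mimic a data-driven generator biased toward whatever
    concept dominates the raw corpus (here, always sonar imaging)."""
    return ["sonar imaging"] * n

def coverage(samples):
    """Fraction of the taxonomy hit at least once."""
    return len(set(samples)) / len(CONCEPTS)

print(coverage(kg_guided_sample(8)))   # full coverage: 1.0
print(coverage(unguided_sample(8)))    # collapses to one concept: 0.25
```

A real ablation would of course measure downstream task performance rather than a coverage proxy, but a coverage-style metric is a quick first check on whether graph guidance is doing more than reweighting the dominant modality.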
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing (2026)
- LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration (2026)
- All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding (2026)
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026)
- AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models (2026)
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence (2026)
- HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing (2026)
Get this paper in your agent:
hf papers read 2605.00877
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash