OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
Abstract
OceanPile is a large-scale multimodal corpus for ocean science, combining diverse data types with a knowledge graph-guided instruction dataset to advance marine AI applications.
The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.
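The abstract's knowledge graph-guided instruction synthesis can be illustrated with a minimal sketch. To be clear, this is not the authors' pipeline: the toy concept hierarchy, the `leaf_paths` traversal, and the prompt template below are all hypothetical stand-ins for the paper's hierarchical Ocean Concept Knowledge Graph and its synthesizer.

```python
# Hypothetical sketch of knowledge-graph-guided instruction synthesis.
# The concept hierarchy and template are illustrative stand-ins, not
# the actual OceanPile pipeline.

# A toy hierarchical concept graph: parent concept -> child concepts.
OCEAN_KG = {
    "ocean science": ["acoustics", "biology"],
    "acoustics": ["sonar imaging"],
    "biology": ["coral reefs", "plankton"],
}

def leaf_paths(graph, root):
    """Enumerate root-to-leaf concept paths by depth-first traversal."""
    children = graph.get(root, [])
    if not children:
        return [[root]]
    paths = []
    for child in children:
        for path in leaf_paths(graph, child):
            paths.append([root] + path)
    return paths

def synthesize_instructions(graph, root, template):
    """Generate one instruction per leaf concept, conditioned on its
    ancestor path so fine-grained questions stay anchored to high-level
    domain context (the role the paper attributes to the KG guidance)."""
    return [
        template.format(concept=path[-1], context=" > ".join(path[:-1]))
        for path in leaf_paths(graph, root)
    ]

instructions = synthesize_instructions(
    OCEAN_KG, "ocean science",
    "Within {context}, explain what {concept} reveals about the ocean.",
)
for ins in instructions:
    print(ins)
```

In a real pipeline the template slot would be filled by an LLM generator and paired with corpus samples per concept; the point of the sketch is only that conditioning on the ancestor path keeps each synthesized instruction tied to the taxonomy.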
Community
OceanPile is a large multimodal ocean dataset that brings together sonar, images, and scientific knowledge to help AI better understand our ocean.
The part that hooked me most is the Ocean Concept Knowledge Graph guiding the synthesis of OceanInstruction. I'd love to see an ablation removing the knowledge-graph guidance and letting the model generate instructions purely from the data, to quantify how much the graph actually contributes versus noisy signals. My worry is cross-modal alignment between sonar time series and underwater imagery: if the graph encodes high-level concepts, how do you ensure fine-grained semantics transfer across such heterogeneous modalities under weak labeling? Btw, the arxivlens breakdown helped me parse the method details and gives a nice walk-through of the pipeline: https://arxivlens.com/PaperView/Details/oceanpile-a-large-scale-multimodal-ocean-corpus-for-foundation-models-1877-a2c738c4. Overall, a promising direction, but the real test will be how this scales to truly noisy field data and whether instruction tuning improves downstream generalization beyond these benchmarks.
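One cheap way to frame the ablation proposed above is to compare how evenly each synthesis strategy covers the concept taxonomy. The sketch below is entirely hypothetical: `kg_guided_sample`, `unguided_sample`, the concept list, and the coverage metric are illustrative toys, not anything from the paper.

```python
# Toy ablation sketch: does KG guidance improve concept coverage of the
# synthesized instructions? All names and data here are hypothetical.
CONCEPTS = ["sonar imaging", "coral reefs", "plankton", "ocean currents"]

def kg_guided_sample(n):
    """KG-guided: cycle through the taxonomy, touching every concept."""
    return [CONCEPTS[i % len(CONCEPTS)] for i in range(n)]

def unguided_sample(n):
    """Unguided: mimic a data-driven generator biased toward whatever
    concept dominates the raw corpus (here, always sonar imaging)."""
    return ["sonar imaging"] * n

def coverage(samples):
    """Fraction of the taxonomy hit at least once."""
    return len(set(samples)) / len(CONCEPTS)

print(coverage(kg_guided_sample(8)))   # full coverage: 1.0
print(coverage(unguided_sample(8)))    # collapses to one concept: 0.25
```

A real ablation would of course measure downstream task performance rather than a coverage proxy, but a coverage-style metric is a quick first check on whether graph guidance is doing more than reweighting the dominant modality.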
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing (2026)
- LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration (2026)
- All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding (2026)
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026)
- AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models (2026)
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence (2026)
- HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing (2026)
Get this paper in your agent:
hf papers read 2605.00877
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash