🌾 OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome

Model Introduction

OryzaG3 is a single-species (rice) DNA language model with 700M parameters, pretrained on 149 high-quality rice pangenomes. The model adopts a non-overlapping 3-mer tokenization strategy and uses Causal Language Modeling (CLM) as the pretraining objective. It comes in two context-length versions:

  • OryzaG3-8k
  • OryzaG3-32k ✅
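
As a rough illustration of the tokenization and objective described above, the following minimal Python sketch splits a sequence into non-overlapping 3-mers and builds next-token (CLM) training pairs. The function names `kmer_tokenize` and `next_token_pairs` are hypothetical, not part of the OryzaG3 tokenizer. How the real tokenizer treats a trailing remainder shorter than three bases, or ambiguous bases such as N, is not stated here; the sketch simply keeps the remainder as a shorter final token.

```python
def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split a DNA string into consecutive, non-overlapping k-mers.

    Assumption: a trailing remainder shorter than k is kept as its own token.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def next_token_pairs(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair each token with the token that follows it (the CLM target)."""
    return list(zip(tokens[:-1], tokens[1:]))


tokens = kmer_tokenize("ATGGCCTTAGC")
pairs = next_token_pairs(tokens)
print(tokens)
print(pairs)
```

Because the 3-mers do not overlap, a sequence of n bases yields about n/3 tokens, so a fixed token budget covers roughly three times as many base pairs as single-nucleotide tokenization would.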


Performance Comparison: Plants Genomic Benchmark-polyA

Indica Group 🌾

| Model | AUC | AP | F1 | MCC | Accuracy | Samples/s |
|---|---|---|---|---|---|---|
| OryzaG3-8k (700M) | 0.970 | 0.942 | 0.887 | 0.830 | 0.924 | 400.41 |
| OryzaG3-32k (700M) | 0.970 | 0.941 | 0.890 | 0.835 | 0.926 | 399.80 |
| AgroNT (1B) | 0.969 | 0.937 | 0.879 | 0.818 | 0.919 | 95.47 |
| Botanic0-L (991M) | 0.966 | 0.931 | 0.873 | 0.809 | 0.914 | 92.32 |
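
To put the Samples/s column in perspective, the short sketch below computes throughput ratios from the figures in the Indica table. The helper name `speedup` is hypothetical. The table does not state the hardware, batch size, or sequence length used, so the ratios are only meaningful if all four models were measured under the same conditions.

```python
# Samples/s values copied from the Indica table above.
throughput = {
    "OryzaG3-8k (700M)": 400.41,
    "OryzaG3-32k (700M)": 399.80,
    "AgroNT (1B)": 95.47,
    "Botanic0-L (991M)": 92.32,
}


def speedup(model: str, baseline: str) -> float:
    """Ratio of samples/s between two models listed in the table."""
    return throughput[model] / throughput[baseline]


for baseline in ("AgroNT (1B)", "Botanic0-L (991M)"):
    ratio = speedup("OryzaG3-32k (700M)", baseline)
    print(f"OryzaG3-32k vs {baseline}: {ratio:.2f}x")
```

On these numbers, OryzaG3-32k processes roughly 4.2 times as many samples per second as AgroNT and roughly 4.3 times as many as Botanic0-L.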

Japonica Group 🌾 shows similarly competitive performance.
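
For readers less familiar with the threshold-based columns (F1, MCC, Accuracy), here is a minimal sketch of how they follow from binary confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and chosen only for illustration; they are not taken from the benchmark. AUC and AP are computed from ranked prediction scores rather than from a single confusion matrix, so they are not covered here.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """F1, MCC and accuracy from binary confusion-matrix counts.

    Degenerate cases (an empty row or column, giving a zero denominator)
    are not handled in this sketch.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den
    return {"F1": f1, "MCC": mcc, "Accuracy": accuracy}


# Hypothetical counts for illustration only; not benchmark data.
metrics = classification_metrics(tp=80, fp=10, tn=100, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which makes it a useful complement to accuracy when the positive and negative classes are imbalanced.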

Main Contributions

  1. Construction of OryzaG3, a DNA language model trained on large-scale, high-quality rice pangenomes.
  2. Demonstration that OryzaG3 matches or exceeds mainstream multi-species models on rice-specific tasks while delivering superior inference efficiency.
  3. Systematic analysis revealing the critical impact of pretraining sufficiency (via checkpoints) on downstream performance.
  4. Comprehensive evaluation of the speed and memory benefits of FlashAttention-2 and Gradient Checkpointing in long-context training.
  5. Provision of a reproducible technical framework for developing lightweight, crop-specific genomic foundation models.
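
Contribution 4 concerns long-context training. One way to see why a memory-efficient attention kernel matters at these lengths: standard attention materializes a full n × n score matrix per head, whereas FlashAttention-2 computes attention in tiles without storing that matrix. The sketch below does the arithmetic under labeled assumptions: "8k" and "32k" are read as 8,192 and 32,768 tokens, scores are stored in BF16 (2 bytes each), and the figure is per head, per sequence, per layer. The helper name `score_matrix_bytes` is hypothetical. This is back-of-the-envelope reasoning, not a measurement from the OryzaG3 experiments.

```python
BYTES_PER_BF16 = 2  # BF16 stores each value in 2 bytes


def score_matrix_bytes(seq_len: int, bytes_per_elem: int = BYTES_PER_BF16) -> int:
    """Bytes needed to hold one full seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_elem


# Assumption: the context labels refer to token counts of 8,192 and 32,768.
for name, n in {"8k": 8192, "32k": 32768}.items():
    mib = score_matrix_bytes(n) / 2**20
    print(f"{name}: {mib:.0f} MiB per head, per sequence")
```

Under these assumptions the score matrix grows from 128 MiB to 2,048 MiB per head when the context goes from 8k to 32k, a 16-fold increase for a 4-fold longer sequence. Gradient checkpointing addresses a different cost: it discards intermediate activations during the forward pass and recomputes them during the backward pass, trading extra compute for lower memory.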

Note: The OryzaG3 model was initialized using only the Gemma3-1B architecture (config), without loading the original pretrained weights.

Model size: 0.7B params (Safetensors). Tensor type: BF16.