🌾 OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome
Model Introduction
OryzaG3 is a single-species (rice) DNA language model with 700M parameters, pretrained on 149 high-quality rice pangenomes. The model adopts a non-overlapping 3-mer tokenization strategy and uses Causal Language Modeling (CLM) as the pretraining objective. It comes in two context-length versions:
- OryzaG3-8k
- OryzaG3-32k
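To make the tokenization and training objective concrete, the following is a minimal sketch of non-overlapping 3-mer tokenization and of how causal language modeling pairs each token with its successor. It is not the released OryzaG3 tokenizer: the vocabulary layout, the `<unk>` token, the handling of a trailing partial 3-mer, and the function names `tokenize_3mer`, `encode`, and `clm_pairs` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical vocabulary: every 3-mer over A/C/G/T, plus an unknown token.
# The real OryzaG3 vocabulary and special tokens are not described in this card.
KMER = 3
VOCAB = {"<unk>": 0}
for i, kmer in enumerate(product("ACGT", repeat=KMER), start=1):
    VOCAB["".join(kmer)] = i


def tokenize_3mer(seq: str) -> list[str]:
    """Split a DNA sequence into consecutive, non-overlapping 3-mers."""
    seq = seq.upper()
    # Assumption: a trailing partial 3-mer (1 or 2 bases) is dropped.
    usable = len(seq) - len(seq) % KMER
    return [seq[i:i + KMER] for i in range(0, usable, KMER)]


def encode(seq: str) -> list[int]:
    """Map each 3-mer to an integer id; 3-mers containing N etc. become <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokenize_3mer(seq)]


def clm_pairs(ids: list[int]) -> tuple[list[int], list[int]]:
    """Causal LM setup: the target at position t is the token at position t + 1."""
    return ids[:-1], ids[1:]


tokens = tokenize_3mer("ATGCGTACGT")   # ['ATG', 'CGT', 'ACG']
inputs, targets = clm_pairs(encode("ATGCGTACGT"))
```

Because each token covers 3 bp without overlap, a sequence of n bases yields about n/3 tokens. If the 8k and 32k labels count tokens (an assumption; the card does not say), the two versions would span roughly 24.6 kb and 98.3 kb of sequence.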
Performance Comparison: Plants Genomic Benchmark-polyA
Indica Group 🌾
| Model | AUC | AP | F1 | MCC | Accuracy | Samples/s |
|---|---|---|---|---|---|---|
| OryzaG3-8k (700M) | 0.970 | 0.942 | 0.887 | 0.830 | 0.924 | 400.41 |
| OryzaG3-32k (700M) | 0.970 | 0.941 | 0.890 | 0.835 | 0.926 | 399.80 |
| AgroNT (1B) | 0.969 | 0.937 | 0.879 | 0.818 | 0.919 | 95.47 |
| Botanic0-L (991M) | 0.966 | 0.931 | 0.873 | 0.809 | 0.914 | 92.32 |
Results on the Japonica Group 🌾 are similarly competitive.
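For readers less familiar with the threshold-based columns in the table, the sketch below computes Accuracy, F1, and MCC from binary labels and hard predictions using their standard definitions. It is an illustration only, not the evaluation code behind the reported numbers; the function name `binary_metrics` and the toy labels are hypothetical. AUC and AP are omitted because they require predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, F1, and MCC for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy example with 3 TP, 3 TN, 1 FP, 1 FN.
toy = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy when the classes are imbalanced.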
Main Contributions
- Construction of OryzaG3, a DNA language model trained on large-scale, high-quality rice pangenomes.
- Demonstration that OryzaG3 matches or exceeds mainstream multi-species models on rice-specific tasks while delivering superior inference efficiency.
- Systematic analysis revealing the critical impact of pretraining sufficiency (via checkpoints) on downstream performance.
- Comprehensive evaluation of the speed and memory benefits of FlashAttention-2 and Gradient Checkpointing in long-context training.
- Provision of a reproducible technical framework for developing lightweight, crop-specific genomic foundation models.
Note: OryzaG3 was initialized from the Gemma3-1B architecture (config) only; the original pretrained weights were not loaded.
