rasyosef/Amharic-Passage-Retrieval-Dataset-V2
This is a SPLADE Sparse Encoder model finetuned from rasyosef/roberta-base-amharic using the sentence-transformers library. It maps sentences and paragraphs to a 32000-dimensional sparse vector space and can be used for semantic search and sparse retrieval in Amharic.
This model was presented in the paper "The Multilingual Curse at the Retrieval Layer: Evidence from Amharic".
Official code repository: https://github.com/rasyosef/amharic-neural-ir
SparseEncoder(
(0): MLMTransformer({'max_seq_length': 510, 'do_lower_case': False, 'architecture': 'XLMRobertaForMaskedLM'})
(1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 32000})
)
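To make the two modules above more concrete, here is a minimal pure-Python sketch of what the pooling step does, assuming the standard SPLADE formulation: each masked-language-model logit is passed through ReLU and a log(1 + x) saturation, then max-pooled over the non-padding tokens. The function name `splade_pool` and the toy inputs are hypothetical, and the 4-entry vocabulary stands in for the real 32000 entries. It illustrates the idea and is not the library's implementation.

```python
import math

def splade_pool(token_logits, attention_mask):
    """Max-pool log(1 + relu(logit)) over non-padding tokens.

    token_logits: one list of vocabulary logits per token.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, logit in enumerate(logits):
            weight = math.log1p(max(logit, 0.0))  # ReLU, then log saturation
            if weight > pooled[j]:
                pooled[j] = weight  # keep the maximum over tokens
    return pooled

# Toy input: 3 tokens (the last one is padding) over a 4-entry vocabulary.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]

vector = splade_pool(toy_logits, toy_mask)
active = [j for j, w in enumerate(vector) if w > 0.0]
print(vector)
print(active)  # [0, 3]
```

Only vocabulary entries with a positive logit on at least one real token end up active, which is why the resulting vectors are sparse.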
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("rasyosef/splade-amharic-base")
# Run inference
sentences = [
'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል። በዚህ የተነሳም የኢትዮጵያ ቡናና ሻይ ባለሥልጣንን ጨምሮ የሚመላካታቸው ሁሉ ቡና ላኪዎችና አምራቾች ያከማቹትን ቡና በፍጥነት ወደ ዓለም ገበያ እንዲያወጡ ጥሪ እያቀረቡ ነው ።',
'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር። ከፑቲን ጋር ደግሞ ዢ ለሁለቱ አገራት ስልታዊም ሆነ ኢኮኖሚያዊ ጠቀሜታ ረጅም ጊዜ የዘለቀውን አጋርነትን ይበልጥ ማጠናከር ላይ ነበር ትኩረታቸው።',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 32000]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[45.2024, 19.3316, 0.0000],
# [19.3316, 48.7685, 8.5323],
# [ 0.0000, 8.5323, 63.2857]])
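The scores above appear to be dot products between sparse vectors, which matches the `dot_score` similarity named in the training loss. Because SPLADE weights are non-negative, a score of exactly 0.0000 (as between the first and third texts) means the two vectors share no active dimension. The sketch below illustrates this with hypothetical toy vectors stored as `{dimension: weight}` dictionaries; the helper `sparse_dot` is not part of the library.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum((weight * b[dim] for dim, weight in a.items() if dim in b), 0.0)

# Hypothetical toy vectors; the real ones have 32000 dimensions.
query = {10: 2.0, 42: 1.5}
related = {10: 1.0, 42: 2.0, 77: 0.5}
unrelated = {5: 3.0, 99: 1.0}

print(sparse_dot(query, related))    # 5.0
print(sparse_dot(query, unrelated))  # 0.0
```

Only the overlapping dimensions contribute, so scoring cost depends on the number of active dimensions rather than on the full vocabulary size.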
Evaluated with SparseInformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| dot_accuracy@1 | 0.6632 |
| dot_accuracy@3 | 0.832 |
| dot_accuracy@5 | 0.8711 |
| dot_accuracy@10 | 0.9063 |
| dot_precision@1 | 0.6632 |
| dot_precision@3 | 0.2773 |
| dot_precision@5 | 0.1742 |
| dot_precision@10 | 0.0906 |
| dot_recall@1 | 0.6632 |
| dot_recall@3 | 0.832 |
| dot_recall@5 | 0.8711 |
| dot_recall@10 | 0.9063 |
| dot_ndcg@10 | 0.7917 |
| dot_mrr@10 | 0.7541 |
| dot_map@100 | 0.7571 |
| query_active_dims | 69.6236 |
| query_sparsity_ratio | 0.9978 |
| corpus_active_dims | 153.6359 |
| corpus_sparsity_ratio | 0.9952 |
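Several rows in the table are related to each other, and a short check makes the relationships visible. The sketch below assumes that the sparsity ratio is one minus the mean number of active dimensions divided by the 32000-dimensional vocabulary, and that each query has exactly one relevant passage (suggested by recall@k equalling accuracy@k), in which case precision@k is accuracy@k divided by k. Both assumptions are inferred from the reported numbers, not stated in the source.

```python
VOCAB_SIZE = 32000  # embedding dimension from the model description

def sparsity_ratio(active_dims, dim=VOCAB_SIZE):
    """Fraction of dimensions that are zero on average."""
    return 1.0 - active_dims / dim

print(round(sparsity_ratio(69.6236), 4))   # 0.9978 (queries)
print(round(sparsity_ratio(153.6359), 4))  # 0.9952 (corpus)

# With a single relevant passage per query, precision@k = accuracy@k / k.
accuracy_at_k = {1: 0.6632, 3: 0.832, 5: 0.8711, 10: 0.9063}
precision_at_k = {k: round(acc / k, 4) for k, acc in accuracy_at_k.items()}
print(precision_at_k)  # {1: 0.6632, 3: 0.2773, 5: 0.1742, 10: 0.0906}
```

The recomputed values agree with the table, so on average a query activates about 70 of the 32000 dimensions and a passage about 154.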
Training columns: anchor, positive, and negative
Loss: SpladeLoss with these parameters:
{
"loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
"document_regularizer_weight": 0.001,
"query_regularizer_weight": 0.002
}
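As a rough illustration of how these parameters combine, the sketch below assumes the loss is a ranking term plus two weighted sparsity penalties, and that the penalty is the FLOPS regularizer (the sum over dimensions of the squared mean activation). The ranking term is a simplified in-batch cross-entropy over dot-product scores. All function names and the toy batch are hypothetical; the library's actual implementation may differ in detail.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def flops(vectors):
    """FLOPS regularizer: sum over dimensions of the squared mean activation."""
    n = len(vectors)
    dim = len(vectors[0])
    return sum((sum(v[j] for v in vectors) / n) ** 2 for j in range(dim))

def ranking_loss(anchors, positives, negatives, scale=1.0):
    """Cross-entropy in which anchor i should score positive i highest
    among all positives and negatives in the batch."""
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * dot(anchor, c) for c in candidates]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(anchors)

def splade_loss(anchors, positives, negatives,
                document_regularizer_weight=0.001,
                query_regularizer_weight=0.002):
    base = ranking_loss(anchors, positives, negatives)
    doc_reg = flops(positives + negatives)  # penalize dense passages
    query_reg = flops(anchors)              # penalize dense queries
    return (base
            + document_regularizer_weight * doc_reg
            + query_regularizer_weight * query_reg)

# Hypothetical toy batch of two (anchor, positive, negative) triplets.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
positives = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
negatives = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

total = splade_loss(anchors, positives, negatives)
print(total)
```

The query weight (0.002) is twice the document weight (0.001), which pushes queries to be sparser than passages. That is consistent with the evaluation table, where queries average fewer active dimensions than the corpus.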
eval_strategy: epoch
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
gradient_accumulation_steps: 2
learning_rate: 6e-05
num_train_epochs: 6
lr_scheduler_type: cosine
warmup_ratio: 0.025
fp16: True
optim: adamw_torch_fused
batch_sampler: no_duplicates

@inproceedings{alemneh2026amharicir,
title = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
author = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
year = {2026},
}