SPLADE-Amharic-Base

This is a SPLADE sparse encoder model fine-tuned from rasyosef/roberta-base-amharic using the sentence-transformers library. It maps sentences and paragraphs to a 32,000-dimensional sparse vector space and can be used for semantic search and sparse retrieval in Amharic.

This model was presented in the paper "The Multilingual Curse at the Retrieval Layer: Evidence from Amharic".

Official code repository: https://github.com/rasyosef/amharic-neural-ir

Model Details

Model Description

Model Type: SPLADE Sparse Encoder
Base model: rasyosef/roberta-base-amharic
Maximum Sequence Length: 510 tokens
Output Dimensionality: 32000 dimensions
Similarity Function: Dot Product
Language: Amharic (am)
License: MIT

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Sparse Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sparse Encoders on Hugging Face

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 510, 'do_lower_case': False, 'architecture': 'XLMRobertaForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 32000})
)
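
To make the two modules above more concrete, here is a minimal pure-Python sketch of what the pooling step does conceptually. It assumes the standard SPLADE formulation, where each vocabulary dimension receives the maximum over tokens of log(1 + ReLU(logit)). The function name `splade_pool` and the toy logits are illustrative only; the library's own implementation operates on batched tensors and also accounts for padding.

```python
import math

def splade_pool(token_logits):
    """Max-pool log(1 + relu(logit)) over the token axis.

    token_logits: one row per input token, each row holding one MLM logit
    per vocabulary entry (32000 for this model; 4 here for brevity).
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row in token_logits:
        for j, logit in enumerate(row):
            weight = math.log1p(max(logit, 0.0))  # ReLU, then log saturation
            if weight > pooled[j]:                 # 'max' pooling strategy
                pooled[j] = weight
    return pooled

# Toy input: 3 tokens, 4-entry vocabulary (values are made up).
toy_logits = [
    [-1.0, 2.0, 0.0, -3.0],
    [0.5, -0.2, 0.0, -1.0],
    [-2.0, 1.0, 0.0, -0.5],
]
sparse_vector = splade_pool(toy_logits)
active_dims = sum(1 for w in sparse_vector if w > 0)
print(sparse_vector, active_dims)
```

Dimensions whose logits are never positive for any token stay exactly zero, which is what makes the output sparse.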

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("rasyosef/splade-amharic-base")
# Run inference
sentences = [
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል። በዚህ የተነሳም የኢትዮጵያ ቡናና ሻይ ባለሥልጣንን ጨምሮ የሚመላካታቸው ሁሉ ቡና ላኪዎችና አምራቾች ያከማቹትን ቡና በፍጥነት ወደ ዓለም ገበያ እንዲያወጡ ጥሪ እያቀረቡ ነው ።',
    'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር። ከፑቲን ጋር ደግሞ ዢ ለሁለቱ አገራት ስልታዊም ሆነ ኢኮኖሚያዊ ጠቀሜታ ረጅም ጊዜ የዘለቀውን አጋርነትን ይበልጥ ማጠናከር ላይ ነበር ትኩረታቸው።',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 32000]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[45.2024, 19.3316,  0.0000],
#         [19.3316, 48.7685,  8.5323],
#         [ 0.0000,  8.5323, 63.2857]])
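
Because the embeddings are sparse and non-negative, a dot-product score of exactly 0.0000 (as between the first and third sentences above) means the two vectors share no active dimensions. The following pure-Python sketch illustrates how such vectors could be scored through an inverted index. The helper names, dimension indices, and weights are hypothetical and are not part of the sentence-transformers API.

```python
from collections import defaultdict

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[dim] for dim, w in a.items() if dim in b)

def build_inverted_index(docs):
    """Map each active dimension to the (doc_id, weight) pairs that use it."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for dim, weight in vec.items():
            index[dim].append((doc_id, weight))
    return index

def search(query, index, top_k=2):
    """Score only the documents that share an active dimension with the query."""
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, d_weight in index.get(dim, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Made-up sparse vectors; real ones have 32000 possible dimensions.
docs = {
    "coffee": {3: 1.5, 7: 0.5, 9: 2.0},
    "summit": {1: 1.0, 4: 2.5},
}
query = {3: 2.0, 9: 1.0}

index = build_inverted_index(docs)
results = search(query, index)
print(results)
```

Documents with no overlapping dimensions never enter the score table, so lookup cost depends on the number of active dimensions rather than on the full 32000.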

Evaluation

Metrics

Sparse Information Retrieval

Evaluated with SparseInformationRetrievalEvaluator

Metric	Value
dot_accuracy@1	0.6632
dot_accuracy@3	0.832
dot_accuracy@5	0.8711
dot_accuracy@10	0.9063
dot_precision@1	0.6632
dot_precision@3	0.2773
dot_precision@5	0.1742
dot_precision@10	0.0906
dot_recall@1	0.6632
dot_recall@3	0.832
dot_recall@5	0.8711
dot_recall@10	0.9063
dot_ndcg@10	0.7917
dot_mrr@10	0.7541
dot_map@100	0.7571
query_active_dims	69.6236
query_sparsity_ratio	0.9978
corpus_active_dims	153.6359
corpus_sparsity_ratio	0.9952
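
The sparsity figures in the table follow directly from the active-dimension counts and the 32000-dimensional output. The short check below reproduces them. It also shows that the reported precision values match accuracy@k divided by k, which would be expected if each query has a single relevant document; that is an inference from the numbers, not something stated in this card.

```python
OUTPUT_DIMS = 32000

def sparsity_ratio(active_dims, total_dims=OUTPUT_DIMS):
    """Fraction of dimensions that are zero, given the mean active count."""
    return 1.0 - active_dims / total_dims

query_sparsity = sparsity_ratio(69.6236)
corpus_sparsity = sparsity_ratio(153.6359)
print(round(query_sparsity, 4), round(corpus_sparsity, 4))

# Assumption: one relevant document per query, so precision@k = accuracy@k / k.
accuracy_at = {1: 0.6632, 3: 0.832, 5: 0.8711, 10: 0.9063}
precision_at = {k: round(acc / k, 4) for k, acc in accuracy_at.items()}
print(precision_at)
```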

Training Details

Training Dataset

Unnamed Dataset

Size: 245,876 training samples
Columns: anchor, positive, and negative

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
    "document_regularizer_weight": 0.001,
    "query_regularizer_weight": 0.002
}
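
As a rough illustration of how these parameters combine, the sketch below adds an in-batch ranking loss to two weighted sparsity penalties. It assumes the regularizer is the FLOPS penalty commonly used with SPLADE (the sum over dimensions of the squared mean activation) and that the ranking term is a cross-entropy over dot-product scores. The function names and toy vectors are hypothetical; the library's actual loss runs on batched tensors and may differ in detail.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ranking_loss(queries, docs, scale=1.0):
    """In-batch cross-entropy: docs[i] is the positive for queries[i];
    every other entry in docs acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, d) for d in docs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)

def flops(batch):
    """FLOPS penalty: sum over dimensions of the squared mean activation."""
    n, dims = len(batch), len(batch[0])
    return sum((sum(abs(v[j]) for v in batch) / n) ** 2 for j in range(dims))

def splade_loss(queries, docs, query_weight=0.002, doc_weight=0.001):
    """Ranking loss plus weighted sparsity penalties on queries and documents."""
    return (ranking_loss(queries, docs)
            + query_weight * flops(queries)
            + doc_weight * flops(docs))

# Toy batch: 2 anchors, 2 positives, 1 extra negative (values are made up).
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]

total = splade_loss(queries, docs)
print(ranking_loss(queries, docs), flops(queries), flops(docs), total)
```

With the weights listed above, the query penalty is weighted twice as heavily as the document penalty, which is consistent with the queries ending up sparser than the documents in the evaluation table.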

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
gradient_accumulation_steps: 2
learning_rate: 6e-05
num_train_epochs: 6
lr_scheduler_type: cosine
warmup_ratio: 0.025
fp16: True
optim: adamw_torch_fused
batch_sampler: no_duplicates
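
The following sketch shows how these settings translate into an effective batch size and a learning-rate curve. It assumes a single device (the card does not state the device count) and a linear warmup followed by cosine decay to zero. The step counts are approximate, since the no_duplicates sampler and the trainer's own rounding may change them slightly.

```python
import math

TRAIN_SAMPLES = 245_876
PER_DEVICE_BATCH = 32
GRAD_ACCUM = 2
EPOCHS = 6
PEAK_LR = 6e-5
WARMUP_RATIO = 0.025
NUM_DEVICES = 1  # assumption: not stated in the card

# Samples consumed per optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```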

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}