Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Paper • 2511.07025
How to use ncorder/llama-embed-nemotron-8b-mlx-fp16 with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir llama-embed-nemotron-8b-mlx-fp16 ncorder/llama-embed-nemotron-8b-mlx-fp16
How to use ncorder/llama-embed-nemotron-8b-mlx-fp16 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("ncorder/llama-embed-nemotron-8b-mlx-fp16", trust_remote_code=True)
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

MLX conversion of nvidia/llama-embed-nemotron-8b, the #1 embedding model under 20B parameters on the MTEB leaderboard, outperforming models 3x its size.
Full float16 precision. Near-identical to the original bfloat16 weights.
| Variant | Size | Relevant ↑ | Irrelevant ↓ | Margin ↑ |
|---|---|---|---|---|
| fp16 | 15 GB | 0.3763 | 0.0579 | 0.3184 |
| 8-bit | 7.5 GB | 0.3780 | 0.0583 | 0.3197 |
| 4-bit | 4.0 GB | 0.3826 | 0.0783 | 0.3043 |
| 2-bit | 2.4 GB | 0.4799 | 0.2873 | 0.1926 |
Reference (original bf16 PyTorch): relevant=0.3771, irrelevant=0.0581, margin=0.3190
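
The margin column appears to be the relevant score minus the irrelevant score. A minimal sketch, assuming that definition, recomputes each margin from the table and shows how far it drifts from the bf16 reference. The names `reported`, `reference`, `margin`, and `drift_from_reference` are illustrative only.

```python
# Scores copied from the table above: (relevant, irrelevant, reported margin).
reported = {
    "fp16": (0.3763, 0.0579, 0.3184),
    "8-bit": (0.3780, 0.0583, 0.3197),
    "4-bit": (0.3826, 0.0783, 0.3043),
    "2-bit": (0.4799, 0.2873, 0.1926),
}

# Reference row for the original bf16 PyTorch weights.
reference = (0.3771, 0.0581, 0.3190)


def margin(relevant, irrelevant):
    """Assumed definition: relevant similarity minus irrelevant similarity."""
    return round(relevant - irrelevant, 4)


def drift_from_reference(variant):
    """Difference between a variant's margin and the reference margin."""
    relevant, irrelevant, _ = reported[variant]
    return round(margin(relevant, irrelevant) - reference[2], 4)


for name, (relevant, irrelevant, reported_margin) in reported.items():
    recomputed = margin(relevant, irrelevant)
    print(f"{name}: margin={recomputed:.4f} drift={drift_from_reference(name):+.4f}")
```

Under this reading, fp16 and 8-bit stay within 0.001 of the reference margin, while the 2-bit variant falls 0.1264 below it because its irrelevant score rises sharply.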
pip install mlx-embeddings
from mlx_embeddings.utils import load
import mlx.core as mx
model, tokenizer = load("ncorder/llama-embed-nemotron-8b-mlx-fp16")
query = "Instruct: Given a question, retrieve passages that answer the question\nQuery: How do neural networks learn patterns from examples?"
document = "Deep learning models adjust their weights through backpropagation."
def embed(text):
    # Tokenize to NumPy arrays, then convert to MLX arrays for the model.
    inputs = tokenizer(text, return_tensors="np", padding=True)
    out = model(
        mx.array(inputs["input_ids"]),
        mx.array(inputs["attention_mask"])
    )
    return out.text_embeds
q_emb = embed(query)
d_emb = embed(document)
score = (q_emb @ d_emb.T).item()
print(f"Similarity: {score:.4f}")
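
The dot product above equals cosine similarity only if the embeddings are already L2-normalized, which this card does not state. A minimal sketch in plain Python, using made-up vectors in place of `q_emb` and `d_emb`, normalizes explicitly before scoring. The helpers `l2_normalize` and `cosine_similarity` are illustrative and not part of mlx-embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Toy 3-dimensional vectors (hypothetical values, not real model output).
query_vec = [3.0, 4.0, 0.0]
doc_vec = [6.0, 8.0, 0.0]    # same direction as query_vec, twice the length
other_vec = [0.0, 0.0, 5.0]  # orthogonal to query_vec

same_direction = cosine_similarity(query_vec, doc_vec)
orthogonal = cosine_similarity(query_vec, other_vec)
print(f"same direction: {same_direction:.4f}, orthogonal: {orthogonal:.4f}")
```

If the model output is already unit-length, the explicit normalization is a no-op and the score matches the plain dot product.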
This model is instruction-aware. For retrieval, prefix queries with:
Instruct: {task_instruction}
Query: {your_query}
Documents are embedded without any prefix.
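
The prefix convention can be wrapped in a small helper so queries and documents are formatted consistently. This is a sketch: `format_query` and `format_document` are hypothetical names, and the newline between the two parts follows the query string used in the usage example in this card.

```python
def format_query(task_instruction: str, query: str) -> str:
    """Build the instruction-prefixed query string described above."""
    return f"Instruct: {task_instruction}\nQuery: {query}"


def format_document(document: str) -> str:
    """Documents are passed through unchanged (no prefix)."""
    return document


task = "Given a question, retrieve passages that answer the question"
formatted = format_query(task, "How do neural networks learn patterns from examples?")
print(formatted)
```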
This model inherits the NVIDIA license from the original: research/non-commercial use only. It is also subject to the Llama 3.1 Community License.
nvidia/llama-embed-nemotron-8b
mlx-embeddings by Prince Canuma