ncorder/llama-embed-nemotron-8b-mlx-fp16

MLX conversion of nvidia/llama-embed-nemotron-8b, the #1 embedding model under 20B parameters on the MTEB leaderboard, outperforming models 3x its size.

  • Parameters: 7.5B
  • Quantization: None (float16)
  • Model size: 15 GB
  • Architecture: Llama-3.1-8B with bidirectional attention
  • Embedding dimension: 4096
  • Max sequence length: 32,768 tokens
  • Converted with: mlx-embeddings

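Weights are stored at full float16 precision, with no quantization, and produce scores near-identical to the original bfloat16 model (see the reference row under All variants).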
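The 15 GB figure follows from the parameter count, since float16 stores each weight in 2 bytes. Below is a minimal sketch of that arithmetic, assuming the 7.5B parameter count listed above and decimal gigabytes (1 GB = 1e9 bytes). The function name `nominal_size_gb` is illustrative and not part of any library.

```python
def nominal_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Nominal weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 7.5e9  # parameter count listed above

# Nominal sizes only: real quantized files are usually somewhat larger.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {nominal_size_gb(PARAMS, bits):.2f} GB")
```

The low-bit variants listed under All variants are somewhat larger than these nominal values. One likely reason is that quantized formats typically store extra per-group metadata alongside the packed weights, though the source does not spell this out.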

All variants

Variant   Size      Relevant ↑   Irrelevant ↓   Margin ↑
fp16      15 GB     0.3763       0.0579         0.3184
8-bit     7.5 GB    0.3780       0.0583         0.3197
4-bit     4.0 GB    0.3826       0.0783         0.3043
2-bit     2.4 GB    0.4799       0.2873         0.1926

Reference (original bf16 PyTorch): relevant=0.3771, irrelevant=0.0581, margin=0.3190
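In every row, the Margin column equals Relevant minus Irrelevant. The sketch below recomputes it from the table values and shows each variant's deviation from the bf16 reference margin. The dictionary layout and variable names are illustrative only.

```python
# (relevant, irrelevant) similarity scores copied from the table above.
scores = {
    "fp16":  (0.3763, 0.0579),
    "8-bit": (0.3780, 0.0583),
    "4-bit": (0.3826, 0.0783),
    "2-bit": (0.4799, 0.2873),
}
reference_margin = 0.3771 - 0.0581  # original bf16 PyTorch model

# Margin = relevant score minus irrelevant score.
margins = {name: rel - irr for name, (rel, irr) in scores.items()}

for name, margin in margins.items():
    delta = margin - reference_margin
    print(f"{name:>5}: margin={margin:.4f}  delta vs. reference={delta:+.4f}")
```

The fp16 and 8-bit variants stay within about 0.001 of the reference margin, the 4-bit variant loses about 0.015, and the 2-bit variant loses more than 0.12.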

Usage

pip install mlx-embeddings

from mlx_embeddings.utils import load
import mlx.core as mx

# Load the converted model and its tokenizer.
model, tokenizer = load("ncorder/llama-embed-nemotron-8b-mlx-fp16")

# Queries carry the instruction prefix; documents are embedded as-is.
query = "Instruct: Given a question, retrieve passages that answer the question\nQuery: How do neural networks learn patterns from examples?"
document = "Deep learning models adjust their weights through backpropagation."

def embed(text):
    # Tokenize to NumPy arrays, then convert them to MLX arrays for the model.
    inputs = tokenizer(text, return_tensors="np", padding=True)
    out = model(
        mx.array(inputs["input_ids"]),
        mx.array(inputs["attention_mask"])
    )
    return out.text_embeds

q_emb = embed(query)
d_emb = embed(document)

# Dot product of the two embeddings (equals cosine similarity when both are unit-normalized).
score = (q_emb @ d_emb.T).item()
print(f"Similarity: {score:.4f}")
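The same scoring extends to ranking several documents against one query. Below is a library-free sketch that uses short made-up vectors in place of real 4096-dimensional embeddings. `cosine` and `rank` are hypothetical helper names, and the explicit normalization means the result does not depend on whether the embeddings are already unit-length.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

ranking = rank(query_vec, doc_vecs)
print(ranking)
```

With real embeddings, `query_vec` and each entry of `doc_vecs` would be the vectors returned by `embed`, converted to plain Python lists.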

Query formatting

This model is instruction-aware. For retrieval, prefix queries with:

Instruct: {task_instruction}
Query: {your_query}

Documents are embedded without any prefix.
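The prefix format above can be wrapped in a small helper so that queries and documents are always prepared consistently. This is a minimal sketch: `format_query` and `format_document` are hypothetical names, not part of mlx-embeddings, and the newline between the two parts follows the query string in the Usage example.

```python
def format_query(task_instruction: str, query: str) -> str:
    """Build an instruction-prefixed query string in the format shown above."""
    return f"Instruct: {task_instruction}\nQuery: {query}"

def format_document(text: str) -> str:
    """Documents are embedded as-is, without any prefix."""
    return text

task = "Given a question, retrieve passages that answer the question"
formatted = format_query(task, "How do neural networks learn patterns from examples?")
print(formatted)
```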

License

This model inherits the NVIDIA license from the original: research/non-commercial use only. Also subject to the Llama 3.1 Community License.

Credits
