
Libraries

How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
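The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style completion body with a `choices[0].text` field. The helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumes an OpenAI-style response body: {"choices": [{"text": ...}, ...]}.
    """
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example call (requires a running server):
# text = complete(
#     "http://localhost:8000",
#     "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft",
#     "Once upon a time,",
# )
```

Because the base URL is a parameter, the same helper should also work against the SGLang server on port 30000, provided it exposes the same endpoint.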

Use Docker

docker model run hf.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft

SGLang

How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
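A freshly started container may need some time to download and load the weights, so a request sent immediately after `docker run` can fail. A minimal sketch of a readiness check follows. It assumes the server answers `GET /v1/models` with HTTP 200 once it is ready, which is an assumption about the OpenAI-compatible API surface rather than something stated in this card. The names `http_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url, timeout=300.0, interval=2.0, probe=http_ok):
    """Poll `url` until `probe` succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example call (assumed endpoint, requires the container to be starting up):
# ready = wait_until_ready("http://localhost:30000/v1/models")
```

The `probe` parameter exists so the polling loop can be exercised without a live server.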

Docker Model Runner
How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with Docker Model Runner:
```
docker model run hf.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft
```

LLaMA-MoE-v1-3.5B (2/8) SFT

[💻 Code] | [📜 Technical Report]

This model is LLaMA-MoE-v1-3_5B-2_8 after supervised fine-tuning (SFT) on Deita-6k for 2 epochs.

Model	#Activated Experts	#Experts	#Activated Params	Foundation Model	SFT Model
LLaMA-MoE-3.0B	2	16	3.0B	🤗 base	🤗 SFT
LLaMA-MoE-3.5B (4/16)	4	16	3.5B	🤗 base	🤗 SFT
LLaMA-MoE-3.5B (2/8)	2	8	3.5B	🤗 base	🤗 SFT

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.cuda()

input_text = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. human: Give me a three-day plan in Suzhou. gpt:"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()

pred = model.generate(input_ids, max_length=100, temperature=1.0, do_sample=True, use_cache=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
"""
Sure, I can provide you with a three-day itinerary in Suzhou. Here's what we can do:

Day 1:

* Visit Suzhou Industrial Park, a major commercial and manufacturing district ...
"""
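The `input_text` above follows a fixed single-turn layout: a system sentence, then `human:` with the question, then `gpt:` as the cue for the answer. A small sketch that builds this prompt and trims it from the decoded output is shown below. The helper names `build_prompt` and `extract_reply` are hypothetical, and the layout is inferred only from the example string above. Multi-turn formatting is not covered because this card does not show it.

```python
# System sentence copied from the QuickStart example.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_message, system_prompt=SYSTEM_PROMPT):
    """Format a single-turn prompt in the layout used by the QuickStart example."""
    return f"{system_prompt} human: {user_message.strip()} gpt:"


def extract_reply(decoded_text):
    """Return the text after the first 'gpt:' marker (single-turn assumption).

    If the marker is absent, the whole text is returned unchanged apart from stripping.
    """
    head, marker, tail = decoded_text.partition("gpt:")
    return tail.strip() if marker else head.strip()
```

With these helpers, `input_text = build_prompt("Give me a three-day plan in Suzhou.")` reproduces the string used in the QuickStart, and `extract_reply` can be applied to the decoded output to drop the echoed prompt.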

📊 Performance

Model	MMLU	ARC-c	HellaSwag	TruthfulQA	MT-Bench
Sheared LLaMA-2.7B ShareGPT	28.41	41.04	71.21	47.65	3.79
Sheared LLaMA-2.7B Deita6K (Our Impl.)	25.24	43.69	71.70	49.00	4.06
LLaMA-MoE-v1-3.0B (2/16)	23.61	43.43	72.28	44.24	4.15
LLaMA-MoE-v1-3.5B (4/16)	26.49	48.29	75.10	45.91	4.60
LLaMA-MoE-v1-3.5B (2/8)	25.53	45.99	74.95	44.39	4.72

📃 Citation

@article{llama-moe,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
  journal={arXiv preprint arXiv:2406.16554},
  year={2024},
  url={https://arxiv.org/abs/2406.16554},
}

Safetensors model size: 7B params. Tensor type: BF16.

Paper: LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (arXiv:2406.16554, published Jun 24, 2024)