LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
Paper: arXiv:2406.16554
How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft", trust_remote_code=True)
# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft", trust_remote_code=True, dtype="auto")
How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
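The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the response parsing assumes the server returns the usual OpenAI-style completions body with a `choices` list.

```python
import json
import urllib.request

MODEL_ID = "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumption: the response follows the OpenAI completions shape,
    i.e. body["choices"][0]["text"].
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]

# Usage, with a server already running on port 8000:
#   print(complete("Once upon a time,"))
```

The same sketch should work against the SGLang server shown in this document by passing `base_url="http://localhost:30000"`, since both commands target the `/v1/completions` path.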
How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or launch the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft with Docker Model Runner:
docker model run hf.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft
[Code] | [Technical Report]
This is the supervised fine-tuned version of LLaMA-MoE-v1-3_5B-2_8 on Deita-6k for 2 epochs.
| Model | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
|---|---|---|---|---|---|
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | 🤗 base | 🤗 SFT |
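To make the "#Activated Experts" and "#Experts" columns concrete, here is a generic top-k routing sketch in plain Python: a gate scores every expert, only the k highest-scoring experts run, and their outputs are mixed by softmax weights. This is an illustration of the general technique, not the LLaMA-MoE implementation; the function names, the toy scalar experts, and the gate logits are all hypothetical, and the actual router described in the technical report may differ.

```python
import math


def top_k_route(gate_logits, k):
    """Return [(expert_index, weight), ...] for the k largest gate logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    max_logit = gate_logits[top[0]]  # subtract the max for numerical stability
    exp_scores = [math.exp(gate_logits[i] - max_logit) for i in top]
    total = sum(exp_scores)
    return [(i, score / total) for i, score in zip(top, exp_scores)]


def mix_experts(x, experts, routes):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    return sum(weight * experts[i](x) for i, weight in routes)


# Toy setup mirroring the 2/8 configuration: 8 scalar "experts", 2 active.
experts = [lambda x, scale=i + 1: scale * x for i in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.5, 0.3]  # hypothetical values

routes = top_k_route(gate_logits, k=2)  # picks experts 1 and 4
output = mix_experts(1.0, experts, routes)
```

Because only the selected experts are evaluated, the compute per input scales with the activated experts rather than the total expert count, which is why the table reports activated parameters separately.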
# python>=3.10
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"
# trust_remote_code=True lets Transformers load the custom model code shipped in the model repo
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.cuda()

# Conversation-style prompt: a system message followed by "human:" and "gpt:" turns
input_text = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. human: Give me a three-day plan in Suzhou. gpt:"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()

# max_length counts the prompt tokens plus the newly generated tokens
pred = model.generate(input_ids, max_length=100, temperature=1.0, do_sample=True, use_cache=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
"""
Sure, I can provide you with a three-day itinerary in Suzhou. Here's what we can do:
Day 1:
* Visit Suzhou Industrial Park, a major commercial and manufacturing district ...
"""
| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
|---|---|---|---|---|---|
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |
@article{llama-moe,
title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
journal={arXiv preprint arXiv:2406.16554},
year={2024},
url={https://arxiv.org/abs/2406.16554},
}