Vigogne-2-70B-Chat: A Llama-2-based French Chat LLM
Vigogne-2-70B-Chat is a French chat LLM, based on Llama-2-70B, optimized to generate helpful and coherent responses in conversations with users.
Check out our release blog and GitHub repository for more information.
Usage and License Notices: Vigogne-2-70B-Chat follows Llama-2's usage policy. A significant portion of the training data is distilled from GPT-3.5-Turbo and GPT-4; please use the model cautiously to avoid violating OpenAI's terms of use.
Prompt Template
We used a prompt template adapted from the chat format of Llama-2.
You can apply this formatting using the chat template through the apply_chat_template() method.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bofenghuang/vigogne-2-70b-chat")
conversation = [
    {"role": "user", "content": "Bonjour ! Comment ça va aujourd'hui ?"},
    {"role": "assistant", "content": "Bonjour ! Je suis une IA, donc je n'ai pas de sentiments, mais je suis prêt à vous aider. Comment puis-je vous assister aujourd'hui ?"},
    {"role": "user", "content": "Quelle est la hauteur de la Tour Eiffel ?"},
    {"role": "assistant", "content": "La Tour Eiffel mesure environ 330 mètres de hauteur."},
    {"role": "user", "content": "Comment monter en haut ?"},
]
print(tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True))
You will get:
<s>[INST] <<SYS>>
Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez autant que vous le pouvez.
<</SYS>>
Bonjour ! Comment ça va aujourd'hui ? [/INST] Bonjour ! Je suis une IA, donc je n'ai pas de sentiments, mais je suis prêt à vous aider. Comment puis-je vous assister aujourd'hui ? </s>[INST] Quelle est la hauteur de la Tour Eiffel ? [/INST] La Tour Eiffel mesure environ 330 mètres de hauteur. </s>[INST] Comment monter en haut ? [/INST]
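To make the layout above easier to follow, here is a minimal pure-Python sketch that rebuilds the same string without loading the tokenizer. The helper `build_prompt` and the constant `DEFAULT_SYSTEM_PROMPT` are hypothetical names introduced for illustration; they are not part of the model repository. The blank line after `<</SYS>>` is an assumption taken from the Llama-2 chat convention, so check the exact whitespace against the output of `apply_chat_template()` before relying on it.

```python
from typing import Dict, List

# System prompt copied from the example output above.
DEFAULT_SYSTEM_PROMPT = (
    "Vous êtes Vigogne, un assistant IA créé par Zaion Lab. "
    "Vous suivez extrêmement bien les instructions. "
    "Aidez autant que vous le pouvez."
)


def build_prompt(conversation: List[Dict[str, str]], system: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Hypothetical helper: render user/assistant turns in the layout shown above."""
    # Assumption: a blank line follows <</SYS>>, as in the Llama-2 chat format.
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    for i, message in enumerate(conversation):
        if message["role"] == "user":
            # Every user turn after the first opens a new [INST] block.
            if i > 0:
                prompt += "[INST] "
            prompt += f"{message['content']} [/INST]"
        elif message["role"] == "assistant":
            # Assistant turns are closed with the end-of-sequence marker.
            prompt += f" {message['content']} </s>"
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return prompt


conversation = [
    {"role": "user", "content": "Bonjour !"},
    {"role": "assistant", "content": "Bonjour ! Comment puis-je vous aider ?"},
    {"role": "user", "content": "Quelle est la hauteur de la Tour Eiffel ?"},
]
print(build_prompt(conversation))
```

For real inference, prefer `apply_chat_template()`: it reads the template shipped with the tokenizer, so it stays correct if the template changes.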
Usage
Inference using the unquantized model with 🤗 Transformers
from typing import Dict, List, Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TextStreamer

model_name_or_path = "bofenghuang/vigogne-2-70b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right", use_fast=False)
# Load the weights in half precision and spread them across the available devices.
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16, device_map="auto")

# Print the generated text to stdout as it is produced, without echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)


def chat(
    query: str,
    history: Optional[List[Dict]] = None,
    temperature: float = 0.7,
    top_p: float = 1.0,
    top_k: int = 0,
    repetition_penalty: float = 1.1,
    max_new_tokens: int = 1024,
    **kwargs,
):
    if history is None:
        history = []

    # Add the new user turn, then render the whole conversation with the chat template.
    history.append({"role": "user", "content": query})

    input_ids = tokenizer.apply_chat_template(history, add_generation_prompt=True, return_tensors="pt").to(model.device)
    input_length = input_ids.shape[1]

    generated_outputs = model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(
            temperature=temperature,
            do_sample=temperature > 0.0,
            top_p=top_p,
            top_k=top_k,
            repetition_penalty=repetition_penalty,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
            **kwargs,
        ),
        streamer=streamer,
        return_dict_in_generate=True,
    )

    # Keep only the newly generated tokens, dropping the prompt.
    generated_tokens = generated_outputs.sequences[0, input_length:]
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Store the reply so the next round sees the full conversation.
    history.append({"role": "assistant", "content": generated_text})

    return generated_text, history

# 1st round
response, history = chat("Un escargot parcourt 100 mètres en 5 heures. Quelle est sa vitesse ?", history=None)
# 2nd round
response, history = chat("Quand il peut dépasser le lapin ?", history=history)
# 3rd round
response, history = chat("Écris une histoire imaginative qui met en scène une compétition de course entre un escargot et un lapin.", history=history)
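Because the example loads the unquantized model in `torch.float16`, it helps to estimate how much memory the weights alone require before running it. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter count of roughly 70e9 is an assumption read off the "70B" in the model name, and the helper `estimate_weight_memory_gib` is a hypothetical name introduced here. Activations and the key/value cache need additional memory on top of this figure.

```python
def estimate_weight_memory_gib(num_parameters: float, bytes_per_parameter: int) -> float:
    """Rough size of the model weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


# Assumption: about 70e9 parameters, taken from the "70B" in the model name.
# torch.float16 stores each value in 2 bytes.
fp16_gib = estimate_weight_memory_gib(70e9, 2)
print(f"Approximate fp16 weight size: {fp16_gib:.0f} GiB")
```

Under these assumptions the weights come to roughly 130 GiB, which is why the example relies on `device_map="auto"` to spread them across several devices.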
You can also use the Google Colab Notebook provided below.
Limitations
Vigogne is still under development, and many limitations remain to be addressed. Please note that the model may generate harmful or biased content, incorrect information, or generally unhelpful answers.
Acknowledgements
The model training was conducted on the Jean-Zay supercomputer at GENCI, and we extend our gratitude to the IDRIS team for their responsive support throughout the project.