Instructions for using prithivMLmods/Llama-3.1-8B-Open-SFT with libraries and local apps.

Libraries

How to use prithivMLmods/Llama-3.1-8B-Open-SFT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="prithivMLmods/Llama-3.1-8B-Open-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/Llama-3.1-8B-Open-SFT")
model = AutoModelForCausalLM.from_pretrained("prithivMLmods/Llama-3.1-8B-Open-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
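The slice in the last line is there because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. A toy illustration of that slicing, with plain lists standing in for tensors (the token ids are invented for the example):

```python
# Toy stand-ins for tensors; these ids are made up for illustration.
prompt_ids = [101, 7, 42, 9]          # what the chat template would produce
generated = prompt_ids + [55, 66, 2]  # generate() returns prompt + new tokens

prompt_len = len(prompt_ids)          # plays the role of inputs["input_ids"].shape[-1]
new_ids = generated[prompt_len:]      # keep only the continuation

print(new_ids)
```

Decoding only `new_ids` keeps the echoed prompt out of the printed reply.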

Local Apps

vLLM

How to use prithivMLmods/Llama-3.1-8B-Open-SFT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/Llama-3.1-8B-Open-SFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Llama-3.1-8B-Open-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
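The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and it assumes the server from the previous step is listening on `localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="prithivMLmods/Llama-3.1-8B-Open-SFT"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(ask("What is the capital of France?"))
```

For a server on a different port, pass another `base_url`, for example `ask("...", base_url="http://localhost:30000")`.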

Use Docker

docker model run hf.co/prithivMLmods/Llama-3.1-8B-Open-SFT

SGLang

How to use prithivMLmods/Llama-3.1-8B-Open-SFT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/Llama-3.1-8B-Open-SFT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Llama-3.1-8B-Open-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/Llama-3.1-8B-Open-SFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Llama-3.1-8B-Open-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use prithivMLmods/Llama-3.1-8B-Open-SFT with Docker Model Runner:
```
docker model run hf.co/prithivMLmods/Llama-3.1-8B-Open-SFT
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Llama-3.1-8B-Open-SFT

Llama-3.1-8B-Open-SFT is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct for text generation tasks, including conversational interaction, question answering, and chain-of-thought reasoning. It was trained with supervised fine-tuning (SFT) on the O1-OPEN/OpenO1-SFT dataset to improve context-sensitive responses and instruction following.

| File Name | Size | Description | Upload Status |
|---|---|---|---|
| `.gitattributes` | 1.57 kB | Git LFS configuration for tracking large files. | Uploaded |
| `README.md` | 324 Bytes | Updated README with minimal information. | Uploaded |
| `config.json` | 1.03 kB | Model configuration and metadata. | Uploaded |
| `generation_config.json` | 248 Bytes | Configuration for text generation specifics. | Uploaded |
| `pytorch_model-00001-of-00004.bin` | 4.98 GB | First shard of PyTorch model. | Uploaded (LFS) |
| `pytorch_model-00002-of-00004.bin` | 5.00 GB | Second shard of PyTorch model. | Uploaded (LFS) |
| `pytorch_model-00003-of-00004.bin` | 4.92 GB | Third shard of PyTorch model. | Uploaded (LFS) |
| `pytorch_model-00004-of-00004.bin` | 1.17 GB | Final shard of PyTorch model. | Uploaded (LFS) |
| `pytorch_model.bin.index.json` | 24.2 kB | Index file for model shards. | Uploaded |
| `special_tokens_map.json` | 357 Bytes | Map for special tokens used in tokenizer. | Uploaded |
| `tokenizer.json` | 17.2 MB | Full tokenizer JSON file. | Uploaded (LFS) |
| `tokenizer_config.json` | 57.4 kB | Configuration for the tokenizer. | Uploaded |

[Figure: sample long chain-of-thought (CoT) output]

Key Features

Text Generation with CoT Reasoning:
- Produces chain-of-thought (CoT) style, step-by-step reasoning for logic-heavy tasks.
Conversational AI:
- Excels in generating context-aware and coherent responses in multi-turn conversations.
Supervised Fine-Tuning (SFT):
- Optimized for open-domain tasks using the O1-OPEN/OpenO1-SFT dataset.
Multi-Purpose Functionality:
- Supports a wide range of NLP tasks, including summarization, question answering, and text completion.
Scalable Sharded Architecture:
- Model weights are split across four shard files of at most about 5 GB each, so they can be downloaded and loaded one file at a time.
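Sharded checkpoints are tied together by the index file listed in the file table (`pytorch_model.bin.index.json`), which maps each parameter name to the shard that stores it. The sketch below groups parameters by shard using a tiny, made-up index in that layout; the parameter names and the `total_size` placeholder are illustrative and were not read from the actual repository.

```python
import json
from collections import defaultdict

# A tiny, made-up index in the layout used for sharded checkpoints.
# "total_size" is a placeholder value, not the real figure.
index_json = """
{
  "metadata": {"total_size": 0},
  "weight_map": {
    "model.embed_tokens.weight": "pytorch_model-00001-of-00004.bin",
    "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00004.bin",
    "model.layers.31.mlp.down_proj.weight": "pytorch_model-00004-of-00004.bin",
    "lm_head.weight": "pytorch_model-00004-of-00004.bin"
  }
}
"""

def params_by_shard(index):
    """Group parameter names by the shard file that holds them."""
    groups = defaultdict(list)
    for name, shard in index["weight_map"].items():
        groups[shard].append(name)
    return dict(groups)

shards = params_by_shard(json.loads(index_json))
for shard, names in sorted(shards.items()):
    print(shard, len(names))
```

A loader can use this mapping to open only the shard that contains the tensors it currently needs.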

Training Details

Base Model: meta-llama/Llama-3.1-8B
Finetuned Dataset: O1-OPEN/OpenO1-SFT
- Dataset includes 77.7k fine-tuning samples, curated for instruction-based and open-domain tasks.
Model Size:
- 8 billion parameters, stored in 4 shards for easier download and loading.
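As a rough consistency check, the shard sizes in the file table can be summed and converted to a parameter count. This sketch assumes the weights are stored at 2 bytes per parameter (FP16 or BF16), which the card does not state explicitly.

```python
# Shard sizes in GB, copied from the file table.
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]

total_gb = sum(shard_sizes_gb)

# Assumption: 2 bytes per parameter (FP16 or BF16 storage).
bytes_per_param = 2
approx_params_billion = total_gb / bytes_per_param  # GB / (bytes per param) = billions

print(f"total: {total_gb:.2f} GB, ~{approx_params_billion:.1f}B parameters")
```

Under that assumption the four shards add up to about 16 GB, which matches the stated size of roughly 8 billion parameters.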

Applications

Chain-of-Thought (CoT) Reasoning:
- Solve complex problems step-by-step with logical reasoning capabilities.
Conversational Agents:
- Ideal for chatbots, virtual assistants, and conversational systems.
Question Answering:
- Answer open-domain or context-specific questions accurately.
Text Completion:
- Generate coherent continuations for incomplete inputs.
Creative Writing:
- Support for generating stories, articles, or brainstorming ideas.
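For the reasoning and conversational use cases above, prompts are usually passed as a list of role-tagged messages, as in the Transformers examples earlier. The helper below is a hypothetical sketch: the function name `build_messages` and the system prompt wording are assumptions, not something the model card prescribes.

```python
def build_messages(question, history=None):
    """Return a chat message list that asks for step-by-step reasoning."""
    messages = [{
        "role": "system",
        # Assumed wording; adjust to taste.
        "content": "Reason step by step, then state the final answer.",
    }]
    messages.extend(history or [])  # earlier user/assistant turns, if any
    messages.append({"role": "user", "content": question})
    return messages

history = [
    {"role": "user", "content": "What is 12 * 3?"},
    {"role": "assistant", "content": "12 * 3 = 36."},
]
messages = build_messages("Now add 14 to that result.", history)

# The list can then be passed to a text-generation pipeline: pipe(messages)
```

Keeping earlier turns in `history` is what lets the model resolve references such as "that result" in a multi-turn conversation.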

Usage

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prithivMLmods/Llama-3.1-8B-Open-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Inference Example

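prompt = """
Explain the concept of gravity in a simple way suitable for a 10-year-old:
"""
# Move the tokenized prompt to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to take effect;
# max_new_tokens limits only the generated continuation, not the prompt
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print("Model Output:", response)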

Example Output

Generated text varies with sampling settings and library versions. A representative response:

"Gravity is a force that pulls things toward each other. It's the reason why things fall to the ground when you drop them. On Earth, gravity keeps us on the ground and makes sure everything stays in place, like your toys, the water in the ocean, and even the air we breathe."

Performance Requirements

Hardware:
- High-performance GPUs are recommended for efficient inference.
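- Minimum memory for the weights alone: ~16 GB VRAM in half precision (FP16/BF16), ~32 GB in full FP32 precision, and ~8 GB with 8-bit quantization. Activations and the KV cache need additional memory.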
Optimization Options:
- Prefer Safetensors weights where available for safer and faster loading; this repository lists PyTorch `.bin` shards.
- Apply quantization or model parallelism in resource-constrained environments.
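The memory figures above follow from simple arithmetic: parameter count times bytes per parameter. A minimal sketch of that estimate, assuming a nominal 8 billion parameters and counting weights only (no activations, KV cache, or framework overhead):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params, precision):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

n_params = 8e9  # nominal "8 billion" from the model name
for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: ~{weight_memory_gb(n_params, precision):.0f} GB")
```

Real usage is higher than these numbers, so they are best read as lower bounds when choosing a GPU.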