Instructions to use Qwen/Qwen-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Qwen/Qwen-7B with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Qwen/Qwen-7B", trust_remote_code=True)# Load model directly from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use Qwen/Qwen-7B with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Qwen/Qwen-7B" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen-7B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/Qwen/Qwen-7B
- SGLang
How to use Qwen/Qwen-7B with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Qwen/Qwen-7B" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen-7B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Qwen/Qwen-7B" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qwen/Qwen-7B", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use Qwen/Qwen-7B with Docker Model Runner:
docker model run hf.co/Qwen/Qwen-7B
Transforming Qwen 7B into Your Own Reasoning Model on AWS accounts
#22
by samagra-tensorfuse - opened
Here are the optimisation strategies we have followed to achieve it:
- GRPO (DeepSeek’s RL algo) + Unsloth = 2x faster training.
- Deployed a vLLM server using Tensorfuse on AWS L40 GPU with just one CLI command—no infrastructure headaches!
- Saved fine-tuned LoRA modules directly to Hugging Face for easy sharing, versioning and integration.
Step-by-step guide: https://tensorfuse.io/docs/guides/reasoning/unsloth/qwen7b
Hope this helps you boost your LLM workflows.
We’re looking forward to any thoughts or feedback. Feel free to share any issues you run into or suggestions for future enhancements 🤝.
Let’s build something amazing together! 🌟
Sign up for Tensorfuse here: https://prod.tensorfuse.io/