Instructions to use moelanoby/Qwen2-1.5B-wina with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use moelanoby/Qwen2-1.5B-wina with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="moelanoby/Qwen2-1.5B-wina")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("moelanoby/Qwen2-1.5B-wina")
model = AutoModelForCausalLM.from_pretrained("moelanoby/Qwen2-1.5B-wina")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use moelanoby/Qwen2-1.5B-wina with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "moelanoby/Qwen2-1.5B-wina"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "moelanoby/Qwen2-1.5B-wina",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/moelanoby/Qwen2-1.5B-wina
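The curl request in the vLLM instructions above can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names (`build_chat_request`, `extract_reply`, `chat`) are hypothetical and not part of vLLM or this repository. It assumes a server is already running at the given base URL and returns an OpenAI-style chat completion response.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, user_message: str, timeout: float = 60.0) -> str:
    """Send one user message to a running server and return the reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the vLLM server from the commands above running, `chat("http://localhost:8000", "moelanoby/Qwen2-1.5B-wina", "What is the capital of France?")` would issue the same request as the curl call. For the SGLang server described on this page, the base URL would be `http://localhost:30000` instead.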
- SGLang
How to use moelanoby/Qwen2-1.5B-wina with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moelanoby/Qwen2-1.5B-wina" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "moelanoby/Qwen2-1.5B-wina",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moelanoby/Qwen2-1.5B-wina" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "moelanoby/Qwen2-1.5B-wina",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use moelanoby/Qwen2-1.5B-wina with Docker Model Runner:
docker model run hf.co/moelanoby/Qwen2-1.5B-wina
Use the following script to load the model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Critical: import the custom function needed to build the model ---
# wina_utils.py is in the model folder; it must be importable from where you run this script.
from wina_utils import apply_wina_to_model

# --- Configuration ---
# The path to your locally saved model, or its ID on the Hugging Face Hub
MODEL_PATH = "moelanoby/Qwen2-1.5B-wina"
# Must stay at 0.65 (65%); any other value might cause errors.
SPARSITY_LEVEL = 0.65


def load_custom_wina_model(model_path: str, sparsity: float):
    """
    Load a custom WINA-modified model and its tokenizer.

    Returns a (model, tokenizer) tuple.
    """
    print(f"Loading WINA model from: {model_path}")

    # --- Step 1: Load the tokenizer ---
    # trust_remote_code=True is needed here because the model folder contains wina_utils.py
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # --- Step 2: Load the BASE model architecture ---
    # We load the original model structure first. It will have the standard MLP layers.
    # trust_remote_code=True allows it to execute the original model's code if needed.
    print("Loading the base model architecture...")
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )

    # --- Step 3: Manually apply the WINA transformation to the architecture ---
    # This swaps the standard MLP layers with our WinaMLP layers, creating the
    # correct structure in memory to receive the saved weights.
    print("Applying WINA transformation to the loaded architecture...")
    apply_wina_to_model(model, sparsity_level=sparsity)

    print("\n" + "=" * 50)
    print("SUCCESS: Custom WINA model loaded correctly.")
    print("=" * 50)
    return model, tokenizer


if __name__ == "__main__":
    wina_model, wina_tokenizer = load_custom_wina_model(MODEL_PATH, SPARSITY_LEVEL)

    # You can now use the model for inference
    print("\nVerifying model architecture (you should see 'WinaMLP' layers):")
    print(wina_model)
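Instead of reading the printed architecture by eye, the check for `WinaMLP` layers can be done programmatically, and the loaded model can then be used for a short generation. The following is a minimal sketch. The helper names `find_modules_of_class` and `generate_reply` are hypothetical and not part of `wina_utils`. It assumes the replaced layers' class is literally named `WinaMLP`, as the script's message suggests, and it reuses the chat-template calls from the Transformers snippet on this page.

```python
from typing import Any, List


def find_modules_of_class(model: Any, class_name: str) -> List[str]:
    """Return the names of all submodules whose class is called `class_name`.

    Relies on `named_modules()`, which every torch.nn.Module provides.
    """
    return [
        name
        for name, module in model.named_modules()
        if type(module).__name__ == class_name
    ]


def generate_reply(model: Any, tokenizer: Any, prompt: str, max_new_tokens: int = 40) -> str:
    """Generate a reply to a single user prompt using the tokenizer's chat template."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

After `load_custom_wina_model` returns, `find_modules_of_class(wina_model, "WinaMLP")` should give a non-empty list if the transformation was applied, and `generate_reply(wina_model, wina_tokenizer, "Who are you?")` would return the decoded reply text.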
Model tree for moelanoby/Qwen2-1.5B-wina
Base model
Qwen/Qwen2-1.5B-Instruct