# Google Cloud Platform Deployment Guide
This guide covers deploying the Router Agent application to Google Cloud Platform with GPU support.
## Prerequisites
1. **Google Cloud Account** with billing enabled
2. **gcloud CLI** installed and configured
```bash
curl https://sdk.cloud.google.com | bash
gcloud init
```
3. **Docker** installed locally
4. **HF_TOKEN** environment variable set (for accessing private models)
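A quick sanity check of the prerequisites before deploying (the token value is a placeholder):

```bash
# Confirm the tooling is in place
gcloud auth list                    # an active account should be listed
gcloud config get-value project     # prints the currently configured project
docker --version                    # Docker must be installed locally
export HF_TOKEN="hf_..."            # placeholder; use your real token
```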
## Deployment Options
### Option 1: Cloud Run (Serverless, CPU only)
**Pros:**
- Serverless, pay-per-use
- Auto-scaling
- No VM management
**Cons:**
- No GPU support (CPU inference only)
- Cold starts
- Memory is capped (this guide deploys with 8 GiB; Cloud Run tops out at 32 GiB)
**Steps:**
```bash
# Set your project ID
export GCP_PROJECT_ID="your-project-id"
export GCP_REGION="us-central1"
# Make script executable
chmod +x deploy-gcp.sh
# Deploy to Cloud Run
./deploy-gcp.sh cloud-run
```
**Cost:** ~$0.10-0.50/hour when active (depends on traffic)
### Option 2: Compute Engine with GPU (Recommended for Production)
**Pros:**
- Full GPU support (T4, V100, A100)
- Persistent instance
- Better for long-running workloads
- Lower latency (no cold starts)
**Cons:**
- Requires VM management
- Higher cost for always-on instances
**Steps:**
```bash
# Set your project ID and zone
export GCP_PROJECT_ID="your-project-id"
export GCP_ZONE="us-central1-a"
export HF_TOKEN="your-huggingface-token"
# Make script executable
chmod +x deploy-compute-engine.sh
# Deploy to Compute Engine
./deploy-compute-engine.sh
```
**GPU Options:**
- **T4** (nvidia-tesla-t4): ~$0.35/hour - 16 GB VRAM; 27B-32B models fit only with 4-bit quantization
- **V100** (nvidia-tesla-v100): ~$2.50/hour - Better performance
- **A100** (nvidia-a100): ~$3.50/hour - Best performance for large models
**Cost:** GPU instance + storage (~$0.35-3.50/hour depending on GPU type)
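To confirm which accelerator types a zone actually offers before choosing:

```bash
# List GPU types available in the zone used throughout this guide
gcloud compute accelerator-types list --filter="zone:us-central1-a"
```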
## Manual Deployment Steps
### 1. Build and Push Docker Image
```bash
# Authenticate Docker
gcloud auth configure-docker
# Set project
gcloud config set project YOUR_PROJECT_ID
# Build image
docker build -t gcr.io/YOUR_PROJECT_ID/router-agent:latest .
# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/router-agent:latest
```
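Container Registry (`gcr.io`) is deprecated in favor of Artifact Registry; an equivalent flow using Artifact Registry looks like this (the repository name `router-agent` is an assumption):

```bash
# One-time: create a Docker repository and configure auth for its regional host
gcloud artifacts repositories create router-agent \
  --repository-format=docker --location=us-central1
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build and push using the Artifact Registry image path
docker build -t us-central1-docker.pkg.dev/YOUR_PROJECT_ID/router-agent/router-agent:latest .
docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/router-agent/router-agent:latest
```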
### 2. Deploy to Cloud Run (CPU)
```bash
gcloud run deploy router-agent \
  --image gcr.io/YOUR_PROJECT_ID/router-agent:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 7860 \
  --memory 8Gi \
  --cpu 4 \
  --timeout 3600 \
  --set-env-vars "HF_TOKEN=your-token,GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860"
```
### 3. Deploy to Compute Engine (GPU)
```bash
# Create VM with GPU
gcloud compute instances create router-agent-gpu \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator="type=nvidia-tesla-t4,count=1" \
  --image-family=cos-stable \
  --image-project=cos-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --scopes=https://www.googleapis.com/auth/cloud-platform
# SSH into instance
gcloud compute ssh router-agent-gpu --zone=us-central1-a
# COS images ship with Docker preinstalled, so there is no install step.
# Note: `--gpus all` assumes the NVIDIA container runtime is available;
# on COS, use the driver bind-mount variant shown after this block instead.
docker pull gcr.io/YOUR_PROJECT_ID/router-agent:latest
docker run -d \
  --name router-agent \
  --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN="your-token" \
  gcr.io/YOUR_PROJECT_ID/router-agent:latest
```
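Because the VM above uses Container-Optimized OS, the documented pattern is to install the GPU driver with `cos-extensions` and bind-mount it into the container rather than relying on `--gpus all`; a sketch per GCP's COS GPU docs (device paths assume a single GPU):

```bash
# On the COS VM: install the NVIDIA driver and make it executable
sudo cos-extensions install gpu
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia

# Authenticate to the registry (docker-credential-gcr ships with COS)
docker-credential-gcr configure-docker

# Run the container with the driver directories and devices mapped in
docker run -d \
  --name router-agent \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
  --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-uvm \
  -p 7860:7860 \
  -e HF_TOKEN="your-token" \
  gcr.io/YOUR_PROJECT_ID/router-agent:latest
```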
## Environment Variables
Set these in Cloud Run or as VM metadata:
- `HF_TOKEN`: Hugging Face access token (required for private models)
- `GRADIO_SERVER_NAME`: Bind address for the server (default: 0.0.0.0)
- `GRADIO_SERVER_PORT`: Server port (default: 7860)
- `ROUTER_PREFETCH_MODELS`: Comma-separated list of models to preload
- `ROUTER_WARM_REMAINING`: Set to "1" to warm remaining models
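For example, setting the router-specific variables on an existing Cloud Run service (model names are placeholders; the leading `^:^` switches gcloud's list delimiter to `:` so the comma-separated model list survives, per `gcloud topic escaping`):

```bash
gcloud run services update router-agent \
  --region us-central1 \
  --update-env-vars "^:^ROUTER_PREFETCH_MODELS=org/model-a,org/model-b:ROUTER_WARM_REMAINING=1"
```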
## Monitoring and Logs
### Cloud Run Logs
```bash
gcloud run services logs read router-agent --region us-central1
```
### Compute Engine Logs
```bash
gcloud compute instances get-serial-port-output router-agent-gpu --zone us-central1-a
```
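Serial port output covers boot and system logs; for the application itself, one option is to stream the container's logs over SSH (the container name matches the `docker run` command above):

```bash
gcloud compute ssh router-agent-gpu --zone us-central1-a \
  --command "docker logs -f router-agent"
```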
## Cost Optimization
1. **Cloud Run**: Use only when needed, auto-scales to zero
2. **Compute Engine**:
   - Use Spot (formerly preemptible) instances for up to ~80% cost savings, at the risk of preemption (see the sketch after this list)
   - Stop the instance when not in use: `gcloud compute instances stop router-agent-gpu --zone us-central1-a`
   - Use smaller GPU types (T4) for development, larger (A100) for production
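A sketch of the same GPU VM as a Spot instance (the instance name is an assumption; Spot VMs can be preempted at any time):

```bash
gcloud compute instances create router-agent-gpu-spot \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator="type=nvidia-tesla-t4,count=1" \
  --image-family=cos-stable \
  --image-project=cos-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --scopes=https://www.googleapis.com/auth/cloud-platform
```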
## Troubleshooting
### GPU Not Available
- Check GPU quota: `gcloud compute project-info describe --project YOUR_PROJECT_ID`
- Request quota increase if needed
- Verify GPU drivers are installed on the Compute Engine VM (quick checks below)
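Two quick checks, assuming the COS-based setup above (the quota metric name is an example):

```bash
# GPU quotas (e.g. NVIDIA_T4_GPUS) are regional and appear under `quotas:` here
gcloud compute regions describe us-central1

# On a COS VM, verify the driver after `cos-extensions install gpu`
sudo /var/lib/nvidia/bin/nvidia-smi
```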
### Out of Memory
- Increase Cloud Run memory: `--memory 16Gi`
- Use larger VM instance type
- Enable model quantization (AWQ/BitsAndBytes)
### Cold Starts (Cloud Run)
- Use Cloud Run `--min-instances` to keep an instance warm (example below)
- Pre-warm models on startup
- Consider Compute Engine for always-on workloads
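For example, keeping one instance warm (note that this adds always-on cost):

```bash
gcloud run services update router-agent \
  --region us-central1 \
  --min-instances 1
```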
## Security
1. **Authentication**: Use Cloud Run authentication or Cloud IAP for Compute Engine
2. **Secrets**: Store HF_TOKEN in Secret Manager (sketch after this list)
3. **Firewall**: Restrict access to specific IP ranges
4. **HTTPS**: Use Cloud Load Balancer with SSL certificate
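A sketch for item 2, assuming the Secret Manager API is enabled and the service's runtime service account has the Secret Manager Secret Accessor role (the secret name `hf-token` is an assumption):

```bash
# Store the token once, then expose it to Cloud Run as the HF_TOKEN env var
echo -n "$HF_TOKEN" | gcloud secrets create hf-token --data-file=-
gcloud run services update router-agent \
  --region us-central1 \
  --update-secrets "HF_TOKEN=hf-token:latest"
```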
## Next Steps
1. Set up Cloud Load Balancer for HTTPS
2. Configure monitoring and alerts
3. Set up CI/CD with Cloud Build (starter command below)
4. Use Cloud Storage for model caching
5. Implement auto-scaling policies
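A starter for step 3: a one-off Cloud Build invocation (assumes a Dockerfile at the repository root; a trigger-based pipeline would build on this):

```bash
gcloud builds submit --tag gcr.io/$GCP_PROJECT_ID/router-agent:latest .
```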