# Google Cloud Platform Deployment Guide
This guide covers deploying the Router Agent application to Google Cloud Platform with GPU support.
## Prerequisites
1. **Google Cloud Account** with billing enabled
2. **gcloud CLI** installed and configured
```bash
curl https://sdk.cloud.google.com | bash
gcloud init
```
3. **Docker** installed locally
4. **HF_TOKEN** environment variable set (for accessing private models)
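A quick sanity check of the prerequisites before deploying (the token value is a placeholder):

```bash
# Confirm the tooling is in place
gcloud auth list                    # an active account should be listed
gcloud config get-value project     # prints the currently configured project
docker --version                    # Docker must be installed locally
export HF_TOKEN="hf_..."            # placeholder; use your real token
```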
## Deployment Options
### Option 1: Cloud Run (Serverless, CPU only)
**Pros:**
- Serverless, pay-per-use
- Auto-scaling
- No VM management
**Cons:**
- No GPU support (CPU inference only)
- Cold starts
- Memory is capped (this guide deploys with 8 GiB; Cloud Run tops out at 32 GiB)
**Steps:**
```bash
# Set your project ID
export GCP_PROJECT_ID="your-project-id"
export GCP_REGION="us-central1"
# Make script executable
chmod +x deploy-gcp.sh
# Deploy to Cloud Run
./deploy-gcp.sh cloud-run
```
**Cost:** ~$0.10-0.50/hour when active (depends on traffic)
### Option 2: Compute Engine with GPU (Recommended for Production)
**Pros:**
- Full GPU support (T4, V100, A100)
- Persistent instance
- Better for long-running workloads
- Lower latency (no cold starts)
**Cons:**
- Requires VM management
- Higher cost for always-on instances
**Steps:**
```bash
# Set your project ID and zone
export GCP_PROJECT_ID="your-project-id"
export GCP_ZONE="us-central1-a"
export HF_TOKEN="your-huggingface-token"
# Make script executable
chmod +x deploy-compute-engine.sh
# Deploy to Compute Engine
./deploy-compute-engine.sh
```
**GPU Options:**
- **T4** (nvidia-tesla-t4): ~$0.35/hour - 16 GB VRAM; 27B-32B models fit only with 4-bit quantization
- **V100** (nvidia-tesla-v100): ~$2.50/hour - Better performance
- **A100** (nvidia-a100): ~$3.50/hour - Best performance for large models
**Cost:** GPU instance + storage (~$0.35-3.50/hour depending on GPU type)
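To confirm which accelerator types a zone actually offers before choosing:

```bash
# List GPU types available in the zone used throughout this guide
gcloud compute accelerator-types list --filter="zone:us-central1-a"
```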
## Manual Deployment Steps
### 1. Build and Push Docker Image
```bash
# Authenticate Docker
gcloud auth configure-docker
# Set project
gcloud config set project YOUR_PROJECT_ID
# Build image
docker build -t gcr.io/YOUR_PROJECT_ID/router-agent:latest .
# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/router-agent:latest
```
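Container Registry (`gcr.io`) is deprecated in favor of Artifact Registry; an equivalent flow using Artifact Registry looks like this (the repository name `router-agent` is an assumption):

```bash
# One-time: create a Docker repository and configure auth for its regional host
gcloud artifacts repositories create router-agent \
  --repository-format=docker --location=us-central1
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build and push using the Artifact Registry image path
docker build -t us-central1-docker.pkg.dev/YOUR_PROJECT_ID/router-agent/router-agent:latest .
docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/router-agent/router-agent:latest
```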
### 2. Deploy to Cloud Run (CPU)
```bash
gcloud run deploy router-agent \
  --image gcr.io/YOUR_PROJECT_ID/router-agent:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 7860 \
  --memory 8Gi \
  --cpu 4 \
  --timeout 3600 \
  --set-env-vars "HF_TOKEN=your-token,GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860"
```
### 3. Deploy to Compute Engine (GPU)
```bash
# Create VM with GPU
gcloud compute instances create router-agent-gpu \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator="type=nvidia-tesla-t4,count=1" \
  --image-family=cos-stable \
  --image-project=cos-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --scopes=https://www.googleapis.com/auth/cloud-platform
# SSH into instance
gcloud compute ssh router-agent-gpu --zone=us-central1-a
# COS images ship with Docker preinstalled, so there is no install step.
# Note: `--gpus all` assumes the NVIDIA container runtime is available;
# on COS, use the driver bind-mount variant shown after this block instead.
docker pull gcr.io/YOUR_PROJECT_ID/router-agent:latest
docker run -d \
  --name router-agent \
  --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN="your-token" \
  gcr.io/YOUR_PROJECT_ID/router-agent:latest
```
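Because the VM above uses Container-Optimized OS, the documented pattern is to install the GPU driver with `cos-extensions` and bind-mount it into the container rather than relying on `--gpus all`; a sketch per GCP's COS GPU docs (device paths assume a single GPU):

```bash
# On the COS VM: install the NVIDIA driver and make it executable
sudo cos-extensions install gpu
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia

# Authenticate to the registry (docker-credential-gcr ships with COS)
docker-credential-gcr configure-docker

# Run the container with the driver directories and devices mapped in
docker run -d \
  --name router-agent \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
  --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-uvm \
  -p 7860:7860 \
  -e HF_TOKEN="your-token" \
  gcr.io/YOUR_PROJECT_ID/router-agent:latest
```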
## Environment Variables
Set these in Cloud Run or as VM metadata:
- `HF_TOKEN`: Hugging Face access token (required for private models)
- `GRADIO_SERVER_NAME`: Bind address for the server (default: 0.0.0.0)
- `GRADIO_SERVER_PORT`: Server port (default: 7860)
- `ROUTER_PREFETCH_MODELS`: Comma-separated list of models to preload
- `ROUTER_WARM_REMAINING`: Set to "1" to warm remaining models
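For example, setting the router-specific variables on an existing Cloud Run service (model names are placeholders; the leading `^:^` switches gcloud's list delimiter to `:` so the comma-separated model list survives, per `gcloud topic escaping`):

```bash
gcloud run services update router-agent \
  --region us-central1 \
  --update-env-vars "^:^ROUTER_PREFETCH_MODELS=org/model-a,org/model-b:ROUTER_WARM_REMAINING=1"
```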
## Monitoring and Logs
### Cloud Run Logs
```bash
gcloud run services logs read router-agent --region us-central1
```
### Compute Engine Logs
```bash
gcloud compute instances get-serial-port-output router-agent-gpu --zone us-central1-a
```
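Serial port output covers boot and system logs; for the application itself, one option is to stream the container's logs over SSH (the container name matches the `docker run` command above):

```bash
gcloud compute ssh router-agent-gpu --zone us-central1-a \
  --command "docker logs -f router-agent"
```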
## Cost Optimization
1. **Cloud Run**: Use only when needed, auto-scales to zero
2. **Compute Engine**:
   - Use Spot (formerly preemptible) instances for up to ~80% cost savings, at the risk of preemption (see the sketch after this list)
   - Stop the instance when not in use: `gcloud compute instances stop router-agent-gpu --zone us-central1-a`
   - Use smaller GPU types (T4) for development, larger (A100) for production
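A sketch of the same GPU VM as a Spot instance (the instance name is an assumption; Spot VMs can be preempted at any time):

```bash
gcloud compute instances create router-agent-gpu-spot \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator="type=nvidia-tesla-t4,count=1" \
  --image-family=cos-stable \
  --image-project=cos-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --scopes=https://www.googleapis.com/auth/cloud-platform
```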
## Troubleshooting
### GPU Not Available
- Check GPU quota: `gcloud compute project-info describe --project YOUR_PROJECT_ID`
- Request quota increase if needed
- Verify GPU drivers are installed on the Compute Engine VM (quick checks below)
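Two quick checks, assuming the COS-based setup above (the quota metric name is an example):

```bash
# GPU quotas (e.g. NVIDIA_T4_GPUS) are regional and appear under `quotas:` here
gcloud compute regions describe us-central1

# On a COS VM, verify the driver after `cos-extensions install gpu`
sudo /var/lib/nvidia/bin/nvidia-smi
```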
### Out of Memory
- Increase Cloud Run memory: `--memory 16Gi`
- Use larger VM instance type
- Enable model quantization (AWQ/BitsAndBytes)
### Cold Starts (Cloud Run)
- Use Cloud Run `--min-instances` to keep an instance warm (example below)
- Pre-warm models on startup
- Consider Compute Engine for always-on workloads
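For example, keeping one instance warm (note that this adds always-on cost):

```bash
gcloud run services update router-agent \
  --region us-central1 \
  --min-instances 1
```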
## Security
1. **Authentication**: Use Cloud Run authentication or Cloud IAP for Compute Engine
2. **Secrets**: Store HF_TOKEN in Secret Manager (sketch after this list)
3. **Firewall**: Restrict access to specific IP ranges
4. **HTTPS**: Use Cloud Load Balancer with SSL certificate
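A sketch for item 2, assuming the Secret Manager API is enabled and the service's runtime service account has the Secret Manager Secret Accessor role (the secret name `hf-token` is an assumption):

```bash
# Store the token once, then expose it to Cloud Run as the HF_TOKEN env var
echo -n "$HF_TOKEN" | gcloud secrets create hf-token --data-file=-
gcloud run services update router-agent \
  --region us-central1 \
  --update-secrets "HF_TOKEN=hf-token:latest"
```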
## Next Steps
1. Set up Cloud Load Balancer for HTTPS
2. Configure monitoring and alerts
3. Set up CI/CD with Cloud Build (starter command below)
4. Use Cloud Storage for model caching
5. Implement auto-scaling policies
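A starter for step 3: a one-off Cloud Build invocation (assumes a Dockerfile at the repository root; a trigger-based pipeline would build on this):

```bash
gcloud builds submit --tag gcr.io/$GCP_PROJECT_ID/router-agent:latest .
```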