# Google Cloud Platform Deployment Guide

This guide covers deploying the Router Agent application to Google Cloud Platform with GPU support.

## Prerequisites

1. **Google Cloud Account** with billing enabled
2. **gcloud CLI** installed and configured

   ```bash
   curl https://sdk.cloud.google.com | bash
   gcloud init
   ```

3. **Docker** installed locally
4. **HF_TOKEN** environment variable set (for accessing private models)

## Deployment Options

### Option 1: Cloud Run (Serverless, CPU Only)

**Pros:**

- Serverless, pay-per-use
- Auto-scaling
- No VM management

**Cons:**

- No GPU support (CPU inference only)
- Cold starts
- Limited memory (this guide deploys with 8 GiB; see Troubleshooting for raising it)

**Steps:**

```bash
# Set your project ID
export GCP_PROJECT_ID="your-project-id"
export GCP_REGION="us-central1"

# Make the script executable
chmod +x deploy-gcp.sh

# Deploy to Cloud Run
./deploy-gcp.sh cloud-run
```

**Cost:** ~$0.10-0.50/hour when active (depends on traffic)

### Option 2: Compute Engine with GPU (Recommended for Production)

**Pros:**

- Full GPU support (T4, V100, A100)
- Persistent instance
- Better for long-running workloads
- Lower latency (no cold starts)

**Cons:**

- Requires VM management
- Higher cost for always-on instances

**Steps:**

```bash
# Set your project ID and zone
export GCP_PROJECT_ID="your-project-id"
export GCP_ZONE="us-central1-a"
export HF_TOKEN="your-huggingface-token"

# Make the script executable
chmod +x deploy-compute-engine.sh

# Deploy to Compute Engine
./deploy-compute-engine.sh
```

**GPU Options:**

- **T4** (`nvidia-tesla-t4`): ~$0.35/hour. Good for 27B-32B models with quantization
- **V100** (`nvidia-tesla-v100`): ~$2.50/hour. Better performance
- **A100** (`nvidia-tesla-a100`): ~$3.50/hour. Best performance for large models

**Cost:** GPU instance + storage (~$0.35-3.50/hour depending on GPU type)

## Manual Deployment Steps

### 1. Build and Push the Docker Image

```bash
# Authenticate Docker with the registry
gcloud auth configure-docker

# Set the active project
gcloud config set project YOUR_PROJECT_ID

# Build the image
docker build -t gcr.io/YOUR_PROJECT_ID/router-agent:latest .

# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/router-agent:latest
```

### 2. Deploy to Cloud Run (CPU)

```bash
gcloud run deploy router-agent \
  --image gcr.io/YOUR_PROJECT_ID/router-agent:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 7860 \
  --memory 8Gi \
  --cpu 4 \
  --timeout 3600 \
  --set-env-vars "HF_TOKEN=your-token,GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860"
```
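After the deploy command returns, it is worth a quick smoke test before moving on. A minimal sketch, assuming the `router-agent` service name from the command above and that the Gradio root path answers with HTTP 200:

```bash
# Look up the URL Cloud Run assigned to the service
SERVICE_URL=$(gcloud run services describe router-agent \
  --region us-central1 \
  --format 'value(status.url)')

# Request the root page; a 200 means the container is serving
curl -s -o /dev/null -w '%{http_code}\n' "$SERVICE_URL"
```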
### 3. Deploy to Compute Engine (GPU)

```bash
# Create a VM with a T4 GPU
gcloud compute instances create router-agent-gpu \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator="type=nvidia-tesla-t4,count=1" \
  --image-family=cos-stable \
  --image-project=cos-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --scopes=https://www.googleapis.com/auth/cloud-platform

# SSH into the instance
gcloud compute ssh router-agent-gpu --zone=us-central1-a

# On the VM: Container-Optimized OS ships with Docker; install the
# NVIDIA drivers with `sudo cos-extensions install gpu`. Note that
# `--gpus all` assumes the NVIDIA container runtime; on COS you may
# need to mount the driver directories with --volume/--device instead.
docker pull gcr.io/YOUR_PROJECT_ID/router-agent:latest
docker run -d \
  --name router-agent \
  --gpus all \
  -p 7860:7860 \
  -e HF_TOKEN="your-token" \
  gcr.io/YOUR_PROJECT_ID/router-agent:latest
```

## Environment Variables

Set these in Cloud Run or as VM metadata:

- `HF_TOKEN`: Hugging Face access token (required for private models)
- `GRADIO_SERVER_NAME`: Server hostname (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `ROUTER_PREFETCH_MODELS`: Comma-separated list of models to preload
- `ROUTER_WARM_REMAINING`: Set to `"1"` to warm the remaining models

## Monitoring and Logs

### Cloud Run Logs

```bash
gcloud run services logs read router-agent --region us-central1
```

### Compute Engine Logs

```bash
gcloud compute instances get-serial-port-output router-agent-gpu --zone us-central1-a
```

## Cost Optimization

1. **Cloud Run**: scales to zero, so you only pay while the service is handling traffic
2. **Compute Engine**:
   - Use preemptible instances for up to ~80% cost savings (at the risk of termination)
   - Stop the instance when not in use: `gcloud compute instances stop router-agent-gpu --zone us-central1-a`
   - Use smaller GPUs (T4) for development and larger ones (A100) for production

## Troubleshooting

### GPU Not Available

- Check GPU quota: `gcloud compute project-info describe --project YOUR_PROJECT_ID`
- Request a quota increase if needed
- Verify the GPU drivers are installed on the Compute Engine VM

### Out of Memory

- Increase Cloud Run memory: `--memory 16Gi`
- Use a larger VM instance type
- Enable model quantization (AWQ/BitsAndBytes)

### Cold Starts (Cloud Run)

- Set `--min-instances` above zero to keep instances warm
- Pre-warm models on startup
- Consider Compute Engine for always-on workloads

## Security

1. **Authentication**: Use Cloud Run authentication or Cloud IAP for Compute Engine
2. **Secrets**: Store `HF_TOKEN` in Secret Manager (see the sketch at the end of this guide)
3. **Firewall**: Restrict access to specific IP ranges
4. **HTTPS**: Use a Cloud Load Balancer with an SSL certificate

## Next Steps

1. Set up a Cloud Load Balancer for HTTPS
2. Configure monitoring and alerts
3. Set up CI/CD with Cloud Build
4. Use Cloud Storage for model caching
5. Implement auto-scaling policies
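As a starting point for the Secret Manager item under Security, here is a minimal sketch that moves `HF_TOKEN` out of plain-text environment variables for the Cloud Run service. The secret name `hf-token` and the use of the default compute service account are assumptions; substitute your own.

```bash
# Create the secret from the token already in your environment
# (the secret name "hf-token" is an arbitrary choice)
echo -n "$HF_TOKEN" | gcloud secrets create hf-token --data-file=-

# Let the Cloud Run runtime service account read the secret
# (PROJECT_NUMBER-compute@... is the default; substitute yours)
gcloud secrets add-iam-policy-binding hf-token \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"

# Swap the plain-text env var for a secret reference
gcloud run services update router-agent \
  --region us-central1 \
  --remove-env-vars HF_TOKEN \
  --set-secrets "HF_TOKEN=hf-token:latest"
```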