MedGemma 1.5 4B LoRA Adapter (Rank 8)
Optimized for Strict RAM Constraints & Fast Edge Inference (8GB VRAM)
This repository contains a Low-Rank Adaptation (LoRA) Rank-8 adapter built on top of:
google/medgemma-1.5-4b-it
The goal of this adapter is simple:
- Run advanced multimodal medical reasoning on consumer GPUs (8GB VRAM)
- Avoid Out-Of-Memory (OOM) crashes
- Maintain fast token generation speed
- Preserve medical reasoning capability
This adapter is specifically engineered for low-memory deployment environments such as:
- RTX 3060 8GB
- RTX 4060 8GB
- T4 16GB (with large headroom)
- Edge AI systems
- Research laptops
Why Rank-8?
In LoRA training, the rank (r) determines how many additional trainable parameters are added.
Higher rank = more expressive power
Lower rank = less memory + faster inference
We intentionally selected: LoRA rank (r) = 8, alpha (α) = 16
Why?
Because this adapter must operate under strict VRAM constraints.
If we increase rank to 32 or 64:
- VRAM usage increases
- Inference latency increases
- Risk of OOM spikes increases
Rank-8 keeps:
- Trainable parameters under ~1%
- Adapter size under ~50MB
- High tokens-per-second output
- Stable performance under 8GB VRAM
This is a practical engineering decision, not just a research choice.
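The rank trade-off above can be made concrete with back-of-the-envelope arithmetic: for each adapted weight matrix of shape (d_out, d_in), LoRA adds two low-rank factors, A (r × d_in) and B (d_out × r), so r · (d_in + d_out) extra parameters. The projection shape and layer count below are illustrative assumptions, not values read from the Gemma 3 4B config:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Extra trainable parameters LoRA adds to one linear layer."""
    # A is (r x d_in), B is (d_out x r)
    return r * (d_in + d_out)

# One square 2560x2560 projection at r=8 adds 40,960 parameters:
per_matrix = lora_params(2560, 2560, 8)

# Scaled to 3 target modules across 34 decoder layers (illustrative counts),
# the adapter stays in the low millions of parameters, vs. 4B in the base model:
total = per_matrix * 3 * 34
print(per_matrix, total)
```

Doubling the rank doubles this count (and the adapter file size) while leaving the frozen base model untouched, which is why r=8 is the sweet spot for an 8GB budget.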
Memory Architecture Strategy
We use QLoRA (Quantized LoRA) with 4-bit quantization.
Memory Breakdown (Approximate)
| Component | FP16 Full Model | This QLoRA Setup |
|---|---|---|
| Base Model Weights | ~8.5 GB | ~2.6 GB |
| Trainable Parameters | 4B | ~12M |
| Adapter Size | N/A | ~50 MB |
| Minimum Safe VRAM | 24GB+ | 6-8GB |
This leaves ~4–5 GB of free VRAM headroom for:
- KV Cache
- Image embeddings
- Context expansion
- Forward pass activations
This prevents runtime memory spikes.
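The KV cache is usually the largest of these runtime costs, and it can be estimated up front: per token, each decoder layer stores one key and one value vector per KV head. The layer/head/dim values below are illustrative assumptions for a ~4B Gemma-style decoder, not exact config values:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # 2 = keys + values; bytes_per_elem=2 for a float16 cache
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# 2048-token context with assumed dims (34 layers, 4 KV heads, head_dim 256):
mib = kv_cache_bytes(2048, 34, 4, 256) / 1024**2
print(f"{mib:.0f} MiB")  # 272 MiB -- comfortably inside the ~4-5 GB headroom
```

Because the cache grows linearly with sequence length, staying in the recommended 1024–2048 token range keeps this term in the hundreds of megabytes rather than gigabytes.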
Technical Specifications
| Feature | Configuration |
|---|---|
| Base Model | google/medgemma-1.5-4b-it |
| Architecture | SigLIP Vision Encoder + Gemma 3 4B Decoder |
| Adapter Method | LoRA (QLoRA) |
| LoRA Rank (r) | 8 |
| LoRA Alpha | 16 |
| Target Modules | q_proj, v_proj, o_proj |
| Quantization | 4-bit NF4 |
| Compute Dtype | float16 |
| Optimizer | paged_adamw_8bit |
| Recommended Context | 1024–2048 tokens (8GB GPU safe zone) |
Why This Model Is Fast
4-bit NF4 Quantization
Reduces memory bandwidth usage → faster weight loading → lower VRAM pressure
Low Rank (r=8)
The LoRA matrices are tiny at r=8, so the extra adapter matmuls add little overhead → faster forward pass
Frozen Base Model
Only the small adapter layers are updated during training → efficient computation
Greedy Decoding Recommendation
Using:
do_sample=False
Provides maximum inference speed and stable medical fact extraction.
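As a minimal sketch, the recommended decoding settings can be kept in one place; `do_sample=False` comes from this card, while `max_new_tokens` and `use_cache` are illustrative additions:

```python
# Recommended decoding settings for this adapter.
# do_sample=False is the card's recommendation; the other values are examples.
GREEDY_GENERATION_KWARGS = dict(
    do_sample=False,     # greedy decoding: deterministic and fastest
    max_new_tokens=256,  # cap output length to bound KV-cache growth
    use_cache=True,      # reuse cached keys/values between decode steps
)

# Usage (with a model and tokenized inputs loaded as in the Quick Start):
#   output = model.generate(**inputs, **GREEDY_GENERATION_KWARGS)
print(GREEDY_GENERATION_KWARGS)
```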
Quick Start (8GB Safe Setup)
```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig, AutoProcessor
from peft import PeftModel

model_id = "google/medgemma-1.5-4b-it"
adapter_id = "nagireddy5/medgemma-1.5-4b-lora-adapter-rank-8"

# 4-bit memory configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model in 4-bit
base_model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach Rank-8 adapter
model = PeftModel.from_pretrained(base_model, adapter_id)
processor = AutoProcessor.from_pretrained(model_id)

print("MedGemma 1.5 Rank-8 Adapter Ready (8GB Optimized)")
```
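With the model and processor loaded, inference follows the standard `transformers` multimodal chat-template flow. The helper names, prompt, and message structure below are illustrative assumptions, not part of this repository:

```python
def build_messages(question, image):
    """One user turn in the multimodal chat format used by transformers
    chat templates (hypothetical helper, structure is illustrative)."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]

def run_inference(model, processor, image, question, max_new_tokens=256):
    """Greedy (do_sample=False) single-turn inference sketch."""
    import torch
    inputs = processor.apply_chat_template(
        build_messages(question, image),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False)
    # Strip the prompt tokens; decode only the newly generated answer
    return processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True)
```

Called as `run_inference(model, processor, some_pil_image, "Describe the findings.")`, this keeps peak VRAM within the headroom described above as long as the context stays in the recommended 1024–2048 token range.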