MedGemma 1.5 4B LoRA Adapter (Rank 8)
Optimized for Strict RAM Constraints & Fast Edge Inference (8GB VRAM)
This repository contains a Low-Rank Adaptation (LoRA) Rank-8 adapter built on top of:
google/medgemma-1.5-4b-it
The goal of this adapter is simple:
- Run advanced multimodal medical reasoning on consumer GPUs (8GB VRAM)
- Avoid Out-Of-Memory (OOM) crashes
- Maintain fast token generation speed
- Preserve medical reasoning capability
This adapter is specifically engineered for low-memory deployment environments such as:
- RTX 3060 8GB
- RTX 4060 8GB
- T4 16GB (with large headroom)
- Edge AI systems
- Research laptops
Why Rank-8?
In LoRA training, the rank (r) determines how many additional trainable parameters are added.
Higher rank = more expressive power
Lower rank = less memory + faster inference
We intentionally selected: LoRA rank (r) = 8, alpha (α) = 16
Why?
Because this adapter must operate under strict VRAM constraints.
If we increase rank to 32 or 64:
- VRAM usage increases
- Inference latency increases
- Risk of OOM spikes increases
Rank-8 keeps:
- Trainable parameters under ~1%
- Adapter size under ~50MB
- High tokens-per-second output
- Stable performance under 8GB VRAM
This is a practical engineering decision, not just a research choice.
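The rank trade-off above can be made concrete with back-of-the-envelope arithmetic: for each adapted weight matrix of shape (d_out, d_in), LoRA adds two low-rank factors, A (r × d_in) and B (d_out × r), so r · (d_in + d_out) extra parameters. The projection shape and layer count below are illustrative assumptions, not values read from the Gemma 3 4B config:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Extra trainable parameters LoRA adds to one linear layer."""
    # A is (r x d_in), B is (d_out x r)
    return r * (d_in + d_out)

# One square 2560x2560 projection at r=8 adds 40,960 parameters:
per_matrix = lora_params(2560, 2560, 8)

# Scaled to 3 target modules across 34 decoder layers (illustrative counts),
# the adapter stays in the low millions of parameters, vs. 4B in the base model:
total = per_matrix * 3 * 34
print(per_matrix, total)
```

Doubling the rank doubles this count (and the adapter file size) while leaving the frozen base model untouched, which is why r=8 is the sweet spot for an 8GB budget.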
Memory Architecture Strategy
We use QLoRA (Quantized LoRA) with 4-bit quantization.
Memory Breakdown (Approximate)
| Component | FP16 Full Model | This QLoRA Setup |
|---|---|---|
| Base Model Weights | ~8.5 GB | ~2.6 GB |
| Trainable Parameters | 4B | ~12M |
| Adapter Size | N/A | ~50 MB |
| Minimum Safe VRAM | 24GB+ | 6-8GB |
This leaves ~4–5 GB of free VRAM headroom for:
- KV Cache
- Image embeddings
- Context expansion
- Forward pass activations
This prevents runtime memory spikes.
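The KV cache is usually the largest of these runtime costs, and it can be estimated up front: per token, each decoder layer stores one key and one value vector per KV head. The layer/head/dim values below are illustrative assumptions for a ~4B Gemma-style decoder, not exact config values:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # 2 = keys + values; bytes_per_elem=2 for a float16 cache
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# 2048-token context with assumed dims (34 layers, 4 KV heads, head_dim 256):
mib = kv_cache_bytes(2048, 34, 4, 256) / 1024**2
print(f"{mib:.0f} MiB")  # 272 MiB -- comfortably inside the ~4-5 GB headroom
```

Because the cache grows linearly with sequence length, staying in the recommended 1024–2048 token range keeps this term in the hundreds of megabytes rather than gigabytes.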
Technical Specifications
| Feature | Configuration |
|---|---|
| Base Model | google/medgemma-1.5-4b-it |
| Architecture | SigLIP Vision Encoder + Gemma 3 4B Decoder |
| Adapter Method | LoRA (QLoRA) |
| LoRA Rank (r) | 8 |
| LoRA Alpha | 16 |
| Target Modules | q_proj, v_proj, o_proj |
| Quantization | 4-bit NF4 |
| Compute Dtype | float16 |
| Optimizer | paged_adamw_8bit |
| Recommended Context | 1024–2048 tokens (8GB GPU safe zone) |
Why This Model Is Fast
4-bit NF4 Quantization
Reduces memory bandwidth usage → faster weight loading → lower VRAM pressure
Low Rank (r=8)
The LoRA matrices are tiny at r=8, so the extra adapter matmuls add little overhead → faster forward pass
Frozen Base Model
Only the small adapter layers are updated during training → efficient computation
Greedy Decoding Recommendation
Using:
do_sample=False
Provides maximum inference speed and stable medical fact extraction.
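As a minimal sketch, the recommended decoding settings can be kept in one place; `do_sample=False` comes from this card, while `max_new_tokens` and `use_cache` are illustrative additions:

```python
# Recommended decoding settings for this adapter.
# do_sample=False is the card's recommendation; the other values are examples.
GREEDY_GENERATION_KWARGS = dict(
    do_sample=False,     # greedy decoding: deterministic and fastest
    max_new_tokens=256,  # cap output length to bound KV-cache growth
    use_cache=True,      # reuse cached keys/values between decode steps
)

# Usage (with a model and tokenized inputs loaded as in the Quick Start):
#   output = model.generate(**inputs, **GREEDY_GENERATION_KWARGS)
print(GREEDY_GENERATION_KWARGS)
```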
Quick Start (8GB Safe Setup)
```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig, AutoProcessor
from peft import PeftModel

model_id = "google/medgemma-1.5-4b-it"
adapter_id = "nagireddy5/medgemma-1.5-4b-lora-adapter-rank-8"

# 4-bit memory configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model in 4-bit
base_model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach Rank-8 adapter
model = PeftModel.from_pretrained(base_model, adapter_id)
processor = AutoProcessor.from_pretrained(model_id)

print("MedGemma 1.5 Rank-8 Adapter Ready (8GB Optimized)")
```
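With the model and processor loaded, inference follows the standard `transformers` multimodal chat-template flow. The helper names, prompt, and message structure below are illustrative assumptions, not part of this repository:

```python
def build_messages(question, image):
    """One user turn in the multimodal chat format used by transformers
    chat templates (hypothetical helper, structure is illustrative)."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]

def run_inference(model, processor, image, question, max_new_tokens=256):
    """Greedy (do_sample=False) single-turn inference sketch."""
    import torch
    inputs = processor.apply_chat_template(
        build_messages(question, image),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False)
    # Strip the prompt tokens; decode only the newly generated answer
    return processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True)
```

Called as `run_inference(model, processor, some_pil_image, "Describe the findings.")`, this keeps peak VRAM within the headroom described above as long as the context stays in the recommended 1024–2048 token range.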