MedGemma 1.5 4B LoRA Adapter (Rank 8)

Optimized for Strict RAM Constraints & Fast Edge Inference (8GB VRAM)

This repository contains a Low-Rank Adaptation (LoRA) Rank-8 adapter built on top of:

google/medgemma-1.5-4b-it

The goal of this adapter is simple:

  • Run advanced multimodal medical reasoning on consumer GPUs (8GB VRAM)
  • Avoid Out-Of-Memory (OOM) crashes
  • Maintain fast token-generation speed
  • Preserve medical reasoning capability

This adapter is specifically engineered for low-memory deployment environments such as:

  • RTX 3060 8GB
  • RTX 4060 8GB
  • T4 16GB (with large headroom)
  • Edge AI systems
  • Research laptops

Why Rank-8?

In LoRA training, the rank (r) determines how many additional trainable parameters are added.

Higher rank = more expressive power
Lower rank = less memory + faster inference

We intentionally selected: LoRA rank (r) = 8, alpha (α) = 16

Why?

Because we operate under strict RAM constraints.

If we increase rank to 32 or 64:

  • VRAM usage increases
  • Inference latency increases
  • Risk of OOM spikes increases

Rank-8 keeps:

  • Trainable parameters under ~1%
  • Adapter size under ~50MB
  • High tokens-per-second output
  • Stable performance under 8GB VRAM

This is a practical engineering decision, not just a research choice.
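This rank/alpha choice can be written down as a `peft` configuration. A minimal sketch, using the target modules listed in the spec table on this card (`lora_dropout` and `task_type` are assumptions, not values stated here):

```python
from peft import LoraConfig

# Rank-8 adapter: ~12M trainable parameters, ~50 MB on disk
lora_config = LoraConfig(
    r=8,                      # low rank keeps VRAM and latency down
    lora_alpha=16,            # effective scaling = alpha / r = 2.0
    target_modules=["q_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,        # assumption: a typical small dropout
    bias="none",
    task_type="CAUSAL_LM",    # assumption: causal-LM-style decoder
)
```

The `alpha / r` ratio of 2.0 keeps the adapter's contribution scaled consistently if the rank is ever changed.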


Memory Architecture Strategy

We use QLoRA (Quantized LoRA) with 4-bit quantization.

Memory Breakdown (Approximate)

Component             FP16 Full Model   This QLoRA Setup
Base model weights    ~8.5 GB           ~2.6 GB
Trainable parameters  4B                ~12M
Adapter size          N/A               ~50 MB
Minimum safe VRAM     24 GB+            6–8 GB

This leaves ~4–5GB free VRAM headroom for:

  • KV Cache
  • Image embeddings
  • Context expansion
  • Forward pass activations

This prevents runtime memory spikes.
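The figures above follow from back-of-envelope arithmetic (a sketch; real usage adds overhead for the vision encoder, CUDA context, and quantization metadata, which is why the measured numbers are somewhat higher):

```python
GIB = 1024**3

base_params = 4e9        # ~4B-parameter base model
adapter_params = 12e6    # ~12M trainable LoRA parameters

fp16_gib = base_params * 2 / GIB       # 2 bytes per param in fp16
nf4_gib = base_params * 0.5 / GIB      # 4 bits = 0.5 bytes per param
adapter_mb = adapter_params * 4 / 1e6  # fp32 adapter weights on disk

print(f"FP16 weights: ~{fp16_gib:.1f} GiB")  # ~7.5 GiB + overhead -> ~8.5 GB
print(f"NF4 weights:  ~{nf4_gib:.1f} GiB")   # ~1.9 GiB + overhead -> ~2.6 GB
print(f"Adapter:      ~{adapter_mb:.0f} MB") # ~48 MB -> "under ~50 MB"
```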


Technical Specifications

Feature               Configuration
Base Model            google/medgemma-1.5-4b-it
Architecture          SigLIP vision encoder + Gemma 3 4B decoder
Adapter Method        LoRA (QLoRA)
LoRA Rank (r)         8
LoRA Alpha            16
Target Modules        q_proj, v_proj, o_proj
Quantization          4-bit NF4
Compute Dtype         float16
Optimizer             paged_adamw_8bit
Recommended Context   1024–2048 tokens (safe zone for 8GB GPUs)

Why This Model Is Fast

4-bit NF4 Quantization

Reduces memory bandwidth usage → faster weight loading → lower VRAM pressure

Low Rank (r=8)

Smaller low-rank update matrices in the attention projections → less extra compute per forward pass

Frozen Base Model

Only small adapter layers are updated → efficient computation

Greedy Decoding Recommendation

Using:

do_sample=False

provides deterministic output, maximum inference speed, and stable extraction of medical facts.
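As a sketch, the recommended decoding settings look like this (the token budget is an assumption, not a value from this card):

```python
# Recommended decoding settings for an 8GB GPU (sketch)
generation_kwargs = {
    "do_sample": False,     # greedy decoding: pick the argmax token each step
    "max_new_tokens": 256,  # assumption: a modest budget keeps the KV cache small
    "use_cache": True,      # reuse the KV cache across decoding steps
}
# usage: outputs = model.generate(**inputs, **generation_kwargs)
```

Greedy decoding skips the sampling step entirely, so each generated token costs only a forward pass and an argmax.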


Quick Start (8GB Safe Setup)

import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig, AutoProcessor
from peft import PeftModel

model_id = "google/medgemma-1.5-4b-it"
adapter_id = "nagireddy5/medgemma-1.5-4b-lora-adapter-rank-8"

# 4-bit memory configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model in 4-bit
base_model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Attach Rank-8 Adapter
model = PeftModel.from_pretrained(base_model, adapter_id)
processor = AutoProcessor.from_pretrained(model_id)

print("MedGemma 1.5 Rank-8 Adapter Ready (8GB Optimized)")