ViT-S + LoRA for CIFAR-100

This model is a Vision Transformer Small (vit_small_patch16_224) adapted with LoRA for 100-class image classification on CIFAR-100.

It is intended for users who want to download a ready-to-use classifier and run inference.

Model At A Glance

Architecture: ViT-Small (vit_small_patch16_224)
Adaptation: LoRA (PEFT)
Task: image classification
Labels: 100 classes
Dataset: CIFAR-100
Best configuration: r=8, alpha=8, dropout=0.1

Model Structure After Fine-Tuning

The uploaded weights are for a ViT-S model with LoRA adapters and a trainable classification head.

Backbone: vit_small_patch16_224 (pretrained ViT-S)
LoRA adapters: injected into fused attention qkv projection layers
Classification head: output dimension updated to 100 classes
Parameter-efficient setup: base backbone mostly frozen, LoRA + head trained

Fine-tuned architecture summary

Component	Status After Fine-tuning
Patch embedding + transformer backbone	Frozen (except LoRA pathways)
Attention `qkv` projection	LoRA adapters active
MLP blocks	Frozen
Classification head	Trainable
Output classes	100 (CIFAR-100)

The uploaded model.pt is a state dict that contains both base model parameters and LoRA parameters.

Performance

Metric	Value
Top-1 Accuracy	89.74%
Best Validation Accuracy	89.14%
Trainable Parameters	185,956

Recommended Use Cases

Direct image classification inference with a ViT+LoRA model
Educational demos for Hugging Face model download and inference
Baseline testing on CIFAR-100-style natural images

Out-of-Scope Use Cases

Safety-critical decision making
Medical/legal/financial decisions
Deployment to domains very different from CIFAR-style natural images without additional validation

Limitations

Evaluated only on CIFAR-100
Accuracy may drop on real-world or shifted data distributions
Not designed as a robustness- or fairness-optimized model

Bias, Risks, and Responsible Use

Class imbalance and dataset artifacts may influence predictions
Misclassifications are expected for ambiguous or out-of-distribution inputs
Always include human review before real-world decisions

How To Use

This model repo stores:

model.pt (PyTorch state dict)
config.json (model + LoRA settings)

To use the model, you must recreate the same architecture and then load model.pt.

Install dependencies

pip install torch timm peft huggingface_hub torchvision pillow

Download from Hugging Face and load

import torch
import timm
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download


def build_model(num_classes=100, rank=8, alpha=8, lora_dropout=0.1):
  # 1) Base ViT-S
  model = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=num_classes)

  # 2) Freeze base params
  for p in model.parameters():
    p.requires_grad = False

  # 3) Keep head trainable
  for p in model.head.parameters():
    p.requires_grad = True

  # 4) Attach LoRA to fused qkv projection
  lora_cfg = LoraConfig(
    r=rank,
    lora_alpha=alpha,
    lora_dropout=lora_dropout,
    target_modules=["qkv"],
    bias="none",
    modules_to_save=["head"],
  )
  model = get_peft_model(model, lora_cfg)
  return model

repo_id = "priyadip/vit-lora-cifar100"
weights_path = hf_hub_download(repo_id=repo_id, filename="model.pt")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_model(
  num_classes=100,
  rank=8,
  alpha=8,
  lora_dropout=0.1,
).to(device)

state_dict = torch.load(weights_path, map_location=device)
model.load_state_dict(state_dict)
model.eval()

Minimal inference example

import torch
from PIL import Image
import torchvision.transforms as transforms

# Use the same normalization expected by this model
transform = transforms.Compose([
  transforms.Resize(224),
  transforms.ToTensor(),
  transforms.Normalize(mean=[0.5071, 0.4867, 0.4408],
             std=[0.2675, 0.2565, 0.2761]),
])

img = Image.open("your_image.png").convert("RGB")
x = transform(img).unsqueeze(0).to(next(model.parameters()).device)

with torch.no_grad():
  logits = model(x)
  pred = logits.argmax(dim=1).item()

print("Predicted class index:", pred)

Note: the output is a CIFAR-100 class index (0-99). Map indices to class names according to your CIFAR-100 label list.

If needed, fetch config directly from the repo:

from huggingface_hub import hf_hub_download
import json

cfg_path = hf_hub_download(repo_id="priyadip/vit-lora-cifar100", filename="config.json")
cfg = json.load(open(cfg_path))
print(cfg)

Visuals

What We Can Observe From These Visuals

Training curves: The train/validation curves improve smoothly and stay close, which suggests stable optimization and good generalization for this setup.
Class-wise accuracy: Performance is strong across most CIFAR-100 classes, indicating that adaptation is not concentrated on only a few easy categories.
LoRA gradient norms: Gradients remain active and controlled during training, showing that LoRA adapters receive useful updates without unstable spikes.

Why LoRA ?

LoRA adds small trainable low-rank adapters inside attention layers instead of updating the full backbone. This gives three practical advantages:

Parameter efficiency: far fewer trainable parameters than full fine-tuning.
Stable transfer: pretrained ViT knowledge is largely preserved while adapters specialize to CIFAR-100.
Better compute/memory trade-off: strong accuracy with lighter optimization cost.

These plots support that LoRA provides an effective balance between accuracy and training efficiency for this model.

Framework Versions

PyTorch
timm
PEFT (LoRA)

Citation

ViT paper (original architecture): https://arxiv.org/abs/2010.11929
timm library (implementation used): https://github.com/huggingface/pytorch-image-models
timm ViT-S model entry (reference checkpoint page): https://huggingface.co/timm/vit_small_patch16_224.augreg_in21k_ft_in1k

@inproceedings{dosovitskiy2021vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2021}
}

@inproceedings{hu2022lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2022}
}

@misc{mangrulkar2022peft,
  title={PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods},
  author={Mangrulkar, Sourab and Paul, Sayak and others},
  year={2022},
  howpublished={\url{https://github.com/huggingface/peft}}
}

@misc{rw2019timm,
  title={PyTorch Image Models},
  author={Wightman, Ross},
  year={2019},
  howpublished={\url{https://github.com/huggingface/pytorch-image-models}}
}

Downloads last month: 133

Model tree for priyadip/vit-lora-cifar100

Base model

timm/vit_small_patch16_224.augreg_in21k_ft_in1k

Adapter

(1)

this model

Dataset used to train priyadip/vit-lora-cifar100

Paper for priyadip/vit-lora-cifar100

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Paper • 2010.11929 • Published Oct 22, 2020 • 15

Evaluation results

Top-1 Accuracy on CIFAR-100
self-reported

89.740