🩺 MedLLM-10M

A lightweight GPT-2-style causal language model trained from scratch on medical literature. Designed for research and education in the medical domain.

โš ๏ธ Disclaimer. This model is for research and educational purposes only. It must not be used for medical diagnosis, treatment recommendations, or clinical decision-making.

TL;DR

Parameters:      ~27.7M (10M body, rest in embeddings)
Architecture:    GPT-2 (GPT2LMHeadModel)
Training:        From scratch, no base model
Vocabulary:      5,000 (custom medical tokenizer)
Context length:  512 tokens
Domain:          Medical / clinical / biomedical English
Hardware:        NVIDIA RTX 3060 12GB
License:         Apache 2.0

Why a tiny medical model?

Larger medical language models (PubMedBERT, BioGPT, Med-PaLM) can be expensive to run and may require infrastructure that many clinics, students, and researchers don't have. MedLLM-10M is a deliberately tiny GPT-2 trained from scratch on medical text: small enough to run on a CPU, fine-tune on a laptop, or use as a teaching artifact in an NLP course.

It is not competitive with commercial medical AI for diagnostic accuracy. It is competitive at being a small, transparent, fully open baseline for the medical domain.

Architecture

Standard GPT-2 with small dimensions:

Architecture:            GPT2LMHeadModel
Layers:                  8
Hidden size (n_embd):    512
Attention heads:         8
Feed-forward (n_inner):  2,048
Max position (n_ctx):    512
Activation:              GELU
Vocab size:              5,000
Dropout:                 0.1
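
As a rough check on model size, the parameter count implied by the dimensions above can be worked out by hand. The sketch below assumes the standard GPT-2 layout (learned position embeddings, biases on every linear layer, and an output head tied to the token embedding); those details are assumptions and are not confirmed by this card.

```python
# Parameter count implied by the listed dimensions, assuming the standard
# GPT-2 layout with a tied output head (an assumption, not read from the
# released checkpoint).
N_LAYER, N_EMBD, N_INNER = 8, 512, 2048
N_POSITIONS, VOCAB_SIZE = 512, 5000


def block_params(d: int, d_inner: int) -> int:
    """Weights plus biases in one transformer block."""
    layer_norms = 2 * (2 * d)          # ln_1 and ln_2, each with scale and shift
    attn_qkv = d * 3 * d + 3 * d       # fused query/key/value projection
    attn_out = d * d + d               # attention output projection
    mlp_in = d * d_inner + d_inner     # feed-forward expansion
    mlp_out = d_inner * d + d          # feed-forward contraction
    return layer_norms + attn_qkv + attn_out + mlp_in + mlp_out


def count_params() -> dict:
    """Break the total down into embeddings, blocks, and the final norm."""
    embeddings = VOCAB_SIZE * N_EMBD + N_POSITIONS * N_EMBD
    blocks = N_LAYER * block_params(N_EMBD, N_INNER)
    final_norm = 2 * N_EMBD
    return {
        "embeddings": embeddings,
        "blocks": blocks,
        "final_norm": final_norm,
        "total": embeddings + blocks + final_norm,
    }


counts = count_params()
for name, value in counts.items():
    print(f"{name:>11}: {value:,}")
```

Under these assumptions the total comes to 28,042,240 parameters, with about 25.2M in the transformer blocks and about 2.8M in the embeddings. The released checkpoint remains the authority on the exact figure and split.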

Training

  • Data sources:
    • PubMed abstracts and research papers
    • Medical journal articles
    • Clinical practice guidelines
    • Medical Q&A datasets
    • Healthcare reference content (e.g., Mayo Clinic, WebMD)
  • Tokenizer: custom 5,000-token vocabulary fit to the medical corpus
  • Training framework: Hugging Face Transformers (PyTorch)
  • Hardware: NVIDIA RTX 3060 12GB
  • Hyperparameters:
    • Epochs: 10
    • Batch size: 4
    • Gradient accumulation: 8
    • Learning rate: 3e-4
    • Optimizer: AdamW
    • Weight decay: 0.01
    • Warmup steps: 200
    • Mixed precision: FP16
    • Sequence stride: 256
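
To make the hyperparameters above concrete, the sketch below derives the effective batch size and shows how a stride of 256 could cut a long token stream into overlapping 512-token windows. The helper names (`sliding_windows`, `warmup_lr`) are hypothetical, the linear warmup shape is an assumption, and the card does not say how the trailing partial window or the post-warmup schedule were handled.

```python
from typing import List, Sequence

BATCH_SIZE, GRAD_ACCUM = 4, 8
CONTEXT_LEN, STRIDE = 512, 256
PEAK_LR, WARMUP_STEPS = 3e-4, 200

# Sequences (and tokens, at most) consumed per optimizer update.
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM
TOKENS_PER_STEP = EFFECTIVE_BATCH * CONTEXT_LEN


def sliding_windows(token_ids: Sequence[int],
                    window: int = CONTEXT_LEN,
                    stride: int = STRIDE) -> List[Sequence[int]]:
    """Cut token ids into full windows that start every `stride` tokens.

    A trailing window shorter than `window` is dropped (assumption).
    """
    last_start = len(token_ids) - window
    return [token_ids[s:s + window] for s in range(0, last_start + 1, stride)]


def warmup_lr(step: int) -> float:
    """Linear ramp to the peak rate, then constant (assumed shape)."""
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)


if __name__ == "__main__":
    stream = list(range(1280))  # stand-in for a tokenized document
    windows = sliding_windows(stream)
    print(f"effective batch: {EFFECTIVE_BATCH} sequences")
    print(f"tokens per optimizer step: {TOKENS_PER_STEP}")
    print(f"windows from {len(stream)} tokens: {len(windows)}")
    print(f"lr at step 100: {warmup_lr(100):.2e}")
```

With a stride of half the context length, consecutive windows share 256 tokens, so most tokens are seen twice per epoch in different positions.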

No pretrained weights. No transfer learning. The model is initialized fresh and trained end-to-end on medical text.
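
In Transformers terms, training from scratch means building the model from a config instead of calling `from_pretrained`. The sketch below assumes the dimensions listed under Architecture map onto `GPT2Config` as shown; the activation is left at the library default, and the author's actual training script is not part of this card.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Dimensions taken from the Architecture section; the mapping onto
# GPT2Config fields is an assumption about how the model was built.
config = GPT2Config(
    vocab_size=5000,
    n_positions=512,
    n_embd=512,
    n_layer=8,
    n_head=8,
    n_inner=2048,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)

# Constructing from a config gives randomly initialized weights;
# nothing is downloaded and no pretrained checkpoint is involved.
model = GPT2LMHeadModel(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} trainable parameters before any training")
```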

Usage

The model loads with the standard transformers API:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and weights from the Hugging Face Hub
model_id = "raihan-js/medllm-10m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The patient presents with chest pain and"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 120 new tokens; prompt plus output must fit the 512-token context
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=120, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Intended use

  • Research baseline for medical-domain pretraining from scratch
  • Teaching artifact for NLP / healthcare AI courses
  • Embedding/feature extraction for downstream medical text tasks (with fine-tuning)
  • Reference for how to build domain-specific tokenizers and train a small LM from scratch

Out-of-scope / hard limitations

  • โŒ Not for clinical use. No diagnosis, treatment, triage, dosing, or any patient-facing application.
  • โŒ No safety / alignment training. No RLHF, no harmlessness training.
  • โŒ Hallucinates. It will fabricate medical claims confidently. Treat all outputs as untrusted text.
  • โŒ English only. Trained exclusively on English medical literature.

Author

Akteruzzaman Raihan Sikder, AI/ML engineer and CTO at ClarioScope AI (a HIPAA-compliant healthcare practice growth platform). Portfolio · GitHub.

Citation

@misc{sikder2025medllm10m,
  title  = {MedLLM-10M: A Lightweight GPT-2-Style Language Model Trained From Scratch on Medical Literature},
  author = {Sikder, Akteruzzaman Raihan},
  year   = {2025},
  url    = {https://huggingface.co/raihan-js/medllm-10m}
}