🩺 MedLLM-10M

A lightweight GPT-2-style causal language model trained from scratch on medical literature. Designed for research and education in the medical domain.

⚠️ Disclaimer. This model is for research and educational purposes only. It must not be used for medical diagnosis, treatment recommendations, or clinical decision-making.

TL;DR


Parameters	~27.7M (mostly in the transformer blocks; the embedding tables account for under 3M)
Architecture	GPT-2 (`GPT2LMHeadModel`)
Training	From scratch — no base model
Vocabulary	5,000 (custom medical tokenizer)
Context length	512 tokens
Domain	Medical / clinical / biomedical English
Hardware	NVIDIA RTX 3060 12GB
License	Apache 2.0

Why a tiny medical model?

Larger medical language models (PubMedBERT, BioGPT, Med-PaLM) are more expensive to run and can require infrastructure that most clinics, students, and researchers don't have. MedLLM-10M is a deliberately tiny, from-scratch GPT-2 trained on medical text: small enough to run on a CPU, fine-tune on a laptop, or use as a teaching artifact in an NLP course.

It is not competitive with commercial medical AI for diagnostic accuracy. It is competitive at being a small, transparent, fully open baseline for the medical domain.

Architecture

Standard GPT-2 with small dimensions:

Architecture:           GPT2LMHeadModel
Layers:                 8
Hidden size (n_embd):   512
Attention heads:        8
Feed-forward (n_inner): 2,048
Max position (n_ctx):   512
Activation:             GELU
Vocab size:             5,000
Dropout:                0.1

Training

Data sources:
- PubMed abstracts and research papers
- Medical journal articles
- Clinical practice guidelines
- Medical Q&A datasets
- Healthcare reference content (e.g., Mayo Clinic, WebMD)

Tokenizer: custom 5,000-token vocabulary fit to the medical corpus
Training framework: Hugging Face Transformers (PyTorch)
Hardware: NVIDIA RTX 3060 12GB

Hyperparameters:


Epochs	10
Batch size	4
Gradient accumulation	8
Learning rate	3e-4
Optimizer	AdamW
Weight decay	0.01
Warmup steps	200
Mixed precision	FP16
Sequence stride	256
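
The batch settings and the sequence stride can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes "stride" means the step between the starts of consecutive 512-token windows (so neighbouring windows share 256 tokens), and `sliding_windows` is a hypothetical helper, not code from the training pipeline. The tokens-per-update figure is an upper bound, since it counts any padding as tokens.

```python
# Illustrative arithmetic based on the listed hyperparameters.

BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 8
CONTEXT_LENGTH = 512
STRIDE = 256

# Sequences and tokens consumed per optimizer update.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS
tokens_per_update = effective_batch * CONTEXT_LENGTH


def sliding_windows(token_ids, window=CONTEXT_LENGTH, stride=STRIDE):
    """Cut a token sequence into overlapping fixed-length windows.

    A document shorter than one window is returned as a single short chunk.
    Tokens past the last full window are dropped in this sketch.
    """
    if not token_ids:
        return []
    last_start = max(len(token_ids) - window, 0)
    return [token_ids[s:s + window] for s in range(0, last_start + 1, stride)]


doc = list(range(1024))                  # stand-in for 1,024 token ids
chunks = sliding_windows(doc)
print(effective_batch)                   # 32
print(tokens_per_update)                 # 16384
print(len(chunks), chunks[1][0])         # 3 256
```

With a per-device batch of 4 and 8 accumulation steps, each optimizer update sees 32 sequences, which is how a 12GB GPU can train with a larger effective batch than fits in memory at once.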

No pretrained weights. No transfer learning. The model is initialized fresh and trained end-to-end on medical text.

Usage

The model loads with the standard Hugging Face `transformers` API:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the tokenizer and weights from the Hugging Face Hub.
model_id = "raihan-js/medllm-10m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The patient presents with chest pain and"
inputs = tokenizer(prompt, return_tensors="pt")  # PyTorch tensors

# Sample a continuation; prompt plus new tokens must stay within the 512-token context.
output = model.generate(**inputs, max_new_tokens=120, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
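
Because GPT-2 uses a fixed table of learned position embeddings, the prompt and the generated tokens together have to fit within the 512-token context. The following is a minimal sketch of one way to enforce that on a list of token ids before generation. `truncate_prompt` is a hypothetical helper, not part of the model's repository, and keeping the tail of the prompt is an assumption about what is most useful for continuation.

```python
# Hypothetical helper for keeping prompt + generation inside the context window.

CONTEXT_LENGTH = 512  # maximum positions listed in the architecture section


def truncate_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent prompt tokens so prompt + generation fits the context."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens
    # Negative slicing keeps the tail of the prompt, the text nearest the
    # point where generation continues.
    return token_ids[-budget:]


prompt_ids = list(range(600))            # stand-in for an over-long tokenized prompt
kept = truncate_prompt(prompt_ids, max_new_tokens=120)
print(len(kept), kept[0])                # 392 208
```

With `max_new_tokens=120`, as in the usage example, the prompt budget is 392 tokens. Prompts already within the budget are returned unchanged.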

Intended use

- Research baseline for medical-domain pretraining from scratch
- Teaching artifact for NLP / healthcare AI courses
- Embedding/feature extraction for downstream medical text tasks (with fine-tuning)
- Reference for how to build domain-specific tokenizers and train a small LM from scratch

Out-of-scope / hard limitations

❌ Not for clinical use. No diagnosis, treatment, triage, dosing, or any patient-facing application.
❌ No safety / alignment training. No RLHF, no harmlessness training.
❌ Hallucinates. It will fabricate medical claims confidently. Treat all outputs as untrusted text.
❌ English only. Trained exclusively on English medical literature.

Related models

raihan-js/orch-fusion and the rest of the ORCH series: sibling from-scratch small language models (SLMs) in the code-generation domain

Author

Akteruzzaman Raihan Sikder, AI/ML engineer and CTO at ClarioScope AI (a HIPAA-compliant healthcare practice growth platform).

Citation

@misc{sikder2025medllm10m,
  title  = {MedLLM-10M: A Lightweight GPT-2-Style Language Model Trained From Scratch on Medical Literature},
  author = {Sikder, Akteruzzaman Raihan},
  year   = {2025},
  url    = {https://huggingface.co/raihan-js/medllm-10m}
}

Model size (Safetensors): 28M params, tensor type F32