Hungarian Date Converter - LoRA Adapter (mT5-base)
This repository contains LoRA adapters for a modified mT5-base model (GaborMadarasz/hut5-base), fine-tuned for Hungarian date conversion. The adapters convert written-out dates to numeric format and vice versa.
Model Details
- Base Model: GaborMadarasz/hut5-base (mT5-base)
- Adapter Type: LoRA (Low-Rank Adaptation)
- Task: Bidirectional date conversion in Hungarian
- word2date: Written date → Numeric format (e.g., "ezerkilencszáztizenegy május tizenöt" → "1911. május 15.")
- date2word: Numeric date → Written format (e.g., "1978. október 1." → "ezerkilencszázhetvennyolc október első")
- Developer: SilentSynapse
- Release Date: 2026
Model Architecture
LoRA Configuration
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v", "k", "o"],
    task_type="SEQ_2_SEQ_LM",
)
```
| Parameter | Value |
|---|---|
| Trainable Parameters | 3,538,944 (1.43% of total) |
| Total Parameters | 247,848,192 |
| Adapter Size | ~14 MB |
| Base Model Size | ~950 MB (not included) |
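The trainable-parameter figures above can be reproduced by attaching the configuration to the base model. A minimal sketch, assuming peft and transformers are installed (see Requirements below):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("GaborMadarasz/hut5-base")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q", "v", "k", "o"], task_type="SEQ_2_SEQ_LM",
)
peft_model = get_peft_model(base, config)
# Reports trainable vs. total parameters, matching the table above, e.g.:
# trainable params: 3,538,944 || all params: 247,848,192 || trainable%: 1.43
peft_model.print_trainable_parameters()
```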
Training Configuration
| Parameter | Value |
|---|---|
| Batch Size | 6 (gradient accumulation: 3, effective: 18) |
| Learning Rate | 1.5e-4 |
| Epochs | 2 |
| Max Length | 128 (input), 64 (target) |
| Precision | FP16 |
| Hardware | NVIDIA RTX 3060 12GB |
| Framework | Hugging Face Transformers + PEFT |
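For reference, the table maps onto Hugging Face Seq2SeqTrainingArguments roughly as follows. The original training script is not included, so this is an illustrative sketch; output_dir is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the table above onto Trainer arguments;
# the actual training script is not part of this repository.
training_args = Seq2SeqTrainingArguments(
    output_dir="hut5-date-converter-lora",  # placeholder path
    per_device_train_batch_size=6,
    gradient_accumulation_steps=3,          # effective batch size: 6 * 3 = 18
    learning_rate=1.5e-4,
    num_train_epochs=2,
    fp16=True,
    generation_max_length=64,               # matches the 64-token target limit
)
```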
Training Data
- Training Samples: 756,000
- Evaluation Samples: 1,000
- Language: Hungarian
- Source: Hungarian Wikipedia articles containing date patterns
- Dataset:
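Each sample pairs a task-prefixed input with its target string, matching the prompt format used for inference below. The helper here is a hypothetical illustration; the actual preprocessing code is not published:

```python
# Hypothetical illustration of how a training pair is formed;
# the published model expects this "mode: text" prompt format.
def make_example(mode, source, target):
    return {"input": f"{mode}: {source}", "target": target}

make_example("word2date", "ezerkilencszáztizenegy május tizenöt", "1911. május 15.")
# {'input': 'word2date: ezerkilencszáztizenegy május tizenöt',
#  'target': '1911. május 15.'}
```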
Evaluation Results
Metrics (64-token target limit)
| Metric | Overall | word2date | date2word |
|---|---|---|---|
| Exact Match | 75.70% | 85.69% | 65.01% |
| Word Error Rate (WER) | 1.04% | 0.61% | 1.56% |
| Char Error Rate (CER) | 0.37% | 0.20% | 0.54% |
| ROUGE-1 F1 | 0.9956 | - | - |
| ROUGE-2 F1 | 0.9949 | - | - |
| ROUGE-L F1 | 0.9956 | - | - |
Performance Notes
- word2date scores higher because numeric outputs are short and token-efficient
- date2word scores lower because long written-out Hungarian number words can exhaust the 64-token target limit
- Low WER/CER indicates that most errors are minor character-level slips rather than structural failures
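A minimal sketch of how exact match and WER/CER can be computed, assuming the jiwer package (an extra dependency, not in the Requirements below); this is not necessarily the evaluation script used for the numbers above:

```python
import jiwer

# Hypothetical predictions/references; the second prediction contains a
# single-character slip ("elsö" vs. "első") to illustrate a typo-level error
predictions = ["1911. május 15.", "ezerkilencszázhetvennyolc október elsö"]
references  = ["1911. május 15.", "ezerkilencszázhetvennyolc október első"]

exact_match = sum(p == r for p, r in zip(predictions, references)) / len(references)
wer = jiwer.wer(references, predictions)  # word error rate over the set
cer = jiwer.cer(references, predictions)  # character error rate over the set
print(f"EM: {exact_match:.2%}  WER: {wer:.2%}  CER: {cer:.2%}")
```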
Usage
Requirements
```bash
pip install transformers peft torch
```
Basic Inference
```python
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load base model and adapter
model_id = "GaborMadarasz/hut5-base"
adapter_path = "SilentSynapse/hut5-date-converter"

base_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload().to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inference function
def convert_date(text, mode="word2date"):
    """
    mode: "word2date" or "date2word"
    """
    prompt = f"{mode}: {text}"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=64,
            num_beams=4,
            early_stopping=True
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(convert_date("ezerkilencszáztizenegy május tizenöt", "word2date"))
# Output: 1911. május 15.

print(convert_date("1978. október 5.", "date2word"))
# Output: ezerkilencszázhetvennyolc október öt
```
Batch Inference
```python
def convert_batch(texts, modes):
    prompts = [f"{m}: {t}" for m, t in zip(modes, texts)]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

texts = ["ezerkilencszáztizenegy május tizenöt", "1978. október 5."]
modes = ["word2date", "date2word"]
results = convert_batch(texts, modes)
```
CPU Inference (no GPU)
```python
# Same as above, but keep tensors on the CPU and use the default float32
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload().eval()
# In convert_date, also drop the .to("cuda") call on the tokenized inputs
```
Limitations
- Output Length: The model is optimized for 64-token outputs; longer texts may be truncated.
- date2word Performance: Written-date conversion has lower accuracy (65% EM) because of complex Hungarian number words.
- Date Formats: Follows Hungarian conventions only (e.g., "1991. május 15."). Other formats (ISO 8601, US style) are not supported.
- Context Preservation: The model converts dates while preserving surrounding text, but very long passages may lose information due to the token limit.
- Language: Hungarian only; not suitable for other languages.
Edge Cases:
- Roman numerals (e.g., "XIX. század") are not converted
- Date ranges (1978–1991) may have inconsistent conversion
Intended Use Cases
- Date normalization in Hungarian text processing pipelines (see the sketch after this list)
- Information extraction from historical documents
- Text preprocessing for NLP tasks requiring standardized dates
- Digital humanities projects involving Hungarian archives
- Accessibility: converting dates to formats suitable for screen readers
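As an illustration of the first use case, a hypothetical pipeline might locate numeric dates with a regular expression and hand each match to the converter. The model itself does not detect dates (see Out-of-Scope Use Cases), so detection is handled here by a regex that is an assumption and covers only the standard "1991. május 15." pattern; convert_date is the function defined under Basic Inference above.

```python
import re

# Hypothetical sketch: locate numeric Hungarian dates of the form
# "1991. május 15." and replace each with its written-out form.
HU_MONTHS = ("január|február|március|április|május|június|"
             "július|augusztus|szeptember|október|november|december")
DATE_RE = re.compile(rf"\b\d{{4}}\. (?:{HU_MONTHS}) \d{{1,2}}\.")

def verbalize_dates(text):
    # convert_date is the function defined under Basic Inference above
    return DATE_RE.sub(lambda m: convert_date(m.group(0), "date2word"), text)

print(verbalize_dates("A szerződés 1991. május 15. napján kelt."))
```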
Out-of-Scope Use Cases
- Non-Hungarian language date conversion
- Non-Gregorian calendar systems
- Time/timestamp conversion (only dates are supported)
- Mathematical operations on dates
- Named entity recognition (the model does not identify dates, only converts them)
Future Improvements
- 256-token target support for longer texts
- Additional date format support (ISO 8601, US, EU standards)
- Improved date2word performance (more training epochs, larger dataset)
- Quantization support (INT8, INT4) for lower VRAM usage
License
MIT License
Authors
- SilentSynapse
References
- Base Model: GaborMadarasz/hut5-base
- Dataset:
- PEFT Library: Hugging Face PEFT
- mT5 Paper: Multilingual T5
- LoRA Paper: Low-Rank Adaptation
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-04 | Initial LoRA adapter release |