Hungarian Date Converter - LoRA Adapter (mT5-base)

This repository contains LoRA adapters for a modified mT5-base model fine-tuned on Hungarian date conversion. The adapters convert dates written out in words to numeric format and vice versa.

Model Details

  • Base Model: GaborMadarasz/hut5-base (mT5-base)
  • Adapter Type: LoRA (Low-Rank Adaptation)
  • Task: Bidirectional date conversion in Hungarian
    • word2date: Written date → Numeric format (e.g., "ezerkilencszáztizenegy május tizenöt" → "1911. május 15.")
    • date2word: Numeric date → Written format (e.g., "1978. október 1." → "ezerkilencszázhetvennyolc október első")
  • Developer: SilentSynapse
  • Release Date: 2026

Model Architecture

LoRA Configuration

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v", "k", "o"],
    task_type="SEQ_2_SEQ_LM",
)
```
| Parameter | Value |
|---|---|
| Trainable Parameters | 3,538,944 (1.43% of total) |
| Total Parameters | 247,848,192 |
| Adapter Size | ~14 MB |
| Base Model Size | ~950 MB (not included) |
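
These counts can be reproduced by wrapping the base model with the configuration above and using PEFT's built-in counter; a minimal sketch:

```python
from peft import get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("GaborMadarasz/hut5-base")
peft_model = get_peft_model(base, lora_config)  # lora_config as defined above
peft_model.print_trainable_parameters()
# Prints roughly:
# trainable params: 3,538,944 || all params: 247,848,192 || trainable%: 1.43
```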

Training Configuration

| Parameter | Value |
|---|---|
| Batch Size | 6 (gradient accumulation: 3, effective: 18) |
| Learning Rate | 1.5e-4 |
| Epochs | 2 |
| Max Length | 128 (input), 64 (target) |
| Precision | FP16 |
| Hardware | NVIDIA RTX 3060 12GB |
| Framework | Hugging Face Transformers + PEFT |
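
For reference, the table above maps onto Hugging Face Seq2SeqTrainingArguments roughly as follows. This is a sketch, not the original training script; output_dir and predict_with_generate are illustrative assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="hut5-date-converter-lora",  # illustrative; not stated in the card
    per_device_train_batch_size=6,
    gradient_accumulation_steps=3,          # effective batch size: 6 * 3 = 18
    learning_rate=1.5e-4,
    num_train_epochs=2,
    fp16=True,
    predict_with_generate=True,             # assumption: generation-based evaluation
)
```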

Training Data

  • Training Samples: 756,000
  • Evaluation Samples: 1,000
  • Language: Hungarian
  • Source: Hungarian Wikipedia articles containing date patterns
  • Dataset:
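
The max-length settings from the training table imply a preprocessing step along these lines; the field names (input, target) are hypothetical, since the dataset schema is not published:

```python
def preprocess(batch, tokenizer):
    # "input"/"target" are hypothetical field names; the actual schema is unpublished
    model_inputs = tokenizer(batch["input"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```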

Evaluation Results

Metrics (64-token target limit)

| Metric | Overall | word2date | date2word |
|---|---|---|---|
| Exact Match | 75.70% | 85.69% | 65.01% |
| Word Error Rate (WER) | 1.04% | 0.61% | 1.56% |
| Char Error Rate (CER) | 0.37% | 0.20% | 0.54% |
| ROUGE-1 F1 | 0.9956 | - | - |
| ROUGE-2 F1 | 0.9949 | - | - |
| ROUGE-L F1 | 0.9956 | - | - |
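
Exact match, WER, and CER can be recomputed from model outputs with a few lines; a sketch assuming the jiwer package (not among this repo's requirements):

```python
import jiwer  # pip install jiwer

def exact_match(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

preds = ["1911. május 15.", "1911. május 16."]
refs  = ["1911. május 15.", "1911. május 15."]
print(f"EM:  {exact_match(preds, refs):.2%}")  # 50.00% on this toy pair
print(f"WER: {jiwer.wer(refs, preds):.2%}")    # word error rate across the lists
print(f"CER: {jiwer.cer(refs, preds):.2%}")    # character error rate across the lists
```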

Performance Notes

  • word2date scores higher because its numeric outputs are short and token-efficient
  • date2word scores lower because long Hungarian number words can exhaust the 64-token output limit
  • The low WER/CER shows that most errors are minor character-level slips, not structural failures

Usage

Requirements

```bash
pip install transformers peft torch
```

Basic Inference

```python
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load base model and adapter
model_id = "GaborMadarasz/hut5-base"
adapter_path = "SilentSynapse/hut5-date-converter"

base_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload().to("cuda").eval()

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inference function
def convert_date(text, mode="word2date"):
    """
    mode: "word2date" or "date2word"
    """
    prompt = f"{mode}: {text}"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=64,
            num_beams=4,
            early_stopping=True
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(convert_date("ezerkilencszáztizenegy május tizenöt", "word2date"))
# Output: 1911. május 15.

print(convert_date("1978. október 5", "date2word"))
# Output: ezerkilencszázhetvennyolc október öt
```

Batch Inference

```python
def convert_batch(texts, modes):
    # Pair each text with its own mode ("word2date" or "date2word")
    prompts = [f"{m}: {t}" for m, t in zip(modes, texts)]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=64, num_beams=4)

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

texts = ["ezerkilencszáztizenegy május tizenöt", "1978. október 5."]
modes = ["word2date", "date2word"]
results = convert_batch(texts, modes)
```

CPU Inference (no GPU)

```python
# Load without torch_dtype=torch.float16 and without .to("cuda");
# the default dtype on CPU is float32
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload().eval()
```

Limitations

  1. Output Length: The model is optimized for 64-token outputs. Longer texts may be truncated.

  2. date2word Performance: Written date conversion has lower accuracy (65% EM) due to complex Hungarian number words.

  3. Date Formats: Follows Hungarian conventions only (e.g., "1991. május 15."). Other formats (ISO 8601, US style) are not supported.

  4. Context Preservation: The model converts dates while preserving surrounding text. Very long passages may lose information due to the token limit.

  5. Language: Hungarian only. Not suitable for other languages.

  6. Edge Cases:

    • Roman numerals (e.g., "XIX. század") are not converted
    • Date ranges (e.g., 1978–1991) may be converted inconsistently

Intended Use Cases

  • Date normalization in Hungarian text processing pipelines
  • Information extraction from historical documents
  • Text preprocessing for NLP tasks requiring standardized dates
  • Digital humanities projects involving Hungarian archives
  • Accessibility: converting dates to formats suitable for screen readers

Out-of-Scope Use Cases

  • Non-Hungarian language date conversion
  • Non-Gregorian calendar systems
  • Time/timestamp conversion (only dates are supported)
  • Mathematical operations on dates
  • Named entity recognition (the model does not identify dates, only converts them)

Future Improvements

  • 256-token target support for longer texts
  • Additional date format support (ISO 8601, US, EU standards)
  • Improved date2word performance (more training epochs, larger dataset)
  • Quantization support (INT8, INT4) for lower VRAM usage
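
Until official quantization support lands, the base model can already be loaded in 8-bit via bitsandbytes as an experiment. This is untested with this adapter; keep the adapter unmerged, since merging into quantized weights is not straightforward:

```python
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # requires bitsandbytes
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "GaborMadarasz/hut5-base",
    quantization_config=bnb_config,
    device_map="auto",
)
# Do not call merge_and_unload() here; run inference with the adapter attached
model = PeftModel.from_pretrained(base_model, "SilentSynapse/hut5-date-converter").eval()
```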

License

MIT License

Authors

  • SilentSynapse

Version History

| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-04 | Initial LoRA adapter release |