# Documentation Chatbot with Meta Synthetic Data Kit

_Authored by: [Alan Ponnachan](https://huggingface.co/AlanPonnachan)_

This notebook demonstrates a practical approach to building a domain-specific Question & Answering chatbot. We'll focus on creating a chatbot that can answer questions about a specific piece of documentation – in this case, LangChain's documentation on Chat Models.

**Goal:** To fine-tune a small, efficient Language Model (LLM) to understand and answer questions about the LangChain Chat Models documentation.

**Approach:**
1.  **Data Acquisition:** Obtain the text content from the target LangChain documentation page.
2.  **Synthetic Data Generation:** Use Meta's `synthetic-data-kit` to automatically generate Question/Answer pairs from this documentation.
3.  **Efficient Fine-tuning:** Use Unsloth and Hugging Face TRL's `SFTTrainer` to fine-tune a Llama-3.2-3B model on the generated synthetic data.
4.  **Evaluation:** Test the fine-tuned model with specific questions about the documentation.

This method allows us to adapt an LLM to a niche domain without requiring a large, manually curated dataset.

**Hardware Used:**

This notebook was run on Google Colab (Free Tier) with an NVIDIA T4 GPU.

## 1. Setup and Installation

First, we need to install the necessary libraries. We'll use `unsloth` for efficient model handling and training, and `synthetic-data-kit` for generating our training data.

```python
%%capture
# In Colab, we skip dependency installation to avoid conflicts with preinstalled packages.
# On local machines, we include dependencies for completeness.

import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm==0.8.2
else:
    !pip install --no-deps unsloth vllm==0.8.2

# Get https://github.com/meta-llama/synthetic-data-kit
!pip install synthetic-data-kit
```

```python
%%capture
import os
if "COLAB_" in "".join(os.environ.keys()):

    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub[hf_xet] hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers|importlib_metadata)[^\n]{0,}\n", b"", f))
    !pip install -r vllm_requirements.txt
```

## 2. Synthetic Data Generation

We'll use `SyntheticDataKit` from Unsloth (which wraps Meta's `synthetic-data-kit`) to create Question/Answer pairs from our chosen documentation.

```python
>>> from unsloth.dataprep import SyntheticDataKit

>>> generator = SyntheticDataKit.from_pretrained(
...     model_name = "unsloth/Llama-3.2-3B-Instruct",
...     max_seq_length = 2048,
... )
```

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-05 15:14:48 [__init__.py:239] Automatically detected platform cuda.

```python
generator.prepare_qa_generation(
    output_folder = "data", # Output location of synthetic data
    temperature = 0.7, # Higher temp makes more diverse data
    top_p = 0.95,
    overlap = 64, # Overlap portion during chunking
    max_generation_tokens = 512, # Can increase for longer QA pairs
)
```

```python
>>> !synthetic-data-kit system-check
```

[?25l[32m VLLM server is running at [0m[4;94mhttp://localhost:8000/v1[0m
[32m⠋[0m[32m Checking VLLM server at http://localhost:8000/v1...[0m
[2KAvailable models: [1m{[0m[32m'object'[0m: [32m'list'[0m, [32m'data'[0m: [1m[[0m[1m{[0m[32m'id'[0m: 
[32m'unsloth/Llama-3.2-3B-Instruct'[0m, [32m'object'[0m: [32m'model'[0m, [32m'created'[0m: [1;36m1746459182[0m, 
[32m'owned_by'[0m: [32m'vllm'[0m, [32m'root'[0m: [32m'unsloth/Llama-3.2-3B-Instruct'[0m, [32m'parent'[0m: [3;35mNone[0m, 
[32m'max_model_len'[0m: [1;36m2048[0m, [32m'permission'[0m: [1m[[0m[1m{[0m[32m'id'[0m: 
[32m'modelperm-5296f16bbd3c425a82af4d2f84f0cbfe'[0m, [32m'object'[0m: [32m'model_permission'[0m, 
[32m'created'[0m: [1;36m1746459182[0m, [32m'allow_create_engine'[0m: [3;91mFalse[0m, [32m'allow_sampling'[0m: [3;92mTrue[0m, 
[32m'allow_logprobs'[0m: [3;92mTrue[0m, [32m'allow_search_indices'[0m: [3;91mFalse[0m, [32m'allow_view'[0m: [3;92mTrue[0m, 
[32m'allow_fine_tuning'[0m: [3;91mFalse[0m, [32m'organization'[0m: [32m'*'[0m, [32m'group'[0m: [3;35mNone[0m, [32m'is_blocking'[0m: 
[3;91mFalse[0m[1m}[0m[1m][0m[1m}[0m[1m][0m[1m}[0m
[32m⠋[0m Checking VLLM server at http://localhost:8000/v1...
[2K[32m⠋[0m Checking VLLM server at http://localhost:8000/v1...
[?25h
[1A[2K

### 2.1. Acquire and Ingest Documentation

For this example, we'll use the LangChain documentation page on [Chat Models](https://github.com/langchain-ai/langchain/blob/master/docs/docs/concepts/chat_models.mdx).

**To get the text:**
1.  Go to the raw version of the MDX file (e.g., by clicking "Raw" on GitHub).
2.  Copy the entire text content.
3.  Save it locally as a `.txt` file. For this notebook, we assume you've saved it as `/content/langchain-ai-langchain.txt`. Instead of manual copy-paste, you can also use a tool like `gitingest`.

**Note:** Ensure the text file is uploaded to your Colab environment at `/content/langchain-ai-langchain.txt` if you're running this in Colab.
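If you prefer to fetch the file programmatically, the manual steps above can be sketched with the standard library. This is only an illustration: `blob_to_raw_url` and `download_text` are hypothetical helper names, the raw URL is derived from the GitHub link above by the usual `raw.githubusercontent.com` convention, and the file may have moved upstream since this notebook was written.

```python
from urllib.request import urlopen


def blob_to_raw_url(blob_url: str) -> str:
    """Convert a github.com '/blob/' URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    owner_repo, path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{path}"


def download_text(url: str, destination: str) -> None:
    """Fetch a text file over HTTPS and save it locally as UTF-8."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    with open(destination, "w", encoding="utf-8") as handle:
        handle.write(text)


raw_url = blob_to_raw_url(
    "https://github.com/langchain-ai/langchain/blob/master/docs/docs/concepts/chat_models.mdx"
)

# Uncomment to download (requires network access):
# download_text(raw_url, "/content/langchain-ai-langchain.txt")
```

The download call is left commented out so the cell has no side effects until you opt in.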

```python
>>> # Make sure synthetic_data_kit_config.yaml points to the 'data' folder set in prepare_qa_generation
>>> !synthetic-data-kit -c synthetic_data_kit_config.yaml ingest /content/langchain-ai-langchain.txt
```

[?25l[32m⠋[0m Processing /content/langchain-ai-langchain.txt...
[?25h
[1A[2K[32m Text successfully extracted to [0m[1;32mdata/output/langchain-ai-langchain.txt[0m

### 2.2. Chunk Data and Generate QA Pairs

The ingested document will be split into smaller chunks, and then QA pairs will be generated for each chunk.

```python
>>> filenames = generator.chunk_data("data/output/langchain-ai-langchain.txt")
>>> print(f"Created {len(filenames)} chunks.")
```

Created 3 chunks.
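To build intuition for what the `overlap` setting does, here is a minimal sketch of overlapping chunking. It is not the implementation used by `chunk_data`: it splits on characters rather than tokens, and `chunk_with_overlap` and its `chunk_size` parameter are hypothetical names chosen for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size pieces where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.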

```python
import time
# Process 2 chunks for now -> can increase but slower!
for filename in filenames[:2]:
    !synthetic-data-kit \
        -c synthetic_data_kit_config.yaml \
        create {filename} \
        --num-pairs 25 \
        --type "qa"
    time.sleep(2) # Sleep some time to leave some room for processing
```

### 2.3. Format and Save QA Pairs

The generated QA pairs are then converted into a format suitable for fine-tuning.

```python
>>> qa_pairs_filenames = [
...     f"data/generated/langchain-ai-langchain_{i}_qa_pairs.json"
...     for i in range(len(filenames[:2]))
... ]
>>> for filename in qa_pairs_filenames:
...     !synthetic-data-kit \
...         -c synthetic_data_kit_config.yaml \
...         save-as {filename} -f ft
```

[?25l[32m⠋[0m Converting data/generated/langchain-ai-langchain_0_qa_pairs.json to ft format 
with json storage...
[?25h
[1A[2K[1A[2K[32m Converted to ft format and saved to [0m
[1;32mdata/final/langchain-ai-langchain_0_qa_pairs_ft.json[0m
[?25l[32m⠋[0m Converting data/generated/langchain-ai-langchain_1_qa_pairs.json to ft format 
with json storage...
[1A[2K[1A[2K[32m Converted to ft format and saved to [0m
[1;32mdata/final/langchain-ai-langchain_1_qa_pairs_ft.json[0m

```python
>>> generator.cleanup()
```

Attempting to terminate the VLLM server gracefully...
Server did not terminate gracefully after 10 seconds. Forcing kill...
Server killed forcefully.

### 2.4. Load the Formatted Dataset

Now, let's load the generated and formatted data.

```python
from datasets import Dataset
import pandas as pd
final_filenames = [
    f"data/final/langchain-ai-langchain_{i}_qa_pairs_ft.json"
    for i in range(len(filenames[:2]))
]
conversations = pd.concat([
    pd.read_json(name) for name in final_filenames
]).reset_index(drop = True)

dataset = Dataset.from_pandas(conversations)
```

```python
dataset[0]
```

```python
dataset[-1]
```
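Synthetic data can contain malformed records, so a quick structural check before training is worthwhile. The sketch below assumes each row has a `messages` field holding a list of `{"role", "content"}` dictionaries, which is what the chat-template step later in this notebook expects; `validate_conversation` and the sample records are hypothetical.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_conversation(messages) -> list[str]:
    """Return a list of problems found in one conversation (empty list means it looks fine)."""
    if not isinstance(messages, list) or not messages:
        return ["conversation is empty or not a list"]

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not a dict")
            continue
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {index} has unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index} has empty content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


# Hypothetical records for illustration only
good = [
    {"role": "user", "content": "What is a chat model?"},
    {"role": "assistant", "content": "A model that takes messages as input."},
]
bad = [{"role": "user", "content": ""}]

print(validate_conversation(good))
print(validate_conversation(bad))
```

On the real dataset you could collect the indices of failing rows with `[i for i, row in enumerate(dataset) if validate_conversation(row["messages"])]` and inspect or drop them.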

### Memory Management Note (Critical for Resource-Constrained Environments)

If you encounter CUDA Out-of-Memory (OOM) errors when trying to load the Llama model for fine-tuning in the next steps (even after `generator.cleanup()`), it means the GPU memory wasn't fully released. This is common in environments like Google Colab's free tier.

**Workaround Strategy:**
1.  **Archive the generated data:** After the `generator.cleanup()` cell, zip the `/content/data` folder and download the archive to your local machine.
2.  **Restart the Colab Runtime:** Go to "Runtime" -> "Restart runtime...". This completely clears GPU memory.
3.  **Re-run Installations & Imports:** Execute the initial installation cells and necessary import cells again.
4.  **Restore Data:** Upload the zip archive and unzip it.
5.  **Load Dataset from Restored Files:** Load the JSON files from the unzipped `data/final/` directory.
6.  **Proceed to model loading and fine-tuning.**

The cell below contains the zip, unzip, and data-loading commands, all commented out. If you restart the runtime, uncomment and run the unzip and data-loading lines manually.

```python
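# !zip -r data.zip /content/data
# !unzip data.zip  # recreates the folder as content/data/ under the current directory

# import os
# import pandas as pd
# from datasets import Dataset

# # Path to the restored folder containing the JSON files (relative to where you unzipped)
# folder_path = 'content/data/final/'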

# # List all .json files in the folder
# final_filenames = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.json')]

# # Read and combine the JSON files
# conversations = pd.concat([
#     pd.read_json(name) for name in final_filenames
# ]).reset_index(drop=True)

# # Convert to Hugging Face Dataset
# dataset = Dataset.from_pandas(conversations)
```
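The same archive-and-restore round trip can be done from Python instead of shell commands. The sketch below uses only the standard library and demonstrates the idea in a temporary directory; `archive_folder` and `restore_folder` are hypothetical helper names, and in Colab you would point them at `/content/data` instead.

```python
import shutil
import tempfile
from pathlib import Path


def archive_folder(folder: str, archive_base: str) -> str:
    """Zip `folder` into `<archive_base>.zip` and return the archive path."""
    folder_path = Path(folder)
    return shutil.make_archive(
        archive_base,
        "zip",
        root_dir=str(folder_path.parent),  # paths inside the zip are relative to this
        base_dir=folder_path.name,         # only this folder is archived
    )


def restore_folder(archive_path: str, destination: str) -> None:
    """Unpack a zip archive created by archive_folder into `destination`."""
    shutil.unpack_archive(archive_path, destination, "zip")


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as workdir:
    final_dir = Path(workdir) / "data" / "final"
    final_dir.mkdir(parents=True)
    (final_dir / "example_ft.json").write_text("[]", encoding="utf-8")

    archive = archive_folder(str(Path(workdir) / "data"), str(Path(workdir) / "data_backup"))
    restored_root = Path(workdir) / "restored"
    restore_folder(archive, str(restored_root))
    restored_files = sorted(p.name for p in (restored_root / "data" / "final").iterdir())

print(restored_files)
```

Archiving with an explicit `root_dir` and `base_dir` keeps the paths inside the zip short (`data/final/...`), which makes the restored location easier to predict.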

## 3. Fine-tuning the LLM with Unsloth

Now, we'll load our base model using Unsloth for 4-bit quantization and then fine-tune it on our synthetically generated dataset.

### 3.1. Load Base Model and Tokenizer

We'll use `Llama-3.2-3B-Instruct` in 4-bit precision. Unsloth makes this very memory-efficient.

```python
>>> from unsloth import FastLanguageModel
>>> import torch

>>> model, tokenizer = FastLanguageModel.from_pretrained(
...     model_name = "unsloth/Llama-3.2-3B-Instruct",
...     max_seq_length = 1024, # Choose any for long context!
...     load_in_4bit = True,  # 4 bit quantization to reduce memory
...     load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
...     full_finetuning = False, # [NEW!] We have full finetuning now!
... )
```

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-05 15:54:31 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

### 3.2. Add LoRA Adapters

We use LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
```
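To see why LoRA is parameter-efficient, it helps to count the weights it adds. For a linear layer of shape `d_in x d_out`, LoRA trains two small matrices of shapes `d_in x r` and `r x d_out`, so it adds `r * (d_in + d_out)` parameters. The layer shapes below are assumptions based on the publicly listed Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, key/value projection width 1024, 28 layers); check them against `model.config` before relying on the numbers.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (d_in x r), B is (r x d_out)."""
    return rank * (d_in + d_out)


# Assumed Llama-3.2-3B shapes (verify with model.config)
HIDDEN = 3072        # hidden_size
INTERMEDIATE = 8192  # intermediate_size of the MLP
KV_DIM = 1024        # num_key_value_heads * head_dim
NUM_LAYERS = 28
RANK = 16            # matches r = 16 in get_peft_model above

# (d_in, d_out) for every module listed in target_modules
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_param_count(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters hold roughly 24 million trainable parameters, a small fraction of the 3 billion weights in the base model.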

### 3.3. Data Preparation for Chat Format

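We need to format our dataset into the chat template expected by the Llama-3.2 model, which wraps each turn in special header and end-of-turn tokens:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 May 2025

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

2<|eot_id|>
```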

```python
def formatting_prompts_func(examples):
    # Render each conversation into one training string using the model's chat template
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
        for convo in convos
    ]
    return { "text" : texts, }

# Get our previous dataset and format it:
dataset = dataset.map(formatting_prompts_func, batched = True,)
```
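To make the template less of a black box, here is a simplified, tokenizer-free sketch of how a Llama 3 style conversation is rendered with the `<|start_header_id|>`, `<|end_header_id|>` and `<|eot_id|>` special tokens. It is an approximation for illustration: the real template applied by `tokenizer.apply_chat_template` also injects a default system header with date lines and handles tool calls, and `render_llama3_chat` is a hypothetical name.

```python
def render_llama3_chat(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Render messages in a simplified Llama 3 chat format (no default system header)."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example = [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2"},
]
rendered = render_llama3_chat(example)
print(rendered)
```

For training we render complete conversations (`add_generation_prompt = False`); at inference time the generation prompt is added so the model knows it is the assistant's turn.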

```python
dataset[0]
```

### 3.4. Train the Model

We'll fine-tune the model with `SFTTrainer` from Hugging Face TRL, a class designed for supervised fine-tuning (SFT). Training parameters are set through `SFTConfig`: batch size, number of training steps, learning rate schedule, and optimizer. Gradient accumulation and the 8-bit `adamw_8bit` optimizer keep memory use low enough for limited hardware such as the free Colab T4.

```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)
```

```python
trainer_stats = trainer.train()
```
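The training arguments above interact in ways that are easy to work out by hand. The sketch below computes the effective batch size and the shape of a linear learning-rate schedule with warmup, using the values from `SFTConfig`. The helper names are hypothetical, and `linear_lr` is intended to mirror the usual linear-with-warmup shape rather than reproduce the trainer's internals.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device * grad_accum * num_devices


def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from the SFTConfig above
batch = effective_batch_size(per_device=2, grad_accum=4)
examples_seen = batch * 60  # max_steps = 60

print(f"Effective batch size: {batch}")
print(f"Examples processed over training: {examples_seen}")
for step in (0, 5, 30, 60):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60))
```

Because the synthetic dataset is small, 60 steps at this batch size will pass over it several times, so it is worth watching the training loss for signs of overfitting.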

## 4. Inference and Testing

Let's test our fine-tuned model with some questions related to the LangChain Chat Models documentation.

```python
>>> messages = [
...     {"role": "user", "content": "What is the standard interface for binding tools to models?"},
... ]
>>> inputs = tokenizer.apply_chat_template(
...     messages,
...     tokenize = True,
...     add_generation_prompt = True, # Must add for generation
...     return_tensors = "pt",
... ).to("cuda")

>>> from transformers import TextStreamer
>>> text_streamer = TextStreamer(tokenizer, skip_prompt = True)
>>> _ = model.generate(input_ids = inputs, streamer = text_streamer,
...                    max_new_tokens = 256, temperature = 0.1)
```

Standard [tool calling API](/docs/concepts/tool_calling): standard interface for binding tools to models.

## 5. Conclusion

We have successfully:
1.  Acquired documentation text for [LangChain Chat Models](https://github.com/langchain-ai/langchain/blob/master/docs/docs/concepts/chat_models.mdx).
2.  Generated synthetic Question/Answer pairs using [`synthetic-data-kit`](https://github.com/meta-llama/synthetic-data-kit).
3.  Fine-tuned a Llama-3.2-3B model efficiently using Unsloth and [Hugging Face's TRL SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer).
4.  Tested the model's ability to answer questions specific to the documentation.

This notebook provides a template for creating specialized chatbots for various documentation or domain-specific texts. The use of synthetic data generation and efficient fine-tuning techniques makes this approach accessible even with limited resources.

**Further improvements could include:**
*   Using a larger portion of the documentation or multiple related pages.
*   More sophisticated curation of synthetic QA pairs.
*   Experimenting with different base models or hyperparameter tuning.
*   Implementing a more robust evaluation framework (e.g., comparing against a held-out set of questions or using metrics like ROUGE, BLEU if applicable, or LLM-as-a-judge).
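
As a starting point for the evaluation idea above, here is a minimal sketch of a token-overlap F1 score between a model answer and a reference answer from a held-out set. It is a simple stand-in, not ROUGE or BLEU, and `tokenize` and `token_f1` are hypothetical helper names.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)

    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat", "the dog"))
print(token_f1("Tool calling API", "tool calling api"))
```

Averaging this score over a set of held-out questions gives a rough, cheap signal; an LLM-as-a-judge setup would capture paraphrased but correct answers that token overlap misses.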
