YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Nemotron-Terminal-Corpus Processing Pipeline
Scripts used to process nvidia/Nemotron-Terminal-Corpus (366K trajectories) into prometheus04/nemotron-terminal-microagent (50K filtered + converted trajectories).
Files
| File | Description |
|---|---|
pipeline.py |
Main production script โ ran as an hf_jobs CPU job. Does everything: loading, filtering, converting, sampling, pushing. |
extract_tb2_instructions.py |
Standalone script to extract 89 task instructions from Terminal-Bench 2.0 for decontamination. |
test_pipeline.py |
Development/test scripts for inspecting dataset format, validating conversion logic, and debugging edge cases. |
How to run
Full pipeline (as HF job)
# Submitted via hf_jobs API:
# Hardware: cpu-upgrade (8 vCPU, 32GB RAM)
# Timeout: 3 hours
# Dependencies: datasets, huggingface_hub, pyarrow
# Total runtime: ~42 minutes
Locally
pip install datasets huggingface_hub pyarrow
python pipeline.py
Pipeline stages
1. TB2.0 Decontamination Setup
- Downloads 89 task instructions from
harborframework/terminal-bench-2.0 - Builds a set of 11,833 unique 14-word-grams for overlap checking
- Paper reference: Section 4.4 of arxiv:2602.21193
2. Quality Filtering + Format Conversion (per parquet file)
Streams each of 29 parquet files and processes row-by-row:
Filters (reject if):
too_short: <3 messages in conversationmalformed_json: >50% of assistant turns have invalid Terminus-2 JSONchinese_chars: Chinese characters in assistant contentidentity_leak: Mentions DeepSeek model name or hosted_vllm providertb2_contaminated: 14-gram word overlap with any Terminal-Bench 2.0 instructiontoo_long: >110,000 characters (โ >32K Hunyuan-4B tokens at 3.5 chars/token)
Format conversion (Terminus-2 JSON โ MicroAgent XML):
INPUT (assistant turn):
<think>
[reasoning text]
</think>
{
"analysis": "Current state analysis...",
"plan": "Next steps plan...",
"commands": [
{"keystrokes": "ls -la\n", "duration": 0.1},
{"keystrokes": "cd project\n", "duration": 0.1}
],
"task_complete": false
}
OUTPUT (assistant turn):
<thinking>
[reasoning text]
</thinking>
<bash>
ls -la
cd project
</bash>
Edge case: JSON inside <think> block
Sometimes the model puts the JSON payload inside the <think> tags (model artifact). The extractor handles this by searching for {"analysis", {"plan", or {"commands" patterns throughout the entire content.
Edge case: Thinking-only turns
When JSON extraction fails but <think> exists, the turn is salvaged as thinking-only (no <bash> block). Rejected only if >50% of assistant turns fail.
3. Diversity Sampling
Weighted sampling from 340K โ 50K using domain and difficulty weights:
| Domain | Weight | Difficulty | Weight |
|---|---|---|---|
| software_engineering | 2.0x | medium | 1.5x |
| debugging | 2.0x | easy | 1.0x |
| security | 1.8x | mixed | 0.8x |
| swe | 1.8x | na (adapters) | 1.2x |
| code | 1.5x | ||
| system_administration | 1.5x | ||
| data_science | 1.3x | ||
| scientific_computing | 1.3x | ||
| others | 1.0x |
4. Push to HF Hub
Pushes as standard HF Dataset with columns:
conversations: List[Dict] with{role, content}โ MicroAgent XML formattask: Task identifiersource_category: Domain (code, math, swe, debugging, etc.)difficulty: easy / medium / mixed / naconfig: Original config pathest_token_count: Estimated Hunyuan-4B token count (chars / 3.5)enable_thinking: Whether DeepSeek thinking mode was enabled
Results
| Stage | Count |
|---|---|
| Input | 366,154 |
| After filtering + conversion | 340,191 (92.9%) |
| After sampling | 50,000 |
| Filter | Removed |
|---|---|
| too_long | 22,200 |
| malformed_json | 2,571 |
| too_short | 1,190 |
| identity_leak | 1 |
| tb2_contaminated | 1 |
Memory optimization
The pipeline writes intermediate results to JSONL on disk instead of holding all 340K rows in memory. This allows it to run on cpu-upgrade (32GB RAM) without OOM. Each parquet file is processed independently and freed after writing.
Conversations are serialized as JSON strings in the JSONL to avoid nested object overhead. They're deserialized back when building the final HF Dataset for upload.