Nemotron-Terminal-Corpus Processing Pipeline

Scripts used to process nvidia/Nemotron-Terminal-Corpus (366K trajectories) into prometheus04/nemotron-terminal-microagent (50K filtered + converted trajectories).

Files

File	Description
`pipeline.py`	Main production script — ran as an `hf_jobs` CPU job. Does everything: loading, filtering, converting, sampling, pushing.
`extract_tb2_instructions.py`	Standalone script to extract 89 task instructions from Terminal-Bench 2.0 for decontamination.
`test_pipeline.py`	Development/test scripts for inspecting dataset format, validating conversion logic, and debugging edge cases.

How to run

Full pipeline (as HF job)

# Submitted via hf_jobs API:
# Hardware: cpu-upgrade (8 vCPU, 32GB RAM)
# Timeout: 3 hours
# Dependencies: datasets, huggingface_hub, pyarrow
# Total runtime: ~42 minutes

Locally

pip install datasets huggingface_hub pyarrow
python pipeline.py

Pipeline stages

1. TB2.0 Decontamination Setup

Downloads 89 task instructions from harborframework/terminal-bench-2.0
Builds a set of 11,833 unique 14-word-grams for overlap checking
Paper reference: Section 4.4 of arxiv:2602.21193

2. Quality Filtering + Format Conversion (per parquet file)

Streams each of 29 parquet files and processes row-by-row:

Filters (reject if):

too_short: <3 messages in conversation
malformed_json: >50% of assistant turns have invalid Terminus-2 JSON
chinese_chars: Chinese characters in assistant content
identity_leak: Mentions DeepSeek model name or hosted_vllm provider
tb2_contaminated: 14-gram word overlap with any Terminal-Bench 2.0 instruction
too_long: >110,000 characters (≈ >32K Hunyuan-4B tokens at 3.5 chars/token)

Format conversion (Terminus-2 JSON → MicroAgent XML):

INPUT (assistant turn):
<think>
[reasoning text]
</think>
{
  "analysis": "Current state analysis...",
  "plan": "Next steps plan...",
  "commands": [
    {"keystrokes": "ls -la\n", "duration": 0.1},
    {"keystrokes": "cd project\n", "duration": 0.1}
  ],
  "task_complete": false
}

OUTPUT (assistant turn):
<thinking>
[reasoning text]
</thinking>
<bash>
ls -la
cd project
</bash>

Edge case: JSON inside <think> block Sometimes the model puts the JSON payload inside the <think> tags (model artifact). The extractor handles this by searching for {"analysis", {"plan", or {"commands" patterns throughout the entire content.

Edge case: Thinking-only turns When JSON extraction fails but <think> exists, the turn is salvaged as thinking-only (no <bash> block). Rejected only if >50% of assistant turns fail.

3. Diversity Sampling

Weighted sampling from 340K → 50K using domain and difficulty weights:

Domain	Weight	Difficulty	Weight
software_engineering	2.0x	medium	1.5x
debugging	2.0x	easy	1.0x
security	1.8x	mixed	0.8x
swe	1.8x	na (adapters)	1.2x
code	1.5x
system_administration	1.5x
data_science	1.3x
scientific_computing	1.3x
others	1.0x

4. Push to HF Hub

Pushes as standard HF Dataset with columns:

conversations: List[Dict] with {role, content} — MicroAgent XML format
task: Task identifier
source_category: Domain (code, math, swe, debugging, etc.)
difficulty: easy / medium / mixed / na
config: Original config path
est_token_count: Estimated Hunyuan-4B token count (chars / 3.5)
enable_thinking: Whether DeepSeek thinking mode was enabled

Results

Stage	Count
Input	366,154
After filtering + conversion	340,191 (92.9%)
After sampling	50,000

Filter	Removed
too_long	22,200
malformed_json	2,571
too_short	1,190
identity_leak	1
tb2_contaminated	1

Memory optimization

The pipeline writes intermediate results to JSONL on disk instead of holding all 340K rows in memory. This allows it to run on cpu-upgrade (32GB RAM) without OOM. Each parquet file is processed independently and freed after writing.

Conversations are serialized as JSON strings in the JSONL to avoid nested object overhead. They're deserialized back when building the final HF Dataset for upload.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for prometheus04/nemotron-terminal-pipeline

On Data Engineering for Scaling LLM Terminal Capabilities

Paper • 2602.21193 • Published Feb 24 • 103