YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

Nemotron-Terminal-Corpus Processing Pipeline

Scripts used to process nvidia/Nemotron-Terminal-Corpus (366K trajectories) into prometheus04/nemotron-terminal-microagent (50K filtered + converted trajectories).

Files

File Description
pipeline.py Main production script โ€” ran as an hf_jobs CPU job. Does everything: loading, filtering, converting, sampling, pushing.
extract_tb2_instructions.py Standalone script to extract 89 task instructions from Terminal-Bench 2.0 for decontamination.
test_pipeline.py Development/test scripts for inspecting dataset format, validating conversion logic, and debugging edge cases.

How to run

Full pipeline (as HF job)

# Submitted via hf_jobs API:
# Hardware: cpu-upgrade (8 vCPU, 32GB RAM)
# Timeout: 3 hours
# Dependencies: datasets, huggingface_hub, pyarrow
# Total runtime: ~42 minutes

Locally

pip install datasets huggingface_hub pyarrow
python pipeline.py

Pipeline stages

1. TB2.0 Decontamination Setup

  • Downloads 89 task instructions from harborframework/terminal-bench-2.0
  • Builds a set of 11,833 unique 14-word-grams for overlap checking
  • Paper reference: Section 4.4 of arxiv:2602.21193

2. Quality Filtering + Format Conversion (per parquet file)

Streams each of 29 parquet files and processes row-by-row:

Filters (reject if):

  • too_short: <3 messages in conversation
  • malformed_json: >50% of assistant turns have invalid Terminus-2 JSON
  • chinese_chars: Chinese characters in assistant content
  • identity_leak: Mentions DeepSeek model name or hosted_vllm provider
  • tb2_contaminated: 14-gram word overlap with any Terminal-Bench 2.0 instruction
  • too_long: >110,000 characters (โ‰ˆ >32K Hunyuan-4B tokens at 3.5 chars/token)

Format conversion (Terminus-2 JSON โ†’ MicroAgent XML):

INPUT (assistant turn):
<think>
[reasoning text]
</think>
{
  "analysis": "Current state analysis...",
  "plan": "Next steps plan...",
  "commands": [
    {"keystrokes": "ls -la\n", "duration": 0.1},
    {"keystrokes": "cd project\n", "duration": 0.1}
  ],
  "task_complete": false
}

OUTPUT (assistant turn):
<thinking>
[reasoning text]
</thinking>
<bash>
ls -la
cd project
</bash>

Edge case: JSON inside <think> block Sometimes the model puts the JSON payload inside the <think> tags (model artifact). The extractor handles this by searching for {"analysis", {"plan", or {"commands" patterns throughout the entire content.

Edge case: Thinking-only turns When JSON extraction fails but <think> exists, the turn is salvaged as thinking-only (no <bash> block). Rejected only if >50% of assistant turns fail.

3. Diversity Sampling

Weighted sampling from 340K โ†’ 50K using domain and difficulty weights:

Domain Weight Difficulty Weight
software_engineering 2.0x medium 1.5x
debugging 2.0x easy 1.0x
security 1.8x mixed 0.8x
swe 1.8x na (adapters) 1.2x
code 1.5x
system_administration 1.5x
data_science 1.3x
scientific_computing 1.3x
others 1.0x

4. Push to HF Hub

Pushes as standard HF Dataset with columns:

  • conversations: List[Dict] with {role, content} โ€” MicroAgent XML format
  • task: Task identifier
  • source_category: Domain (code, math, swe, debugging, etc.)
  • difficulty: easy / medium / mixed / na
  • config: Original config path
  • est_token_count: Estimated Hunyuan-4B token count (chars / 3.5)
  • enable_thinking: Whether DeepSeek thinking mode was enabled

Results

Stage Count
Input 366,154
After filtering + conversion 340,191 (92.9%)
After sampling 50,000
Filter Removed
too_long 22,200
malformed_json 2,571
too_short 1,190
identity_leak 1
tb2_contaminated 1

Memory optimization

The pipeline writes intermediate results to JSONL on disk instead of holding all 340K rows in memory. This allows it to run on cpu-upgrade (32GB RAM) without OOM. Each parquet file is processed independently and freed after writing.

Conversations are serialized as JSON strings in the JSONL to avoid nested object overhead. They're deserialized back when building the final HF Dataset for upload.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Paper for prometheus04/nemotron-terminal-pipeline