Batch Processing

Batch processing lets you run the Hermes agent across hundreds or thousands of prompts in parallel, generating structured trajectory data. It is primarily used for training data generation: it produces ShareGPT-format trajectories with tool usage statistics that can be used for fine-tuning or evaluation.

Overview

The batch runner (batch_runner.py) processes a JSONL dataset of prompts, running each prompt through a full agent session with tool access. Each prompt gets its own isolated environment. The output is structured trajectory data with full conversation history, tool call statistics, and reasoning coverage metrics.

Quick Start

# Basic batch run
python batch_runner.py \
    --dataset_file=data/prompts.jsonl \
    --batch_size=10 \
    --run_name=my_first_run \
    --model=anthropic/claude-sonnet-4.6 \
    --num_workers=4

# Resume an interrupted run
python batch_runner.py \
    --dataset_file=data/prompts.jsonl \
    --batch_size=10 \
    --run_name=my_first_run \
    --resume

# List available toolset distributions
python batch_runner.py --list_distributions

Dataset Format

The input dataset is a JSONL file (one JSON object per line). Each entry must have a prompt field:

{"prompt": "Write a Python function that finds the longest palindromic substring"}
{"prompt": "Create a REST API endpoint for user authentication using Flask"}
{"prompt": "Debug this error: TypeError: cannot unpack non-iterable NoneType object"}

Entries can optionally include:

image or docker_image: Container image to use for this prompt's sandbox (works with the Docker, Modal, and Singularity backends)
cwd: Working directory override for the task's terminal session
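
Before starting a long run, it can help to check that every line of the dataset parses and carries a prompt. The following is a minimal sketch, not part of batch_runner.py; the load_prompts helper is a hypothetical name, and it assumes the only hard requirement is a non-empty string prompt field.

```python
import json


def load_prompts(lines):
    """Parse JSONL lines and return the entries that carry a usable prompt.

    Hypothetical helper for illustration; raises ValueError on a bad entry.
    """
    entries = []
    for line_number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between entries
        entry = json.loads(raw)
        prompt = entry.get("prompt") if isinstance(entry, dict) else None
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"line {line_number}: missing or empty 'prompt' field")
        entries.append(entry)
    return entries


sample = [
    '{"prompt": "Compile this Rust program and run it", "image": "rust:1.75"}',
    '{"prompt": "Set up a Node.js Express server", "image": "node:20-alpine", "cwd": "/app"}',
]
entries = load_prompts(sample)
print(f"{len(entries)} valid entries")
```

To check a real file, pass an open file handle instead of the sample list, since iterating a text file yields its lines.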

Configuration Options

Parameter	Default	Description
`--dataset_file`	(required)	Path to the JSONL dataset
`--batch_size`	(required)	Prompts per batch
`--run_name`	(required)	Name for this run (used for the output directory and checkpoints)
`--distribution`	`"default"`	Toolset distribution to sample from
`--model`	`claude-sonnet-4.6`	Model to use
`--base_url`	`https://openrouter.ai/api/v1`	API base URL
`--api_key`	(env var)	API key for the model
`--max_turns`	`10`	Maximum tool-calling iterations per prompt
`--num_workers`	`4`	Parallel worker processes
`--resume`	`false`	Resume from checkpoint
`--verbose`	`false`	Enable verbose logging
`--max_samples`	all	Only process the first N samples from the dataset
`--max_tokens`	model default	Maximum tokens per model response

Provider Routing (OpenRouter)

Parameter	Description
`--providers_allowed`	Comma-separated providers to allow (e.g., `"anthropic,openai"`)
`--providers_ignored`	Comma-separated providers to ignore (e.g., `"together,deepinfra"`)
`--providers_order`	Comma-separated preferred provider order
`--provider_sort`	Sort by `"price"`, `"throughput"`, or `"latency"`

Reasoning Control

Parameter	Description
`--reasoning_effort`	Effort level: `xhigh`, `high`, `medium`, `low`, `minimal`, `none`
`--reasoning_disabled`	Completely disable reasoning/thinking tokens

Advanced Options

Parameter	Description
`--ephemeral_system_prompt`	System prompt used during execution but NOT saved to trajectories
`--log_prefix_chars`	Characters to show in log previews (default: 100)
`--prefill_messages_file`	Path to JSON file with prefill messages for few-shot priming

Toolset Distributions

Each prompt gets a randomly sampled set of toolsets from a distribution. This ensures training data covers diverse tool combinations. Use --list_distributions to see all available distributions.

In the current implementation, distributions assign a probability to each individual toolset. The sampler flips each toolset independently, then guarantees that at least one toolset is enabled. This is different from a hand-authored table of prebuilt combinations.
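
The independent-flip behavior described above can be sketched as follows. This is an illustration only, not the runner's code: the sample_toolsets function, the example probabilities, and the "web" toolset name are hypothetical, and the fallback rule (pick the most probable toolset when every flip fails) is an assumption about how the "at least one" guarantee might be met.

```python
import random


def sample_toolsets(distribution, rng):
    """Flip each toolset independently, then make sure at least one is enabled."""
    chosen = [name for name, prob in distribution.items() if rng.random() < prob]
    if not chosen:
        # Assumption: fall back to the most probable toolset. The real
        # runner may pick its fallback differently.
        chosen = [max(distribution, key=distribution.get)]
    return chosen


# Hypothetical distribution: toolset name -> probability of being enabled.
example_distribution = {"terminal": 0.9, "file": 0.6, "web": 0.3}

rng = random.Random(0)  # seeded so repeated runs sample the same sets
for _ in range(3):
    print(sample_toolsets(example_distribution, rng))
```

Because each toolset is flipped on its own, the combinations seen across a run emerge from the per-toolset probabilities instead of being listed by hand.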

Output Format

All output goes to data/<run_name>/:

data/my_run/
├── trajectories.jsonl    # Combined final output (all batches merged)
├── batch_0.jsonl         # Individual batch results
├── batch_1.jsonl
├── ...
├── checkpoint.json       # Resume checkpoint
└── statistics.json       # Aggregate tool usage stats

Trajectory Format

Each line in trajectories.jsonl is a JSON object:

{
  "prompt_index": 42,
  "conversations": [
    {"from": "human", "value": "Write a function..."},
    {"from": "gpt", "value": "I'll create that function...",
     "tool_calls": [...]},
    {"from": "tool", "value": "..."},
    {"from": "gpt", "value": "Here's the completed function..."}
  ],
  "metadata": {
    "batch_num": 2,
    "timestamp": "2026-01-15T10:30:00",
    "model": "anthropic/claude-sonnet-4.6"
  },
  "completed": true,
  "partial": false,
  "api_calls": 3,
  "toolsets_used": ["terminal", "file"],
  "tool_stats": {
    "terminal": {"count": 2, "success": 2, "failure": 0},
    "read_file": {"count": 1, "success": 1, "failure": 0}
  },
  "tool_error_counts": {
    "terminal": 0,
    "read_file": 0
  }
}

The conversations field uses a ShareGPT-like format with from and value fields. Tool stats are normalized to include all possible tools with zero defaults, ensuring consistent schema across entries for HuggingFace datasets compatibility.
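
The zero-default normalization can be pictured with a short sketch. The normalize_tool_stats function and the all_tools list are hypothetical names introduced for illustration; the runner's own tool list and implementation may differ.

```python
def normalize_tool_stats(tool_stats, all_tools):
    """Return stats for every known tool, filling unused tools with zeros."""
    zero = {"count": 0, "success": 0, "failure": 0}
    return {tool: {**zero, **tool_stats.get(tool, {})} for tool in all_tools}


# Hypothetical tool list; the real list comes from the runner.
all_tools = ["terminal", "read_file", "write_file"]
raw_stats = {"terminal": {"count": 2, "success": 2, "failure": 0}}

normalized = normalize_tool_stats(raw_stats, all_tools)
print(normalized["write_file"])
```

With every entry carrying the same keys, a dataset loader that infers a schema from the first rows is less likely to reject later rows.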

Checkpointing

The batch runner has robust checkpointing for fault tolerance:

Checkpoint file: Saved after each batch completes, tracking which prompt indices are done
Content-based resume: On --resume, the runner scans existing batch files and matches completed prompts by their actual text content (not just indices), enabling recovery even if the dataset order changes
Failed prompts: Only successfully completed prompts are marked as done — failed prompts will be retried on resume
Batch merging: On completion, all batch files (including from previous runs) are merged into a single trajectories.jsonl

How Resume Works

Scan all batch_*.jsonl files for completed prompts (by content matching)
Filter the dataset to exclude already-completed prompts
Re-batch the remaining prompts
Process only the remaining prompts
Merge all batch files (old + new) into final output
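
The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the runner's implementation: the function names are hypothetical, the records are assumed to be already parsed from the batch_*.jsonl files, and the prompt text is assumed to be recoverable from the first human turn of each trajectory.

```python
def completed_prompt_texts(records):
    """Collect the prompt text of every record marked as completed."""
    done = set()
    for record in records:
        if not record.get("completed"):
            continue  # failed prompts stay eligible for a retry
        for turn in record.get("conversations", []):
            if turn.get("from") == "human":
                done.add(turn["value"])  # assumption: first human turn is the prompt
                break
    return done


def remaining_batches(dataset, records, batch_size):
    """Drop already-completed prompts, then re-batch what is left."""
    done = completed_prompt_texts(records)
    remaining = [entry for entry in dataset if entry["prompt"] not in done]
    return [remaining[i:i + batch_size] for i in range(0, len(remaining), batch_size)]


dataset = [{"prompt": "A"}, {"prompt": "B"}, {"prompt": "C"}]
records = [
    {"completed": True, "conversations": [{"from": "human", "value": "A"}]},
    {"completed": False, "conversations": [{"from": "human", "value": "B"}]},
]
print(remaining_batches(dataset, records, batch_size=2))
```

Matching on text instead of position is what lets a reordered dataset resume cleanly: prompt "A" is skipped wherever it now sits in the file.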

Quality Filtering

The batch runner applies automatic quality filtering:

No-reasoning filter: Samples where zero assistant turns contain reasoning (no <REASONING_SCRATCHPAD> or native thinking tokens) are discarded
Corrupted entry filter: Entries with hallucinated tool names (not in the valid tool list) are filtered out during the final merge
Reasoning statistics: Tracks percentage of turns with/without reasoning across the entire run
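
As a rough illustration of the two filters, the sketch below checks parsed trajectory records. It rests on several assumptions: the helper names are hypothetical, reasoning is detected only through the <REASONING_SCRATCHPAD> tag (native thinking tokens are not modeled because their representation is not described here), and the tools a record used are read from the keys of its tool_stats field.

```python
REASONING_TAG = "<REASONING_SCRATCHPAD>"


def has_reasoning(record):
    """True if at least one assistant turn contains a reasoning scratchpad."""
    return any(
        turn.get("from") == "gpt" and REASONING_TAG in turn.get("value", "")
        for turn in record.get("conversations", [])
    )


def has_unknown_tools(record, valid_tools):
    """True if the record references a tool name outside the valid list."""
    return any(name not in valid_tools for name in record.get("tool_stats", {}))


def filter_records(records, valid_tools):
    """Keep records that show reasoning and use only known tools."""
    return [
        r for r in records
        if has_reasoning(r) and not has_unknown_tools(r, valid_tools)
    ]


valid_tools = {"terminal", "read_file"}  # hypothetical valid tool list
records = [
    {"conversations": [{"from": "gpt", "value": REASONING_TAG + "plan"}],
     "tool_stats": {"terminal": {"count": 1, "success": 1, "failure": 0}}},
    {"conversations": [{"from": "gpt", "value": "no reasoning here"}],
     "tool_stats": {}},
    {"conversations": [{"from": "gpt", "value": REASONING_TAG + "plan"}],
     "tool_stats": {"made_up_tool": {"count": 1, "success": 0, "failure": 1}}},
]
print(len(filter_records(records, valid_tools)))
```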

Statistics

After completion, the runner prints comprehensive statistics:

Tool usage: Call counts, success/failure rates per tool
Reasoning coverage: Percentage of assistant turns with reasoning
Samples discarded: Count of samples filtered for lacking reasoning
Duration: Total processing time

Statistics are also saved to statistics.json for programmatic analysis.
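
The layout of statistics.json is not described here, so the sketch below does not read it. Instead it recomputes per-tool totals directly from trajectory lines, using only the tool_stats fields shown in the trajectory format above. The function names are hypothetical.

```python
import json
from collections import Counter


def aggregate_tool_stats(jsonl_lines):
    """Sum count/success/failure per tool across all trajectory lines."""
    totals = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for tool, stats in record.get("tool_stats", {}).items():
            bucket = totals.setdefault(tool, Counter())
            bucket.update({key: stats.get(key, 0) for key in ("count", "success", "failure")})
    return totals


def success_rate(stats):
    """Fraction of successful calls, or None when the tool was never called."""
    return stats["success"] / stats["count"] if stats["count"] else None


sample_lines = [
    '{"tool_stats": {"terminal": {"count": 2, "success": 2, "failure": 0}}}',
    '{"tool_stats": {"terminal": {"count": 2, "success": 1, "failure": 1},'
    ' "read_file": {"count": 0, "success": 0, "failure": 0}}}',
]
totals = aggregate_tool_stats(sample_lines)
for tool, stats in totals.items():
    print(tool, stats["count"], success_rate(stats))
```

For a real run, open the merged trajectories.jsonl in the run's output directory and pass the file handle in place of sample_lines.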

Use Cases

Training Data Generation

Generate diverse tool-use trajectories for fine-tuning:

python batch_runner.py \
    --dataset_file=data/coding_prompts.jsonl \
    --batch_size=20 \
    --run_name=coding_v1 \
    --model=anthropic/claude-sonnet-4.6 \
    --num_workers=8 \
    --distribution=default \
    --max_turns=15

Model Evaluation

Evaluate how well a model uses tools across standardized prompts:

python batch_runner.py \
    --dataset_file=data/eval_suite.jsonl \
    --batch_size=10 \
    --run_name=eval_gpt4 \
    --model=openai/gpt-4o \
    --num_workers=4 \
    --max_turns=10

Per-Prompt Container Images

For benchmarks requiring specific environments, each prompt can specify its own container image:

{"prompt": "Install numpy and compute eigenvalues of a 3x3 matrix", "image": "python:3.11-slim"}
{"prompt": "Compile this Rust program and run it", "image": "rust:1.75"}
{"prompt": "Set up a Node.js Express server", "image": "node:20-alpine", "cwd": "/app"}

The batch runner verifies Docker images are accessible before running each prompt.
