# Skill Evals (evals.json)
## Overview
Agent Skills is an open standard for describing AI agent capabilities. Its evals.json format defines simple test cases for skills — a prompt, expected output, and natural-language assertions.
AgentV natively supports evals.json. You can run Agent Skills evals directly:
```shell
agentv eval evals.json --target claude
```

When you need AgentV’s power features (deterministic evaluators, composite scoring, multi-turn conversations, workspace isolation), you can graduate to EVAL.yaml.
## Quick start
Create evals.json:
```json
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in evals/files/sales.csv. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output includes exactly 3 months",
        "Revenue figures are included for each month"
      ]
    }
  ]
}
```

Run it:
```shell
agentv eval evals.json --target claude
```

The `--target` flag selects the agent harness. The agent evaluates itself — skills load naturally via progressive disclosure.
## Field mapping
When AgentV loads evals.json, it promotes fields to its internal representation:
| evals.json | EVAL.yaml equivalent | Notes |
|---|---|---|
| `prompt` | `input` | Wrapped as `[{role: "user", content: prompt}]` |
| `expected_output` | `expected_output` + `criteria` | Used as reference answer and evaluation criteria |
| `assertions[]` | `assert[]` | Each string becomes `{type: llm-judge, prompt: text}` |
| `files[]` | `file_paths` | Resolved relative to evals.json, copied into workspace |
| `skill_name` | `metadata.skill_name` | Carried as metadata |
| `id` (number) | `id` (string) | Converted via `String(id)` |
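The promotion described in this table can be sketched in Python. This is an illustrative sketch only — `promote` is a hypothetical helper, and AgentV's real internals may differ:

```python
def promote(case: dict, skill_name: str) -> dict:
    """Map one evals.json test case onto EVAL.yaml-style fields,
    following the field-mapping table above (sketch, not AgentV source)."""
    return {
        # Numeric id is converted to a string, mirroring String(id)
        "id": str(case["id"]),
        # prompt is wrapped as a single user message
        "input": [{"role": "user", "content": case["prompt"]}],
        # expected_output doubles as reference answer and criteria
        "expected_output": case.get("expected_output"),
        "criteria": case.get("expected_output"),
        # Each assertion string becomes an llm-judge evaluator
        "assert": [
            {"type": "llm-judge", "prompt": text}
            for text in case.get("assertions", [])
        ],
        "file_paths": case.get("files", []),
        "metadata": {"skill_name": skill_name},
    }

case = {"id": 1, "prompt": "Analyze sales", "assertions": ["Mentions November"]}
promoted = promote(case, "csv-analyzer")
print(promoted["id"])         # "1"
print(promoted["assert"][0])  # {'type': 'llm-judge', 'prompt': 'Mentions November'}
```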
## Files support
The `files[]` field lists files that the agent needs during evaluation. Paths are relative to the evals.json location:
```json
{
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze the sales data",
      "files": ["evals/files/sales.csv", "evals/files/config.json"]
    }
  ]
}
```

AgentV resolves these paths and copies the files into the workspace before the agent runs. If a file is missing, the test case fails with a `file_copy_error`.
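A rough Python sketch of this resolve-and-copy behavior — `copy_files` is a hypothetical helper, not AgentV's actual implementation:

```python
import shutil
import tempfile
from pathlib import Path

def copy_files(evals_json_path, files, workspace):
    """Resolve files[] relative to evals.json and copy them into the
    workspace, preserving relative paths (illustrative sketch)."""
    base = Path(evals_json_path).resolve().parent
    for rel in files:
        src = base / rel
        if not src.exists():
            # Mirrors the documented behavior: a missing file fails the case
            raise FileNotFoundError(f"file_copy_error: {src}")
        dest = Path(workspace) / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)

# Demo in a throwaway directory
tmp = Path(tempfile.mkdtemp())
(tmp / "evals" / "files").mkdir(parents=True)
(tmp / "evals" / "files" / "sales.csv").write_text("month,revenue\nNov,22500\n")
copy_files(tmp / "evals.json", ["evals/files/sales.csv"], tmp / "workspace")
copied = (tmp / "workspace" / "evals" / "files" / "sales.csv").exists()
print(copied)  # True
```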
## Agent mode (no API keys)
AgentV’s `prompt` subcommands work with evals.json, enabling agent-mode evaluation without API keys:
```shell
# Show orchestration overview
agentv prompt eval evals.json

# Get input for a specific test
agentv prompt eval input evals.json --test-id 1

# Get judge prompts for a test
agentv prompt eval judge evals.json --test-id 1 --answer-file answer.txt
```

The `eval-candidate` and `eval-judge` agents can be dispatched against evals.json just like EVAL.yaml files.
## Benchmark output
Generate an Agent Skills-compatible benchmark.json alongside the standard result JSONL:
```shell
agentv eval evals.json --target claude --benchmark-json benchmark.json
```

The benchmark uses AgentV’s pass threshold (score >= 0.8) to map continuous scores to the binary pass/fail that Agent Skills `pass_rate` expects:
```json
{
  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.83, "stddev": 0.06},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400}
    }
  }
}
```
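The thresholding can be illustrated with a small Python sketch. The 0.8 cutoff comes from the text above; the `pass_rate` helper and the use of population standard deviation are assumptions, not AgentV's documented aggregation:

```python
from statistics import mean, pstdev

PASS_THRESHOLD = 0.8  # from the docs: score >= 0.8 counts as a pass

def pass_rate(scores):
    """Map continuous scores to binary pass/fail, then summarize
    (hypothetical helper; stddev choice is an assumption)."""
    passes = [1.0 if s >= PASS_THRESHOLD else 0.0 for s in scores]
    return {"mean": mean(passes), "stddev": pstdev(passes)}

result = pass_rate([0.95, 0.85, 0.6, 0.9])
print(result["mean"])  # 0.75 — 3 of 4 cases pass
```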
## Converting to EVAL.yaml

When you’re ready to graduate, convert your evals.json to EVAL.yaml:
```shell
# Output to stdout
agentv convert evals.json

# Write to file
agentv convert evals.json -o eval.yaml
```

The generated YAML includes comments about available AgentV features you can use:
```yaml
# Converted from Agent Skills evals.json
# AgentV features you can add:
# - type: is_json, contains, regex for deterministic evaluators
# - type: code-judge for custom scoring scripts
# - Multi-turn conversations via input message arrays
# - Composite evaluators with weighted scoring
# - Workspace isolation with repos and hooks
```
```yaml
tests:
  - id: "1"
    criteria: |-
      The top 3 months by revenue are November, September, and December.
    input:
      - role: user
        content: "Find the top 3 months by revenue."
    # Promoted from evals.json assertions[]
    # Replace with type: is_json, contains, or regex for deterministic checks
    assert:
      - name: assertion-1
        type: llm-judge
        prompt: "Output identifies November as the highest revenue month"
```
## When to stay with evals.json

Use evals.json when:
- You’re building a skill and want quick feedback loops
- Your assertions are natural-language (“output includes a chart”, “response is polite”)
- You want compatibility with other Agent Skills tooling
- Tests don’t need workspace isolation or deterministic checks
## When to graduate to EVAL.yaml
Switch to EVAL.yaml when you need:
- Deterministic evaluators: `contains`, `regex`, `equals`, `is-json` — faster and cheaper than LLM judges
- Composite scoring: Weighted evaluators with custom aggregation
- Multi-turn conversations: Multi-message input sequences
- Workspace isolation: Sandboxed file systems per test case
- Tool trajectory evaluation: Assert on the sequence of tool calls
- Matrix evaluation: Test across multiple targets simultaneously
## Side-by-side comparison
The same eval expressed in both formats:
### evals.json
```json
{
  "skill_name": "support-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "A customer says their order #12345 hasn't arrived after 2 weeks. Help them.",
      "expected_output": "An empathetic response that offers to track the order and provides next steps.",
      "assertions": [
        "Response acknowledges the customer's frustration",
        "Response offers to look up order #12345",
        "Response provides clear next steps"
      ]
    }
  ]
}
```

### EVAL.yaml equivalent
```yaml
tests:
  - id: "1"
    input: |
      A customer says their order #12345 hasn't arrived after 2 weeks. Help them.
    expected_output: |
      An empathetic response that offers to track the order and provides next steps.
    assert:
      - name: acknowledges-frustration
        type: llm-judge
        prompt: Response acknowledges the customer's frustration
      - name: looks-up-order
        type: contains
        value: "12345"
      - name: has-next-steps
        type: llm-judge
        prompt: Response provides clear next steps
```

Notice how the EVAL.yaml version can mix `llm-judge` (for subjective checks) with `contains` (for deterministic checks) — the order number check is now instant and free.