Skill Evals (evals.json)

Agent Skills is an open standard for describing AI agent capabilities. Its evals.json format defines simple test cases for skills — a prompt, expected output, and natural-language assertions.

AgentV natively supports evals.json. You can run Agent Skills evals directly:

agentv eval evals.json --target claude

When you need AgentV’s power features (deterministic evaluators, composite scoring, multi-turn conversations, workspace isolation), you can graduate to EVAL.yaml.

Create evals.json:

{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in evals/files/sales.csv. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output includes exactly 3 months",
        "Revenue figures are included for each month"
      ]
    }
  ]
}

Run it:

agentv eval evals.json --target claude

The --target flag selects the agent harness. The agent evaluates itself — skills load naturally via progressive disclosure.

When AgentV loads evals.json, it promotes fields to its internal representation:

| evals.json | EVAL.yaml equivalent | Notes |
| --- | --- | --- |
| `prompt` | `input` | Wrapped as `[{role: "user", content: prompt}]` |
| `expected_output` | `expected_output` + `criteria` | Used as the reference answer and as evaluation criteria |
| `assertions[]` | `assert[]` | Each string becomes `{type: llm-judge, prompt: text}` |
| `files[]` | `file_paths` | Resolved relative to evals.json, copied into the workspace |
| `skill_name` | `metadata.skill_name` | Carried as metadata |
| `id` (number) | `id` (string) | Converted via `String(id)` |

The files[] field lists files that the agent needs during evaluation. Paths are relative to the evals.json location:

{
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze the sales data",
      "files": ["evals/files/sales.csv", "evals/files/config.json"]
    }
  ]
}

AgentV resolves these paths and copies the files into the workspace before the agent runs. If a file is missing, the test case fails with a file_copy_error.

AgentV’s prompt subcommands work with evals.json, enabling agent-mode evaluation without API keys:

# Show orchestration overview
agentv prompt eval evals.json
# Get input for a specific test
agentv prompt eval input evals.json --test-id 1
# Get judge prompts for a test
agentv prompt eval judge evals.json --test-id 1 --answer-file answer.txt

The eval-candidate and eval-judge agents can be dispatched against evals.json just like EVAL.yaml files.

Generate an Agent Skills-compatible benchmark.json alongside the standard JSONL results:

agentv eval evals.json --target claude --benchmark-json benchmark.json

The benchmark uses AgentV’s pass threshold (score >= 0.8) to map continuous scores to the binary pass/fail that Agent Skills pass_rate expects:

{
  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.83, "stddev": 0.06},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400}
    }
  }
}
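The threshold mapping is simple to state precisely. This is a minimal sketch, assuming population standard deviation for the reported `stddev`; the 0.8 threshold is AgentV's pass threshold from the text above.

```python
# Map continuous scores to binary pass/fail, then aggregate as pass_rate.
from statistics import mean, pstdev

def pass_rate(scores: list[float], threshold: float = 0.8) -> dict:
    passes = [1.0 if s >= threshold else 0.0 for s in scores]
    return {"mean": mean(passes), "stddev": pstdev(passes)}

# Two runs, one above and one below the threshold:
print(pass_rate([0.9, 0.7]))  # → {'mean': 0.5, 'stddev': 0.5}
```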

When you’re ready to graduate, convert your evals.json to EVAL.yaml:

# Output to stdout
agentv convert evals.json
# Write to file
agentv convert evals.json -o eval.yaml

The generated YAML includes comments about available AgentV features you can use:

# Converted from Agent Skills evals.json
# AgentV features you can add:
# - type: is_json, contains, regex for deterministic evaluators
# - type: code-judge for custom scoring scripts
# - Multi-turn conversations via input message arrays
# - Composite evaluators with weighted scoring
# - Workspace isolation with repos and hooks
tests:
  - id: "1"
    criteria: |-
      The top 3 months by revenue are November, September, and December.
    input:
      - role: user
        content: "Find the top 3 months by revenue."
    # Promoted from evals.json assertions[]
    # Replace with type: is_json, contains, or regex for deterministic checks
    assert:
      - name: assertion-1
        type: llm-judge
        prompt: "Output identifies November as the highest revenue month"

Use evals.json when:

  • You’re building a skill and want quick feedback loops
  • Your assertions are natural-language (“output includes a chart”, “response is polite”)
  • You want compatibility with other Agent Skills tooling
  • Tests don’t need workspace isolation or deterministic checks

Switch to EVAL.yaml when you need:

  • Deterministic evaluators: contains, regex, equals, is-json — faster and cheaper than LLM judges
  • Composite scoring: Weighted evaluators with custom aggregation
  • Multi-turn conversations: Multi-message input sequences
  • Workspace isolation: Sandboxed file systems per test case
  • Tool trajectory evaluation: Assert on the sequence of tool calls
  • Matrix evaluation: Test across multiple targets simultaneously

The same eval expressed in both formats:

evals.json:

{
  "skill_name": "support-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "A customer says their order #12345 hasn't arrived after 2 weeks. Help them.",
      "expected_output": "An empathetic response that offers to track the order and provides next steps.",
      "assertions": [
        "Response acknowledges the customer's frustration",
        "Response offers to look up order #12345",
        "Response provides clear next steps"
      ]
    }
  ]
}
EVAL.yaml:

tests:
  - id: "1"
    input: |
      A customer says their order #12345 hasn't arrived after 2 weeks. Help them.
    expected_output: |
      An empathetic response that offers to track the order and provides next steps.
    assert:
      - name: acknowledges-frustration
        type: llm-judge
        prompt: Response acknowledges the customer's frustration
      - name: looks-up-order
        type: contains
        value: "12345"
      - name: has-next-steps
        type: llm-judge
        prompt: Response provides clear next steps

Notice how the EVAL.yaml version can mix llm-judge (for subjective checks) with contains (for deterministic checks) — the order number check is now instant and free.
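To see why the deterministic check is instant and free, here is a hypothetical evaluator mirroring the `type: contains` assertion; the function name is an assumption, but the semantics (substring match, no model call) follow the example above.

```python
# A contains check needs no LLM: a plain substring test scores the output.
def contains_check(output: str, value: str) -> float:
    return 1.0 if value in output else 0.0

contains_check("I'll look up order #12345 right away.", "12345")  # → 1.0
contains_check("Let me check your recent orders.", "12345")       # → 0.0
```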