Eval Case Format
YAML schema
Two case formats are supported: a flat list of criteria strings, or a structured rubric under grading.rubric.
Flat criteria (simple)
name: string # required - human-readable case name
prompt: string # required - user prompt sent to Claude
criteria: # required - list of pass/fail assertions
  - string
  - string
expect_skill: boolean # optional - default true
timeout: integer # optional - default from action input (120s)
files: # optional - temp files created before execution
  - path: string
    content: string
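The required and optional fields above can be checked with a few lines of code. The following is a minimal sketch, not the harness's actual loader: it assumes the YAML has already been parsed into a dict, and `validate_flat_case` and `default_timeout` are hypothetical names chosen for illustration.

```python
from typing import Any

# Fields the flat schema marks as required.
REQUIRED_FLAT_KEYS = ("name", "prompt", "criteria")


def validate_flat_case(case: dict[str, Any], default_timeout: int = 120) -> dict[str, Any]:
    """Check required fields and fill in the documented defaults (hypothetical helper)."""
    missing = [key for key in REQUIRED_FLAT_KEYS if key not in case]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")

    criteria = case["criteria"]
    if not isinstance(criteria, list) or not all(isinstance(c, str) for c in criteria):
        raise ValueError("criteria must be a list of strings")

    return {
        "name": case["name"],
        "prompt": case["prompt"],
        "criteria": list(criteria),
        # expect_skill defaults to true per the schema.
        "expect_skill": case.get("expect_skill", True),
        # timeout falls back to the action input, assumed here to be 120 seconds.
        "timeout": case.get("timeout", default_timeout),
        "files": case.get("files", []),
    }


example = validate_flat_case({
    "name": "uses for_each",
    "prompt": "Write Terraform for three EC2 instances.",
    "criteria": ["Output contains a valid resource block", "Uses for_each, not count"],
})
print(example["expect_skill"], example["timeout"])  # True 120
```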
Rubric format (structured)
id: string # optional - case identifier
category: string # optional - for organizing cases
prompt: string # required - user prompt sent to Claude
grading:
  rubric:
    - id: string # optional - criterion identifier
      description: string # required - what is being evaluated
      weight: number # optional - relative importance
      pass_if: string # optional - specific conditions for passing
pass_threshold: number # optional - per-case pass threshold (0.0-1.0)
expect_skill: boolean # optional - default true
timeout: integer # optional - default from action input (120s)
files: # optional - temp files created before execution
  - path: string
    content: string
notes: string # optional - context for eval authors (not sent to grader)
Rubric entries are automatically normalized: description and pass_if are combined into a single criterion string for grading. Both .yaml and .yml extensions are supported.
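The normalization step might look roughly like the sketch below. The exact way description and pass_if are joined is not specified here, so the "(passes if: ...)" wording is an assumption, and `normalize_rubric` is a hypothetical name. The sketch also ignores weight, since this section does not say how weights are applied.

```python
def normalize_rubric(rubric: list[dict]) -> list[str]:
    """Collapse each rubric entry into one criterion string (illustrative only)."""
    criteria = []
    for entry in rubric:
        description = entry["description"].strip()
        # pass_if is optional; treat a missing or empty value as absent.
        pass_if = (entry.get("pass_if") or "").strip()
        if pass_if:
            # Assumed join format; the real harness may phrase this differently.
            criteria.append(f"{description} (passes if: {pass_if})")
        else:
            criteria.append(description)
    return criteria


print(normalize_rubric([
    {"id": "r1", "description": "Uses for_each", "pass_if": "no count argument appears"},
    {"description": "Output contains a valid resource block"},
]))
```

After this step a rubric case can be graded the same way as a flat case, because both end up as a list of plain strings.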
How execution works
For positive cases (expect_skill: true), the SKILL.md content is prepended to the prompt so Claude has access to the skill's instructions.
For negative cases (expect_skill: false), the bare prompt is sent without skill content.
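The difference between the two case types can be sketched as follows. This is an illustration under assumptions: `build_prompt` is a hypothetical helper, the skill text is passed in as a string rather than read from SKILL.md, and the blank-line separator between skill content and prompt is a guess.

```python
def build_prompt(prompt: str, skill_text: str, expect_skill: bool = True) -> str:
    """Return the text sent to Claude for a case (illustrative only)."""
    if not expect_skill:
        # Negative case: the bare prompt, with no skill content.
        return prompt
    # Positive case: skill content first, then the user prompt.
    # The "\n\n" separator is an assumption.
    return f"{skill_text}\n\n{prompt}"


skill = "# Terraform skill\nPrefer for_each over count."
print(build_prompt("Create three EC2 instances.", skill, expect_skill=True))
print(build_prompt("What is the capital of France?", skill, expect_skill=False))
```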
Grading output
Each case produces a grading.json:
{
  "expectations": [
    {
      "text": "Output contains a valid resource block",
      "passed": true,
      "evidence": "The response includes 'resource \"aws_instance\" \"web\" {...}'"
    },
    {
      "text": "Uses for_each, not count",
      "passed": false,
      "evidence": "Response uses 'count = var.instance_count' instead of for_each"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.5
  }
}
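The summary block is derivable from the expectations list. Here is a minimal sketch of that arithmetic, assuming `pass_rate` is simply passed divided by total; `summarize` is a hypothetical name, and returning 0.0 for an empty list is an assumption.

```python
def summarize(expectations: list[dict]) -> dict:
    """Recompute the summary fields from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Avoid division by zero; the 0.0 fallback is an assumption.
        "pass_rate": passed / total if total else 0.0,
    }


summary = summarize([
    {"text": "Output contains a valid resource block", "passed": True},
    {"text": "Uses for_each, not count", "passed": False},
])
print(summary)  # {'passed': 1, 'failed': 1, 'total': 2, 'pass_rate': 0.5}
```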