Eval Case Format

YAML schema

Two formats are supported: flat criteria strings or structured grading.rubric.

Flat criteria (simple)

name: string              # required - human-readable case name
prompt: string             # required - user prompt sent to Claude
criteria:                  # required - list of pass/fail assertions
  - string
  - string
expect_skill: boolean      # optional - default true
timeout: integer           # optional - default from action input (120s)
files:                     # optional - temp files created before execution
  - path: string
    content: string

Rubric format (structured)

id: string                 # optional - case identifier
category: string           # optional - for organizing cases
prompt: string             # required - user prompt sent to Claude
grading:
  rubric:
    - id: string           # optional - criterion identifier
      description: string  # required - what is being evaluated
      weight: number       # optional - relative importance
      pass_if: string      # optional - specific conditions for passing
  pass_threshold: number   # optional - per-case pass threshold (0.0-1.0)
expect_skill: boolean      # optional - default true
timeout: integer           # optional - default from action input (120s)
files:                     # optional - temp files created before execution
  - path: string
    content: string
notes: string              # optional - context for eval authors (not sent to grader)

Rubric entries are automatically normalized: description and pass_if are combined into a single criterion string for grading. Both .yaml and .yml extensions are supported.

How execution works

For positive cases (expect_skill: true), the SKILL.md content is prepended to the prompt so Claude has access to the skill's instructions.

For negative cases (expect_skill: false), the bare prompt is sent without skill content.

Grading output

Each case produces a grading.json:

{
  "expectations": [
    {
      "text": "Output contains a valid resource block",
      "passed": true,
      "evidence": "The response includes 'resource \"aws_instance\" \"web\" {...}'"
    },
    {
      "text": "Uses for_each, not count",
      "passed": false,
      "evidence": "Response uses 'count = var.instance_count' instead of for_each"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.5
  }
}

YAML schema​

Flat criteria (simple)​

Rubric format (structured)​

How execution works​

Grading output​

YAML schema

Flat criteria (simple)

Rubric format (structured)

How execution works

Grading output