Writing Evals

Eval cases are YAML files that define test scenarios for your skills.

File format

Place .yaml or .yml files in <skill-path>/evals/. Two formats are supported.

Flat criteria (simple)

name: Variable declaration ordering
prompt: "Write a Terraform variable block for a project_id string"
files:
  - path: "main.tf"
    content: |
      resource "aws_instance" "web" {}
criteria:
  - "Variable block has description as the first field"
  - "Uses type = string"
  - "Does not include a default value for required variables"
expect_skill: true
timeout: 120
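Once a file like this is parsed into a Python dict (with any YAML library), the documented defaults can be filled in. The sketch below is illustrative only — `apply_defaults` is not part of any eval runner, and it assumes the default name is the filename stem:

```python
from pathlib import Path

def apply_defaults(case: dict, source_path: str) -> dict:
    """Fill in the documented defaults for a parsed eval case."""
    out = dict(case)
    # name defaults to the filename (stem here; the runner may keep the extension)
    out.setdefault("name", Path(source_path).stem)
    # expect_skill defaults to true; timeout defaults to 120 seconds
    out.setdefault("expect_skill", True)
    out.setdefault("timeout", 120)
    return out

case = {
    "prompt": "Write a Terraform variable block for a project_id string",
    "criteria": ["Uses type = string"],
}
full = apply_defaults(case, "evals/variable-ordering.yaml")
print(full["name"], full["expect_skill"], full["timeout"])
# variable-ordering True 120
```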

Rubric format (structured)

For richer grading with weighted criteria and specific pass conditions:

id: pc-01
category: factual/product-capabilities
prompt: |
  Does Workable include video interviews, assessments, and texting,
  or are those add-ons I'd have to pay extra for?
grading:
  rubric:
    - id: R1
      description: Correctly states these are included on Premier and Enterprise
      weight: 3
      pass_if: Response says these are included, or "built in," or "not add-ons"
    - id: R2
      description: Covers all three — video interviews, assessments, and texting/SMS
      weight: 2
      pass_if: All three feature types are addressed
  pass_threshold: 0.75
notes: |
  Source: plans.md — Add-ons table. This is context for eval authors only.

Rubric entries are automatically normalized for grading: description and pass_if are combined into a single criterion string.
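The exact normalization is internal to the grader, but conceptually it amounts to something like the sketch below. Function names, and the precise way description and pass_if are joined, are assumptions for illustration; the weighted-score comparison against pass_threshold follows the fields shown above:

```python
def normalize_rubric(rubric: list[dict]) -> list[str]:
    """Combine each entry's description and pass_if into one criterion string."""
    criteria = []
    for entry in rubric:
        text = entry["description"]
        if entry.get("pass_if"):
            text += f" (pass if: {entry['pass_if']})"
        criteria.append(text)
    return criteria

def weighted_score(rubric: list[dict], passed: list[bool]) -> float:
    """Fraction of total rubric weight earned by passing entries."""
    total = sum(e.get("weight", 1) for e in rubric)
    earned = sum(e.get("weight", 1) for e, ok in zip(rubric, passed) if ok)
    return earned / total

rubric = [
    {"id": "R1", "description": "States features are included", "weight": 3,
     "pass_if": "Response says included or built in"},
    {"id": "R2", "description": "Covers all three features", "weight": 2,
     "pass_if": "All three feature types are addressed"},
]
score = weighted_score(rubric, [True, False])  # only R1 passes: 3/5 = 0.6
print(score >= 0.75)  # False — the case fails a 0.75 pass_threshold
```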

Fields

Field         Required  Default   Description
name          Yes       filename  Human-readable case name
prompt        Yes       -         The user prompt sent to Claude
criteria      Yes*      -         List of pass/fail assertions
grading       Yes*      -         Structured rubric (alternative to criteria)
expect_skill  No        true      Whether the skill should trigger
timeout       No        120       Timeout in seconds for this case
files         No        -         Temp files created before execution
notes         No        -         Context for eval authors (not sent to grader)

* Provide either criteria or grading.rubric — one is required.
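A loader enforcing the table above might validate a parsed case roughly as follows. This is a sketch, not a real API — `validate_case` is a hypothetical name, and it checks only the "one of criteria or grading.rubric is required" rule plus the required prompt:

```python
def validate_case(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case is valid."""
    errors = []
    if "prompt" not in case:
        errors.append("prompt is required")
    has_criteria = bool(case.get("criteria"))
    # `or {}` guards against an empty `grading:` key parsed as None
    has_rubric = bool((case.get("grading") or {}).get("rubric"))
    if not (has_criteria or has_rubric):
        errors.append("provide either criteria or grading.rubric")
    return errors

print(validate_case({"prompt": "Write a variable block"}))
# ['provide either criteria or grading.rubric']
```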

Writing good criteria

Criteria are graded by a separate Claude call. Write them to be specific and verifiable:

# Good - specific, verifiable
criteria:
  - "Output contains a valid HCL resource block"
  - "Uses for_each, not count, for multiple resources"
  - "Tags include Environment and ManagedBy keys"

# Bad - vague, subjective
criteria:
  - "Output is good"
  - "Follows best practices"
  - "Code is clean"

With rubric format, use pass_if for more precise grading instructions:

grading:
  rubric:
    - id: R1
      description: Uses for_each for multiple resources
      weight: 3
      pass_if: Response uses for_each with a map or set, not count with an index

Negative trigger cases

Include at least one case that should NOT trigger the skill:

name: Negative trigger - Python question
prompt: "How do I read a CSV file in Python?"
expect_skill: false
criteria:
  - "Response does NOT reference Terraform or HCL"
  - "Response provides Python-related guidance"
timeout: 30

Providing context files

Use files to create temporary files before execution:

name: Improve existing Dockerfile
prompt: "Optimize this Dockerfile for production"
files:
  - path: "Dockerfile"
    content: |
      FROM node:latest
      COPY . .
      RUN npm install
      CMD ["node", "index.js"]
criteria:
  - "Uses multi-stage build"
  - "Does not use :latest tag"
  - "Runs as non-root user"
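The harness materializes these entries before running the case. Conceptually the setup step looks like the stdlib sketch below (`materialize_files` is illustrative, not the actual implementation):

```python
import tempfile
from pathlib import Path

def materialize_files(files: list[dict]) -> Path:
    """Write each files entry into a fresh temp directory and return its path."""
    workdir = Path(tempfile.mkdtemp(prefix="eval-"))
    for spec in files:
        target = workdir / spec["path"]
        target.parent.mkdir(parents=True, exist_ok=True)  # allow nested paths
        target.write_text(spec["content"])
    return workdir

workdir = materialize_files([
    {"path": "Dockerfile", "content": "FROM node:latest\n"},
])
print((workdir / "Dockerfile").read_text())
# FROM node:latest
```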

How many cases?

  • Minimum: 3-5 per skill
  • At least 1 negative trigger case
  • Cover: happy path, edge cases, error handling
  • More cases = more reliable signal, but higher cost

Tips

  • Keep prompts realistic - use the same phrasing a real user would type
  • Criteria should check the output, not the process
  • Use timeout: 30 for negative trigger cases (they're fast)
  • If a criterion fails consistently, the skill needs improvement (not the eval)