# Writing Evals
Eval cases are YAML files that define test scenarios for your skills.
## File format

Place `.yaml` or `.yml` files in `<skill-path>/evals/`. Two formats are supported.
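A typical layout looks like the sketch below (file names are illustrative, and it assumes one case per file, matching the examples that follow):

```
my-skill/
├── SKILL.md
└── evals/
    ├── variable-ordering.yaml
    └── negative-trigger.yaml
```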
### Flat criteria (simple)
```yaml
name: Variable declaration ordering
prompt: "Write a Terraform variable block for a project_id string"
files:
  - path: "main.tf"
    content: |
      resource "aws_instance" "web" {}
criteria:
  - "Variable block has description as the first field"
  - "Uses type = string"
  - "Does not include a default value for required variables"
expect_skill: true
timeout: 120
```
### Rubric format (structured)

For richer grading with weighted criteria and specific pass conditions:
```yaml
id: pc-01
category: factual/product-capabilities
prompt: |
  Does Workable include video interviews, assessments, and texting,
  or are those add-ons I'd have to pay extra for?
grading:
  rubric:
    - id: R1
      description: Correctly states these are included on Premier and Enterprise
      weight: 3
      pass_if: Response says these are included, or "built in," or "not add-ons"
    - id: R2
      description: Covers all three — video interviews, assessments, and texting/SMS
      weight: 2
      pass_if: All three feature types are addressed
  pass_threshold: 0.75
notes: |
  Source: plans.md — Add-ons table. This is context for eval authors only.
```
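Assuming the score is computed as the weight of passing entries divided by the total weight, passing only R1 yields 3/5 = 0.6 and passing only R2 yields 2/5 = 0.4, both below the 0.75 threshold; this case effectively requires both entries to pass.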
Rubric entries are automatically normalized for grading: `description` and `pass_if` are combined into a single criterion string.
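The exact combined format isn't specified here, but conceptually R1 above is graded as a single assertion along the lines of:

```
Correctly states these are included on Premier and Enterprise (pass if: Response says these are included, or "built in," or "not add-ons")
```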
## Fields

| Field | Required | Default | Description |
|---|---|---|---|
| `name` | Yes | filename | Human-readable case name |
| `prompt` | Yes | - | The user prompt sent to Claude |
| `criteria` | Yes* | - | List of pass/fail assertions |
| `grading` | Yes* | - | Structured rubric (alternative to `criteria`) |
| `expect_skill` | No | `true` | Whether the skill should trigger |
| `timeout` | No | `120` | Timeout in seconds for this case |
| `files` | No | - | Temp files created before execution |
| `notes` | No | - | Context for eval authors (not sent to grader) |

\* Provide either `criteria` or `grading.rubric` — one is required.
## Writing good criteria

Criteria are graded by a separate Claude call. Write them to be specific and verifiable:
```yaml
# Good - specific, verifiable
criteria:
  - "Output contains a valid HCL resource block"
  - "Uses for_each, not count, for multiple resources"
  - "Tags include Environment and ManagedBy keys"

# Bad - vague, subjective
criteria:
  - "Output is good"
  - "Follows best practices"
  - "Code is clean"
```
With the rubric format, use `pass_if` for more precise grading instructions:
```yaml
grading:
  rubric:
    - id: R1
      description: Uses for_each for multiple resources
      weight: 3
      pass_if: Response uses for_each with a map or set, not count with an index
```
## Negative trigger cases

Include at least one case that should NOT trigger the skill:
```yaml
name: Negative trigger - Python question
prompt: "How do I read a CSV file in Python?"
expect_skill: false
criteria:
  - "Response does NOT reference Terraform or HCL"
  - "Response provides Python-related guidance"
timeout: 30
```
## Providing context files

Use `files` to create temporary files before execution:
```yaml
name: Improve existing Dockerfile
prompt: "Optimize this Dockerfile for production"
files:
  - path: "Dockerfile"
    content: |
      FROM node:latest
      COPY . .
      RUN npm install
      CMD ["node", "index.js"]
criteria:
  - "Uses multi-stage build"
  - "Does not use :latest tag"
  - "Runs as non-root user"
```
## How many cases?
- Minimum: 3-5 per skill
- At least 1 negative trigger case
- Cover: happy path, edge cases, error handling
- More cases = more reliable signal, but higher cost
## Tips
- Keep prompts realistic - use the same phrasing a real user would type
- Criteria should check the output, not the process (see the example below)
- Use `timeout: 30` for negative trigger cases (they're fast)
- If a criterion fails consistently, the skill needs improvement (not the eval)
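As a hypothetical illustration of the output-vs-process tip:

```yaml
criteria:
  # Bad - asserts on how Claude worked, which the grader can't verify from the output
  - "Claude reads the existing Dockerfile before editing it"
  # Good - asserts on what the final output contains
  - "The resulting Dockerfile pins an explicit Node version"
```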