# Writing Evals
Eval cases are YAML files that define test scenarios for your skills.
## File format
Place YAML files in `<skill-path>/evals/`:

```yaml
name: Variable declaration ordering
prompt: "Write a Terraform variable block for a project_id string"
files:
  - path: "main.tf"
    content: |
      resource "aws_instance" "web" {}
criteria:
  - "Variable block has description as the first field"
  - "Uses type = string"
  - "Does not include a default value for required variables"
expect_skill: true
timeout: 120
```
## Fields

| Field | Required | Default | Description |
|---|---|---|---|
| `name` | Yes | filename | Human-readable case name |
| `prompt` | Yes | - | The user prompt sent to Claude |
| `criteria` | Yes | - | List of pass/fail assertions |
| `expect_skill` | No | `true` | Whether the skill should trigger |
| `timeout` | No | `120` | Timeout in seconds for this case |
| `files` | No | - | Temp files created before execution |
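As a sketch of how a harness might consume this schema, the helper below validates a parsed case and applies the defaults from the table above. The function name `normalize_case` and its error handling are illustrative assumptions, not the harness's actual API:

```python
REQUIRED = ("prompt", "criteria")
DEFAULTS = {"expect_skill": True, "timeout": 120, "files": []}

def normalize_case(case: dict, filename: str) -> dict:
    """Validate one parsed eval case and fill in schema defaults.

    `case` is the dict produced by parsing the YAML file; `filename`
    supplies the fallback for the optional `name` field.
    """
    missing = [k for k in REQUIRED if k not in case]
    if missing:
        raise ValueError(f"{filename}: missing required fields: {missing}")
    out = {**DEFAULTS, **case}          # explicit values override defaults
    out.setdefault("name", filename)    # name defaults to the filename
    return out
```

For example, a case with only `prompt` and `criteria` comes back with `expect_skill: True` and `timeout: 120` filled in.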
## Writing good criteria
Criteria are graded by a separate Claude call. Write them to be specific and verifiable:
```yaml
# Good - specific, verifiable
criteria:
  - "Output contains a valid HCL resource block"
  - "Uses for_each, not count, for multiple resources"
  - "Tags include Environment and ManagedBy keys"
```

```yaml
# Bad - vague, subjective
criteria:
  - "Output is good"
  - "Follows best practices"
  - "Code is clean"
```
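Since criteria are graded by a separate Claude call, each one needs to read as a standalone pass/fail question. A minimal sketch of how such a grading prompt could be assembled; the exact wording and the `build_grading_prompt` name are assumptions, not the real harness's prompt:

```python
def build_grading_prompt(output: str, criteria: list[str]) -> str:
    """Format a pass/fail grading prompt for a separate grader call.

    Each criterion is numbered so the grader can answer per-item.
    """
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, 1))
    return (
        "Grade the output below against each criterion.\n"
        "Answer PASS or FAIL for each, with a one-line reason.\n\n"
        f"Criteria:\n{numbered}\n\nOutput:\n{output}"
    )
```

A vague criterion like "Code is clean" gives the grader nothing concrete to answer PASS or FAIL about, which is why the specific phrasings above grade more reliably.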
## Negative trigger cases
Include at least one case that should NOT trigger the skill:
```yaml
name: Negative trigger - Python question
prompt: "How do I read a CSV file in Python?"
expect_skill: false
criteria:
  - "Response does NOT reference Terraform or HCL"
  - "Response provides Python-related guidance"
timeout: 30
```
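Comparing observed skill activation against `expect_skill` is a simple equality check, but the failure messages differ by direction. A sketch, assuming the harness exposes whether the skill triggered as a boolean; `check_trigger` is a hypothetical helper:

```python
def check_trigger(skill_triggered: bool, expect_skill: bool) -> tuple[bool, str]:
    """Compare observed skill activation against the case's expectation."""
    if skill_triggered == expect_skill:
        return True, "trigger behavior matched expectation"
    if expect_skill:
        return False, "skill should have triggered but did not"
    return False, "skill triggered on a prompt it should have ignored"
```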
## Providing context files

Use `files` to create temporary files before execution:
```yaml
name: Improve existing Dockerfile
prompt: "Optimize this Dockerfile for production"
files:
  - path: "Dockerfile"
    content: |
      FROM node:latest
      COPY . .
      RUN npm install
      CMD ["node", "index.js"]
criteria:
  - "Uses multi-stage build"
  - "Does not use :latest tag"
  - "Runs as non-root user"
```
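Conceptually, each `files` entry is written into a fresh working directory before the prompt runs. A minimal sketch using only the standard library; `materialize_files` is an illustrative name, not the harness's actual function:

```python
import tempfile
from pathlib import Path

def materialize_files(files: list[dict]) -> Path:
    """Write each {path, content} entry into a fresh temp working directory."""
    workdir = Path(tempfile.mkdtemp(prefix="eval-"))
    for entry in files:
        dest = workdir / entry["path"]
        dest.parent.mkdir(parents=True, exist_ok=True)  # allow nested paths
        dest.write_text(entry["content"])
    return workdir
```

Creating a fresh directory per case keeps cases isolated, so one case's files can never leak into another's context.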
## How many cases?
- Minimum: 3-5 per skill
- At least 1 negative trigger case
- Cover: happy path, edge cases, error handling
- More cases = more reliable signal, but higher cost
## Tips
- Keep prompts realistic - use the same phrasing a real user would type
- Criteria should check the output, not the process
- Use `timeout: 30` for negative trigger cases (they're fast)
- If a criterion fails consistently, the skill needs improvement (not the eval)