Writing Evals

Eval cases are YAML files that define test scenarios for your skills.

File format

Place YAML files in <skill-path>/evals/:

name: Variable declaration ordering
prompt: "Write a Terraform variable block for a project_id string"
files:
  - path: "main.tf"
    content: |
      resource "aws_instance" "web" {}
criteria:
  - "Variable block has description as the first field"
  - "Uses type = string"
  - "Does not include a default value for required variables"
expect_skill: true
timeout: 120

Fields

Field         Required  Default   Description
name          Yes       filename  Human-readable case name
prompt        Yes       -         The user prompt sent to Claude
criteria      Yes       -         List of pass/fail assertions
expect_skill  No        true      Whether the skill should trigger
timeout       No        120       Timeout in seconds for this case
files         No        -         Temp files created before execution
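The required/optional split and defaults in the table can be expressed as a small validator. This is an illustrative sketch only — the field names come from the table above, but the harness's real schema-checking code is not shown in these docs, so the function and structure here are assumptions.

```python
# Illustrative sketch (assumed, not the harness's actual code): apply the
# field rules from the table above to a single eval case.

REQUIRED = {"name", "prompt", "criteria"}
DEFAULTS = {"expect_skill": True, "timeout": 120, "files": []}

def validate_case(case: dict) -> dict:
    """Return the case with defaults filled in, or raise on missing fields."""
    missing = REQUIRED - case.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if not isinstance(case["criteria"], list) or not case["criteria"]:
        raise ValueError("criteria must be a non-empty list")
    return {**DEFAULTS, **case}

case = validate_case({
    "name": "Variable declaration ordering",
    "prompt": "Write a Terraform variable block for a project_id string",
    "criteria": ["Uses type = string"],
})
print(case["timeout"])       # default of 120 applied
print(case["expect_skill"])  # default of True applied
```

Defaults merge under explicit values, so a case that sets `timeout: 30` keeps its own value.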

Writing good criteria

Criteria are graded by a separate Claude call. Write them to be specific and verifiable:

# Good - specific, verifiable
criteria:
  - "Output contains a valid HCL resource block"
  - "Uses for_each, not count, for multiple resources"
  - "Tags include Environment and ManagedBy keys"

# Bad - vague, subjective
criteria:
  - "Output is good"
  - "Follows best practices"
  - "Code is clean"

Negative trigger cases

Include at least one case that should NOT trigger the skill:

name: Negative trigger - Python question
prompt: "How do I read a CSV file in Python?"
expect_skill: false
criteria:
  - "Response does NOT reference Terraform or HCL"
  - "Response provides Python-related guidance"
timeout: 30

Providing context files

Use the files field to create temporary files in the working directory before the prompt runs:

name: Improve existing Dockerfile
prompt: "Optimize this Dockerfile for production"
files:
  - path: "Dockerfile"
    content: |
      FROM node:latest
      COPY . .
      RUN npm install
      CMD ["node", "index.js"]
criteria:
  - "Uses multi-stage build"
  - "Does not use :latest tag"
  - "Runs as non-root user"
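Conceptually, each files entry is written into a fresh working directory before the prompt is sent. The sketch below shows one plausible way a harness might do that; the function name and the exact behavior (e.g. support for nested paths) are assumptions, not documented harness behavior.

```python
import tempfile
from pathlib import Path

# Hypothetical sketch: materialize `files` entries into a temp working
# directory before running the prompt. Not the harness's actual code.
def materialize_files(files: list[dict]) -> Path:
    workdir = Path(tempfile.mkdtemp())
    for f in files:
        target = workdir / f["path"]
        target.parent.mkdir(parents=True, exist_ok=True)  # allow nested paths
        target.write_text(f["content"])
    return workdir

workdir = materialize_files([
    {"path": "Dockerfile", "content": "FROM node:latest\nCOPY . .\n"},
])
print((workdir / "Dockerfile").read_text())
```

Because each case gets its own directory, cases cannot see each other's files.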

How many cases?

  • Minimum: 3-5 per skill
  • At least 1 negative trigger case
  • Cover: happy path, edge cases, error handling
  • More cases = more reliable signal, but higher cost
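The checklist above lends itself to a quick automated sanity check over a suite of parsed cases. This is a sketch under the assumption that cases are available as dicts with the fields described earlier; it is not part of any documented tooling.

```python
# Sketch (assumed, not part of the harness): sanity-check a suite of eval
# cases against the minimums above.
def check_suite(cases: list[dict]) -> list[str]:
    problems = []
    if len(cases) < 3:
        problems.append("fewer than 3 cases")
    # expect_skill defaults to True, so only an explicit false counts
    if not any(c.get("expect_skill", True) is False for c in cases):
        problems.append("no negative trigger case")
    return problems

suite = [
    {"name": "happy path", "expect_skill": True},
    {"name": "edge case"},
    {"name": "negative trigger", "expect_skill": False},
]
print(check_suite(suite))  # → []
```

An empty result means the suite meets the minimums; anything else names what is missing.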

Tips

  • Keep prompts realistic - use the same phrasing a real user would type
  • Criteria should check the output, not the process
  • Use timeout: 30 for negative trigger cases (they're fast)
  • If a criterion fails consistently, the skill needs improvement (not the eval)