# Writing Evals
Eval cases are YAML files that define test scenarios for your skills.
## File format

Place `.yaml` or `.yml` files in `<skill-path>/evals/`. Two formats are supported.
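A typical layout looks like the sketch below (file names are illustrative, and it assumes one case per file, matching the examples that follow):

```
my-skill/
├── SKILL.md
└── evals/
    ├── variable-ordering.yaml
    └── negative-trigger.yaml
```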
### Flat criteria (simple)
```yaml
name: Variable declaration ordering
prompt: "Write a Terraform variable block for a project_id string"
files:
  - path: "main.tf"
    content: |
      resource "aws_instance" "web" {}
criteria:
  - "Variable block has description as the first field"
  - "Uses type = string"
  - "Does not include a default value for required variables"
expect_skill: true
timeout: 120
```
### Rubric format (structured)

For richer grading with weighted criteria and specific pass conditions:
```yaml
id: pc-01
category: factual/product-capabilities
prompt: |
  Does Workable include video interviews, assessments, and texting,
  or are those add-ons I'd have to pay extra for?
grading:
  rubric:
    - id: R1
      description: Correctly states these are included on Premier and Enterprise
      weight: 3
      pass_if: Response says these are included, or "built in," or "not add-ons"
    - id: R2
      description: Covers all three — video interviews, assessments, and texting/SMS
      weight: 2
      pass_if: All three feature types are addressed
  pass_threshold: 0.75
notes: |
  Source: plans.md — Add-ons table. This is context for eval authors only.
```
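Assuming the score is computed as the weight of passing entries divided by the total weight, passing only R1 yields 3/5 = 0.6 and passing only R2 yields 2/5 = 0.4, both below the 0.75 threshold; this case effectively requires both entries to pass.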
Rubric entries are automatically normalized for grading: `description` and `pass_if` are combined into a single criterion string.
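The exact combined format isn't specified here, but conceptually R1 above is graded as a single assertion along the lines of:

```
Correctly states these are included on Premier and Enterprise (pass if: Response says these are included, or "built in," or "not add-ons")
```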
## Fields

| Field | Required | Default | Description |
|---|---|---|---|
| `name` | Yes | filename | Human-readable case name |
| `prompt` | Yes | - | The user prompt sent to Claude |
| `criteria` | Yes* | - | List of pass/fail assertions |
| `grading` | Yes* | - | Structured rubric (alternative to `criteria`) |
| `expect_skill` | No | `true` | Whether the skill should trigger |
| `timeout` | No | `120` | Timeout in seconds for this case |
| `files` | No | - | Temp files created before execution |
| `notes` | No | - | Context for eval authors (not sent to grader) |

\* Provide either `criteria` or `grading.rubric` — one is required.
## Writing good criteria

Criteria are graded by a separate Claude call. Write them to be specific and verifiable:
```yaml
# Good - specific, verifiable
criteria:
  - "Output contains a valid HCL resource block"
  - "Uses for_each, not count, for multiple resources"
  - "Tags include Environment and ManagedBy keys"

# Bad - vague, subjective
criteria:
  - "Output is good"
  - "Follows best practices"
  - "Code is clean"
```
With the rubric format, use `pass_if` for more precise grading instructions:
```yaml
grading:
  rubric:
    - id: R1
      description: Uses for_each for multiple resources
      weight: 3
      pass_if: Response uses for_each with a map or set, not count with an index
```
## Negative trigger cases

Include at least one case that should NOT trigger the skill:
```yaml
name: Negative trigger - Python question
prompt: "How do I read a CSV file in Python?"
expect_skill: false
criteria:
  - "Response does NOT reference Terraform or HCL"
  - "Response provides Python-related guidance"
timeout: 30
```
## Providing context files

Use `files` to create temporary files before execution:
```yaml
name: Improve existing Dockerfile
prompt: "Optimize this Dockerfile for production"
files:
  - path: "Dockerfile"
    content: |
      FROM node:latest
      COPY . .
      RUN npm install
      CMD ["node", "index.js"]
criteria:
  - "Uses multi-stage build"
  - "Does not use :latest tag"
  - "Runs as non-root user"
```
## How many cases?
- Minimum: 3-5 per skill
- At least 1 negative trigger case
- Cover: happy path, edge cases, error handling
- More cases = more reliable signal, but higher cost
## Tips
- Keep prompts realistic - use the same phrasing a real user would type
- Criteria should check the output, not the process (see the example below)
- Use `timeout: 30` for negative trigger cases (they're fast)
- If a criterion fails consistently, the skill needs improvement (not the eval)
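As a hypothetical illustration of the output-vs-process tip:

```yaml
criteria:
  # Bad - asserts on how Claude worked, which the grader can't verify from the output
  - "Claude reads the existing Dockerfile before editing it"
  # Good - asserts on what the final output contains
  - "The resulting Dockerfile pins an explicit Node version"
```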