Skill Bench

Benchmark your AI agent skills

Automated evaluation for Claude Code skills. Grade responses with evidence, track regressions, and report results directly in pull requests.

- uses: skill-bench/skill-eval-action@v1
  with:
    skill-name: my-skill
    skill-path: ./skills/my-skill
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}

Built for skill developers

Everything you need to test, grade, and ship reliable AI skills.

Automated Execution

Runs each eval case via claude -p with configurable timeouts and automatic retries with exponential backoff.
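As a rough sketch, timeout and retry behavior could be tuned through action inputs; the input names below (timeout-seconds, max-retries) are illustrative assumptions, not confirmed options, so check the action's documentation for the real ones.

- uses: skill-bench/skill-eval-action@v1
  with:
    skill-name: my-skill
    skill-path: ./skills/my-skill
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
    # Hypothetical tuning inputs -- names are assumptions for illustration only
    timeout-seconds: 300
    max-retries: 3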

Evidence-Based Grading

A separate grader scores each criterion with quoted evidence from the response. No guessing.
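To illustrate the idea, a graded criterion might pair a pass/fail verdict with a quoted excerpt from the model's response. The field names and content below are a hypothetical sketch, not the actual output schema.

# Hypothetical grader output for one criterion
criterion: mentions-rate-limit-handling
pass: true
evidence: "The skill retries the request after a 429 response using exponential backoff."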

PR Reporting

Pass/fail results posted directly to pull requests with detailed breakdowns per skill.
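Posting comments on a pull request generally requires write access to pull requests. Assuming the action authenticates with the default GITHUB_TOKEN, a minimal permissions block in the workflow would look like this:

permissions:
  contents: read
  pull-requests: write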

Interactive Viewer

HTML artifact with full grading details, benchmark comparisons, and drill-down data.

Parallel Execution

Matrix strategy for evaluating multiple skills simultaneously across your CI pipeline.
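A standard GitHub Actions matrix can fan evaluation out across skills; the skill names below are placeholders for your own.

jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        skill: [pdf-parser, code-reviewer, changelog-writer]
    steps:
      - uses: actions/checkout@v4
      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: ./skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}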

Smart Targeting

Only evaluate changed skills on PRs. Skip unchanged code to keep feedback loops fast.
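One simple way to approximate this with plain GitHub Actions is a path filter, so the workflow only runs when skill files change; the action itself may handle finer-grained targeting.

on:
  pull_request:
    paths:
      - 'skills/**'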

How it works

1

Write eval cases

Define test prompts and grading criteria in YAML files alongside your skills.
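A sketch of what an eval case file might contain is shown below. The exact schema (field names like cases, prompt, and criteria) is an assumption here; consult the project's documentation for the real format.

# skills/my-skill/evals/basic.yaml (hypothetical layout)
cases:
  - name: summarizes-changelog
    prompt: "Summarize the changes in CHANGELOG.md for a release announcement."
    criteria:
      - "Mentions the new version number"
      - "Lists breaking changes before new features"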

2

Add the GitHub Action

Drop the action into your workflow. Configure the skill path and API key.
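Placing the snippet from above into a complete workflow file might look like this; the trigger, file name, and job name are your choice.

# .github/workflows/skill-eval.yml
name: Skill Eval
on:
  pull_request:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: my-skill
          skill-path: ./skills/my-skill
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}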

3

Get results on every PR

Automated grading with evidence-backed scores posted as PR comments.

Start evaluating your skills

Open source. Free to use. Set up in under 5 minutes.