Automated evaluation for Claude Code skills. Grade responses with evidence, track regressions, and report results directly in pull requests.
```yaml
- uses: skill-bench/skill-eval-action@v1
  with:
    skill-name: my-skill
    skill-path: ./skills/my-skill
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
```
Everything you need to test, grade, and ship reliable AI skills.
Runs each eval case through `claude -p`, with configurable timeouts and automatic retries using exponential backoff.
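A sketch of how those knobs might look as action inputs; `eval-timeout` and `max-retries` are hypothetical names used for illustration, not confirmed inputs of the action:

```yaml
- uses: skill-bench/skill-eval-action@v1
  with:
    skill-name: my-skill
    skill-path: ./skills/my-skill
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
    # Hypothetical inputs -- check the action's documented inputs for real names:
    eval-timeout: 120   # seconds allowed per claude -p invocation
    max-retries: 3      # failed runs retried with exponential backoff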
A separate grader scores each criterion with quoted evidence from the response. No guessing.
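To make "quoted evidence" concrete, a single graded criterion might look like the record below; the field names are illustrative assumptions, not the action's actual output schema:

```yaml
# Illustrative shape of one graded criterion; keys are assumptions.
criterion: "Recommends exponential backoff for 429 responses"
verdict: pass
evidence: >
  "If you keep hitting 429s, back off exponentially between retries..."
```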
Pass/fail results posted directly to pull requests with detailed breakdowns per skill.
HTML artifact with full grading details, benchmark comparisons, and drill-down data.
Matrix strategy for evaluating multiple skills simultaneously across your CI pipeline.
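One way to fan out across skills is a standard GitHub Actions matrix; the skill names below are placeholders:

```yaml
jobs:
  eval:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false                  # grade every skill even if one fails
      matrix:
        skill: [my-skill, other-skill]  # placeholder skill names
    steps:
      - uses: actions/checkout@v4
      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: ./skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
```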
Only evaluate changed skills on PRs. Skip unchanged code to keep feedback loops fast.
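Check the action's docs for any built-in change detection; as a generic sketch, one common pattern pairs the third-party dorny/paths-filter action with a dynamic matrix so only skills whose files changed get evaluated. The filter names and paths are placeholders:

```yaml
jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      skills: ${{ steps.filter.outputs.changes }}  # JSON array of matched filters
    steps:
      - uses: actions/checkout@v4
      - id: filter
        uses: dorny/paths-filter@v3
        with:
          filters: |
            my-skill: 'skills/my-skill/**'
            other-skill: 'skills/other-skill/**'
  eval:
    needs: detect
    if: ${{ needs.detect.outputs.skills != '[]' }}  # skip when nothing changed
    runs-on: ubuntu-latest
    strategy:
      matrix:
        skill: ${{ fromJSON(needs.detect.outputs.skills) }}
    steps:
      - uses: actions/checkout@v4
      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: ./skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
```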
1. Define test prompts and grading criteria in YAML files alongside your skills (a sketch of such a file follows this list).
2. Drop the action into your workflow and configure the skill path and API key.
3. Automated grading posts evidence-backed scores as PR comments.
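An eval file might look like the following; the file location and key names are assumptions for illustration, not a confirmed schema:

```yaml
# skills/my-skill/evals.yml -- illustrative layout; keys are assumptions.
cases:
  - name: handles-rate-limits
    prompt: "My API calls keep failing with 429 errors. What should I do?"
    criteria:
      - "Identifies 429 as a rate-limit response"
      - "Recommends retrying with exponential backoff"
```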
Open source. Free to use. Set up in under 5 minutes.