Non-determinism
LLM-based evals are non-deterministic. On each run, Claude generates a slightly different response, and the grader may evaluate it slightly differently. The same skill, with no changes, can produce different pass rates across runs.
This is why:
- The default pass threshold is 80, not 100
- The agentskills.io best practices say "occasional flakiness is expected"
- Multiple runs + aggregation gives a more reliable picture
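To build intuition for why a single run is a noisy signal, here is a minimal sketch that simulates an eval in which each criterion passes with some probability. The function name and the probabilities are hypothetical, not part of any real harness.

```python
import random

# Hypothetical sketch: each criterion passes with probability p_pass,
# so the overall pass rate fluctuates from run to run even though
# nothing about the skill or the eval has changed.
def run_eval(n_criteria=10, p_pass=0.9, rng=None):
    rng = rng or random.Random()
    passed = sum(rng.random() < p_pass for _ in range(n_criteria))
    return 100 * passed // n_criteria  # pass rate as a percentage

rng = random.Random(0)
rates = [run_eval(rng=rng) for _ in range(5)]
print(rates)  # pass rates can differ across runs despite identical inputs
```

This is why a single below-threshold run is weak evidence of a regression, while a consistent drop across several runs is strong evidence.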
Reducing flakiness
Relax criteria
Make criteria less brittle. Instead of requiring exact output, check for the presence of key patterns:
```yaml
# Brittle - depends on exact SHA resolution
criteria:
  - "All actions pinned to 40-char SHA"
```

```yaml
# Resilient - accepts reasonable alternatives
criteria:
  - "Uses SHA pinning or explains how to resolve SHAs"
```
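One way to make a criterion check resilient in code is to accept any of several patterns rather than one exact string. The helper below is a hypothetical sketch of that idea, not the real grader:

```python
import re

# Hypothetical grader helper: a criterion passes if the response matches
# ANY acceptable pattern, instead of requiring one exact output.
def meets_criterion(response: str, patterns: list[str]) -> bool:
    return any(re.search(p, response, re.IGNORECASE) for p in patterns)

# Accept either a literal 40-char SHA pin or an explanation of how to resolve one.
resilient = [r"pinned to [0-9a-f]{40}", r"resolve.*SHA"]
print(meets_criterion("Each action is pinned to " + "a" * 40, resilient))  # True
```

The design choice is the same as in the YAML above: enumerate the reasonable alternatives instead of hard-coding one.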
Lower threshold
Accept that 70-80% is a realistic pass rate for LLM evals. Use the threshold as a regression detector, not a perfection checker.
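In code, "regression detector, not perfection checker" just means comparing against a floor rather than 100. A minimal sketch, assuming the default threshold of 80 from above:

```python
# Treat the threshold as a regression floor: a run fails only if it
# drops below the floor, never because it falls short of 100.
THRESHOLD = 80  # default pass threshold

def check(pass_rate: float, threshold: int = THRESHOLD) -> str:
    return "ok" if pass_rate >= threshold else "regression"

print(check(85))  # "ok" - a realistic pass rate for an LLM eval
print(check(60))  # "regression" - worth investigating
```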
Run multiple times
Aggregate results across multiple runs for a stable signal. A skill that passes 4/5 times at 80%+ is reliable.
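The 4-of-5 heuristic above can be sketched as a small aggregation function; the function name and defaults are illustrative, not a real API:

```python
# Hypothetical aggregation: a skill counts as reliable if at least
# min_passing of its runs meet the threshold (4 of 5 at 80%, per the
# heuristic above).
def is_reliable(run_pass_rates, threshold=80, min_passing=4):
    passing = sum(rate >= threshold for rate in run_pass_rates)
    return passing >= min_passing

print(is_reliable([85, 90, 75, 82, 88]))  # True: 4 of 5 runs >= 80
print(is_reliable([85, 70, 75, 82, 60]))  # False: only 2 runs >= 80
```

Aggregating this way turns a noisy per-run signal into a stable pass/fail decision.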
Why not temperature=0?
Even with temperature=0, LLM outputs vary due to:
- Different caching states between runs
- Token sampling at very close probability boundaries
- Tool use decisions varying based on context ordering