Non-determinism

LLM-based evals are non-deterministic. On each run, Claude generates a slightly different response, and the grader may evaluate it slightly differently. The same skill, with no changes, can produce different pass rates across runs.

This is why:

  • The default pass threshold is 80%, not 100%
  • The agentskills.io best practices say "occasional flakiness is expected"
  • Multiple runs + aggregation gives a more reliable picture

Reducing flakiness

Relax criteria

Make criteria less brittle. Instead of requiring exact output, check for the presence of key patterns:

# Brittle - depends on exact SHA resolution
criteria:
- "All actions pinned to 40-char SHA"

# Resilient - accepts reasonable alternatives
criteria:
- "Uses SHA pinning or explains how to resolve SHAs"
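The difference between the two criteria can be sketched as grader logic. This is a minimal illustration, not the real grader: the helper names, the regex, and the sample workflow text are all assumptions.

```python
import re

# A 40-char hex SHA after the "@" in a `uses:` reference.
SHA_PIN = re.compile(r"uses:\s*\S+@[0-9a-f]{40}\b")

def brittle_check(workflow: str) -> bool:
    """Pass only if every `uses:` line is pinned to a 40-char SHA."""
    uses_lines = re.findall(r"uses:\s*\S+", workflow)
    return all(SHA_PIN.search(line) for line in uses_lines)

def resilient_check(workflow: str, explanation: str) -> bool:
    """Pass if actions are SHA-pinned OR the response explains how to resolve SHAs."""
    return brittle_check(workflow) or "resolve" in explanation.lower()

workflow = "uses: actions/checkout@v4"  # tag reference, not a SHA
explanation = "Resolve each tag to its commit SHA with `git ls-remote`."

print(brittle_check(workflow))                 # False: tag is not a 40-char SHA
print(resilient_check(workflow, explanation))  # True: explanation is accepted
```

The resilient version still catches the failure mode (unpinned actions with no justification) while tolerating reasonable alternative answers.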

Lower threshold

Accept that 70-80% is a realistic pass rate for LLM evals. Use the threshold as a regression detector, not a perfection checker.
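Treating the threshold as a regression detector might look like the following sketch. The 80% default comes from the text; the function name and the sample scores are illustrative assumptions.

```python
THRESHOLD = 0.80  # default pass threshold from the eval config

def regressed(criterion_scores: list[float], threshold: float = THRESHOLD) -> bool:
    """Flag a regression when the overall pass rate drops below the threshold.

    A score of 1.0 means a criterion passed, 0.0 means it failed.
    """
    pass_rate = sum(criterion_scores) / len(criterion_scores)
    return pass_rate < threshold

print(regressed([1, 1, 1, 0, 1]))  # 80% pass rate: at threshold, no regression
print(regressed([1, 0, 0, 1, 1]))  # 60% pass rate: regression flagged
```

A run at exactly 80% is treated as healthy; the alert fires only when the rate drops below the bar, which is the behavior you want from a regression detector rather than a perfection checker.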

Run multiple times

Aggregate results across multiple runs for a stable signal. A skill that passes 4/5 times at 80%+ is reliable.
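The "4/5 runs at 80%+" rule above can be expressed as a small aggregation helper. This is a sketch under the numbers stated in the text; the function name and sample pass rates are assumptions.

```python
def stable(run_pass_rates: list[float], threshold: float = 0.80,
           required_passes: int = 4) -> bool:
    """True when enough runs meet the per-run pass threshold.

    Mirrors the rule in the text: 4 of 5 runs at 80%+ counts as reliable.
    """
    passing_runs = sum(rate >= threshold for rate in run_pass_rates)
    return passing_runs >= required_passes

print(stable([0.85, 0.90, 0.80, 0.75, 0.88]))  # 4/5 runs at 80%+: stable
print(stable([0.85, 0.70, 0.80, 0.75, 0.88]))  # only 3/5: not stable
```

Aggregating this way turns per-run noise into a binary signal you can gate CI on, instead of failing the build on a single flaky run.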

Why not temperature=0?

Even with temperature=0, LLM outputs vary due to:

  • Different caching states between runs
  • Token sampling at very close probability boundaries
  • Tool use decisions varying based on context ordering