Non-determinism
LLM-based evals are non-deterministic. On each run, Claude generates a slightly different response, and the grader may evaluate it slightly differently. The same skill, with no changes, can produce different pass rates across runs.
This is why:
- The default pass threshold is 80, not 100
- The agentskills.io best practices say "occasional flakiness is expected"
- Multiple runs + aggregation gives a more reliable picture
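To build intuition for why a single run is a noisy signal, here is a minimal sketch that simulates an eval in which each criterion passes with some probability. The function name and the probabilities are hypothetical, not part of any real harness.

```python
import random

# Hypothetical sketch: each criterion passes with probability p_pass,
# so the overall pass rate fluctuates from run to run even though
# nothing about the skill or the eval has changed.
def run_eval(n_criteria=10, p_pass=0.9, rng=None):
    rng = rng or random.Random()
    passed = sum(rng.random() < p_pass for _ in range(n_criteria))
    return 100 * passed // n_criteria  # pass rate as a percentage

rng = random.Random(0)
rates = [run_eval(rng=rng) for _ in range(5)]
print(rates)  # pass rates can differ across runs despite identical inputs
```

This is why a single below-threshold run is weak evidence of a regression, while a consistent drop across several runs is strong evidence.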
Reducing flakiness
Relax criteria
Make criteria less brittle. Instead of requiring exact output, check for the presence of key patterns:
```yaml
# Brittle - depends on exact SHA resolution
criteria:
  - "All actions pinned to 40-char SHA"
```

```yaml
# Resilient - accepts reasonable alternatives
criteria:
  - "Uses SHA pinning or explains how to resolve SHAs"
```
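One way to make a criterion check resilient in code is to accept any of several patterns rather than one exact string. The helper below is a hypothetical sketch of that idea, not the real grader:

```python
import re

# Hypothetical grader helper: a criterion passes if the response matches
# ANY acceptable pattern, instead of requiring one exact output.
def meets_criterion(response: str, patterns: list[str]) -> bool:
    return any(re.search(p, response, re.IGNORECASE) for p in patterns)

# Accept either a literal 40-char SHA pin or an explanation of how to resolve one.
resilient = [r"pinned to [0-9a-f]{40}", r"resolve.*SHA"]
print(meets_criterion("Each action is pinned to " + "a" * 40, resilient))  # True
```

The design choice is the same as in the YAML above: enumerate the reasonable alternatives instead of hard-coding one.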
Lower threshold
Accept that 70-80% is a realistic pass rate for LLM evals. Use the threshold as a regression detector, not a perfection checker.
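In code, "regression detector, not perfection checker" just means comparing against a floor rather than 100. A minimal sketch, assuming the default threshold of 80 from above:

```python
# Treat the threshold as a regression floor: a run fails only if it
# drops below the floor, never because it falls short of 100.
THRESHOLD = 80  # default pass threshold

def check(pass_rate: float, threshold: int = THRESHOLD) -> str:
    return "ok" if pass_rate >= threshold else "regression"

print(check(85))  # "ok" - a realistic pass rate for an LLM eval
print(check(60))  # "regression" - worth investigating
```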
Run multiple times
Aggregate results across multiple runs for a stable signal. A skill that passes 4/5 times at 80%+ is reliable.
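The 4-of-5 heuristic above can be sketched as a small aggregation function; the function name and defaults are illustrative, not a real API:

```python
# Hypothetical aggregation: a skill counts as reliable if at least
# min_passing of its runs meet the threshold (4 of 5 at 80%, per the
# heuristic above).
def is_reliable(run_pass_rates, threshold=80, min_passing=4):
    passing = sum(rate >= threshold for rate in run_pass_rates)
    return passing >= min_passing

print(is_reliable([85, 90, 75, 82, 88]))  # True: 4 of 5 runs >= 80
print(is_reliable([85, 70, 75, 82, 60]))  # False: only 2 runs >= 80
```

Aggregating this way turns a noisy per-run signal into a stable pass/fail decision.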
Why not temperature=0?
Even with temperature=0, LLM outputs vary due to:
- Different caching states between runs
- Token sampling at very close probability boundaries
- Tool use decisions varying based on context ordering