
How I Validate Move Classification Quality

Sanity checks, confidence bands, and regression tests for move labels.

Move labels start with deterministic centipawn thresholds, then get softened or overridden based on depth, stability, mate context, and MultiPV ambiguity.

Move classification quality is not just about choosing centipawn thresholds.

It is also about knowing when not to trust a harsh label yet.

That realization changed how I built classification in ChessIQ. The system starts with deterministic CPL buckets, then applies safeguards that reduce overconfident judgments when engine evidence is weak or ambiguous.

Why raw CPL is not enough

A naive classifier maps centipawn loss directly to labels and stops there.

My baseline buckets are straightforward:

  • excellent: CPL < 12
  • good: CPL < 45
  • inaccuracy: CPL < 120
  • mistake: CPL < 260
  • blunder: CPL >= 260
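The buckets above are simple enough to sketch directly. This is an illustrative version, not ChessIQ's actual code; `classifyByCpl` and `MoveLabel` are names I am inventing here:

```typescript
// Hypothetical sketch of the baseline CPL buckets; thresholds are the ones
// listed above, names are illustrative.
type MoveLabel = "excellent" | "good" | "inaccuracy" | "mistake" | "blunder";

function classifyByCpl(cpl: number): MoveLabel {
  if (cpl < 12) return "excellent";
  if (cpl < 45) return "good";
  if (cpl < 120) return "inaccuracy";
  if (cpl < 260) return "mistake";
  return "blunder";
}
```

Note that the boundaries are half-open: a CPL of exactly 12 is "good", and exactly 260 is "blunder".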

That gives consistency, but not reliability on its own. Search depth, tactical context, evaluation stability, and candidate-move ambiguity all affect how trustworthy a label is.

The safeguards that sit on top

After the base label is computed, the classifier applies guardrails:

  • Minimum depth guard: MIN_CLASSIFICATION_DEPTH = 16
  • Eval stability guard: EVAL_STABILITY_DELTA_PAWNS = 0.30
  • MultiPV ambiguity guard: MULTIPV_AMBIGUITY_GAP_PAWNS = 0.15
  • Obvious blunder exception: OBVIOUS_BLUNDER_CPL = 500

These rules are there to prevent false precision.
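One way to picture the guardrails is as a single "is the evidence weak?" gate. The sketch below is my own framing under the constants above; `Evidence` and `evidenceIsWeak` are hypothetical names, not ChessIQ's API:

```typescript
// Guardrail constants from the article; the gate logic is a hypothetical sketch.
const MIN_CLASSIFICATION_DEPTH = 16;
const EVAL_STABILITY_DELTA_PAWNS = 0.30;
const MULTIPV_AMBIGUITY_GAP_PAWNS = 0.15;
const OBVIOUS_BLUNDER_CPL = 500;

interface Evidence {
  depth: number;            // search depth reached for this position
  evalShiftPawns: number;   // how much the best eval moved between passes
  multipvGapPawns: number;  // eval gap between the top candidate lines
  cpl: number;              // centipawn loss of the played move
}

function evidenceIsWeak(e: Evidence): boolean {
  // Obvious blunders bypass every softening rule.
  if (e.cpl >= OBVIOUS_BLUNDER_CPL) return false;
  return (
    e.depth < MIN_CLASSIFICATION_DEPTH ||
    e.evalShiftPawns > EVAL_STABILITY_DELTA_PAWNS ||
    e.multipvGapPawns < MULTIPV_AMBIGUITY_GAP_PAWNS
  );
}
```

When this gate fires, the classifier softens the label rather than trusting it at full severity.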

Where naive classifiers drift

The failure modes are predictable:

  • Shallow search over-penalizes: early passes can label moves too harshly.
  • Near-tied candidate lines: tiny MultiPV gaps mean the move may not be clearly worse.
  • Mate context breaks CPL intuition: mate lines should override normal centipawn logic.
  • Unstable evals reduce certainty: if evaluations are still moving significantly, severity should be compressed.

Confidence bands in practice

I do not treat every label as equally certain.

High-confidence situations include:

  • Mate-forced outcomes
  • Depth at or above 16
  • Clear MultiPV separation
  • Stable evaluations across passes

Lower-confidence situations include:

  • Depth below 16
  • Best-eval shifts above 0.30 pawns
  • MultiPV gaps below 0.15 pawns

In lower-confidence cases, the classifier softens severity instead of pretending certainty.
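Softening, in the simplest form I can sketch it, means stepping the label one notch toward mildness. This is a hypothetical helper, assuming the severity ordering of the baseline buckets:

```typescript
// Hypothetical softening step: low-confidence labels move one bucket milder
// instead of being trusted at full severity.
const SEVERITY = ["excellent", "good", "inaccuracy", "mistake", "blunder"] as const;
type MoveLabel = (typeof SEVERITY)[number];

function softenOneStep(label: MoveLabel): MoveLabel {
  const i = SEVERITY.indexOf(label);
  // "excellent" has nothing milder; everything else steps down one bucket.
  return SEVERITY[Math.max(0, i - 1)];
}
```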

Mate and blunder handling

Two rules matter a lot:

  • If the best line is mate and the played line is not, classification is forced to blunder.
  • If the played line finds mate where ordinary eval logic would miss it, classification can be forced to best.
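These two overrides can be sketched as a small pre-pass that runs before any centipawn logic. `EngineLine` and `mateIn` are assumed shapes for illustration, and "best" is the forced-best label the rules describe:

```typescript
// Hypothetical mate-override sketch: mate context trumps CPL intuition.
type Label = "best" | "excellent" | "good" | "inaccuracy" | "mistake" | "blunder";

interface EngineLine {
  mateIn?: number; // set when the line forces mate in N moves
}

function applyMateOverride(base: Label, bestLine: EngineLine, playedLine: EngineLine): Label {
  // Missing a forced mate is always a blunder, whatever the CPL says.
  if (bestLine.mateIn !== undefined && playedLine.mateIn === undefined) {
    return "blunder";
  }
  // Finding a mate that ordinary eval logic would miss is forced to best.
  if (playedLine.mateIn !== undefined && bestLine.mateIn === undefined) {
    return "best";
  }
  return base;
}
```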

At the same time, obvious shallow blunders are preserved:

  • If CPL is at least 500, or a forced mate was missed, shallow-search softening does not erase the blunder.

There is also an anti-washout rule for severe errors: a move that started as a blunder can be softened once, but not repeatedly downgraded into a mild label.
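The anti-washout rule amounts to a floor on how far softening can travel. A minimal sketch, assuming the severity ordering above; `softenWithFloor` and `startedAsBlunder` are illustrative names echoing the diagnostics field described later:

```typescript
// Hypothetical anti-washout sketch: a move that started as a blunder may be
// softened once, but repeated guards cannot wash it down to a mild label.
const SEVERITY = ["excellent", "good", "inaccuracy", "mistake", "blunder"] as const;
type Label = (typeof SEVERITY)[number];

function softenWithFloor(current: Label, startedAsBlunder: boolean): Label {
  const softened = SEVERITY[Math.max(0, SEVERITY.indexOf(current) - 1)];
  // A severe error may drop one step, but never below "mistake".
  if (startedAsBlunder && SEVERITY.indexOf(softened) < SEVERITY.indexOf("mistake")) {
    return "mistake";
  }
  return softened;
}
```

Calling this repeatedly on a move that began as a blunder converges on "mistake" instead of drifting toward "good".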

Diagnostics: why a move got its label

I needed the classifier to be auditable, not opaque.

So classification has a diagnostics path that records:

  • cpl
  • startedAsBlunder
  • rawClassification
  • finalClassification
  • steps

The steps trail captures reasoning such as:

  • base=blunder
  • mate-forced-blunder
  • shallow-blunder-soften
  • stability-guard
  • multipv-ambiguity
  • downgrade:mistake

This makes tuning safer and debugging far easier.
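The fields and steps trail above suggest a record shape like the following. This is a hypothetical sketch of how such diagnostics might be modeled, not ChessIQ's actual types:

```typescript
// Hypothetical diagnostics record; field names mirror the list above,
// the trace helper is illustrative.
interface ClassificationDiagnostics {
  cpl: number;
  startedAsBlunder: boolean;
  rawClassification: string;
  finalClassification: string;
  steps: string[]; // ordered reasoning trail, e.g. "base=blunder"
}

function trace(diag: ClassificationDiagnostics, step: string): void {
  diag.steps.push(step);
}

// Example: a shallow-search blunder softened one step.
const diag: ClassificationDiagnostics = {
  cpl: 310,
  startedAsBlunder: true,
  rawClassification: "blunder",
  finalClassification: "mistake",
  steps: [],
};
trace(diag, "base=blunder");
trace(diag, "shallow-blunder-soften");
trace(diag, "downgrade:mistake");
```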

Regression tests for fragile edges

Threshold systems drift unless edge cases are locked down.

That is why move classification includes targeted regression tests around:

  • Shallow blunder softening behavior
  • Mate overrides
  • Ambiguity and stability interactions
  • Obvious blunder preservation
  • Missing optional signals (no phantom downgrades)
  • Baseline CPL bucket mapping
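One of those edges, obvious-blunder preservation, can be locked down with assertions like these. The `finalLabel` helper is a deliberately simplified stand-in for the real pipeline, using the constants described earlier:

```typescript
// Hypothetical regression-style check: obvious blunders must survive
// shallow-search softening. finalLabel is a simplified stand-in.
const OBVIOUS_BLUNDER_CPL = 500;
const MIN_CLASSIFICATION_DEPTH = 16;

function finalLabel(cpl: number, depth: number): string {
  const base = cpl >= 260 ? "blunder" : "mistake"; // simplified baseline
  const shallow = depth < MIN_CLASSIFICATION_DEPTH;
  if (base === "blunder" && shallow && cpl < OBVIOUS_BLUNDER_CPL) {
    return "mistake"; // shallow softening applies
  }
  return base; // obvious blunders are preserved
}

// Regression assertions lock the edge behavior down.
console.assert(finalLabel(600, 10) === "blunder", "obvious blunder preserved at shallow depth");
console.assert(finalLabel(300, 10) === "mistake", "ordinary shallow blunder softened");
console.assert(finalLabel(300, 20) === "blunder", "deep blunder kept");
```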

I rely on these tests so rule changes do not quietly rewrite classification behavior.

What I do and do not claim

I do not claim a perfect, universal ground-truth label for every move.

I do claim that the system is explicit and defensible:

  • Deterministic baseline thresholds
  • Context-aware safeguards
  • Diagnostic reasoning trails
  • Regression coverage for known failure modes

For me, that is what useful move classification looks like: consistent rules, explicit guardrails, and the humility to soften judgment when evidence is shaky.