
How I Validate Move Classification Quality

Sanity checks, confidence bands, and regression tests for move labels.

Move labels start with deterministic centipawn thresholds, then get softened or overridden based on depth, stability, mate context, and MultiPV ambiguity.

Move classification quality is not just about choosing centipawn thresholds.

It is also about knowing when not to trust a harsh label yet.

That realization changed how I built classification in ChessIQ. The system starts with deterministic CPL buckets, then applies safeguards that reduce overconfident judgments when engine evidence is weak or ambiguous.

Why raw CPL is not enough

A naive classifier maps centipawn loss directly to labels and stops there.

My baseline buckets are straightforward:

  • excellent: CPL < 12
  • good: CPL < 45
  • inaccuracy: CPL < 120
  • mistake: CPL < 260
  • blunder: CPL >= 260
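The buckets above are simple enough to sketch directly. This is an illustrative version, not ChessIQ's actual code; `classifyByCpl` and `MoveLabel` are names I am inventing here:

```typescript
// Hypothetical sketch of the baseline CPL buckets; thresholds are the ones
// listed above, names are illustrative.
type MoveLabel = "excellent" | "good" | "inaccuracy" | "mistake" | "blunder";

function classifyByCpl(cpl: number): MoveLabel {
  if (cpl < 12) return "excellent";
  if (cpl < 45) return "good";
  if (cpl < 120) return "inaccuracy";
  if (cpl < 260) return "mistake";
  return "blunder";
}
```

Note that the boundaries are half-open: a CPL of exactly 12 is "good", and exactly 260 is "blunder".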

That gives consistency, but not reliability on its own. Search depth, tactical context, evaluation stability, and candidate-move ambiguity all affect how trustworthy a label is.

The safeguards that sit on top

After the base label is computed, the classifier applies guardrails:

  • Minimum depth guard: MIN_CLASSIFICATION_DEPTH = 16
  • Eval stability guard: EVAL_STABILITY_DELTA_PAWNS = 0.30
  • MultiPV ambiguity guard: MULTIPV_AMBIGUITY_GAP_PAWNS = 0.15
  • Obvious blunder exception: OBVIOUS_BLUNDER_CPL = 500

These rules are there to prevent false precision.
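One way to picture the guardrails is as a single "is the evidence weak?" gate. The sketch below is my own framing under the constants above; `Evidence` and `evidenceIsWeak` are hypothetical names, not ChessIQ's API:

```typescript
// Guardrail constants from the article; the gate logic is a hypothetical sketch.
const MIN_CLASSIFICATION_DEPTH = 16;
const EVAL_STABILITY_DELTA_PAWNS = 0.30;
const MULTIPV_AMBIGUITY_GAP_PAWNS = 0.15;
const OBVIOUS_BLUNDER_CPL = 500;

interface Evidence {
  depth: number;            // search depth reached for this position
  evalShiftPawns: number;   // how much the best eval moved between passes
  multipvGapPawns: number;  // eval gap between the top candidate lines
  cpl: number;              // centipawn loss of the played move
}

function evidenceIsWeak(e: Evidence): boolean {
  // Obvious blunders bypass every softening rule.
  if (e.cpl >= OBVIOUS_BLUNDER_CPL) return false;
  return (
    e.depth < MIN_CLASSIFICATION_DEPTH ||
    e.evalShiftPawns > EVAL_STABILITY_DELTA_PAWNS ||
    e.multipvGapPawns < MULTIPV_AMBIGUITY_GAP_PAWNS
  );
}
```

When this gate fires, the classifier softens the label rather than trusting it at full severity.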

Where naive classifiers drift

The failure modes are predictable:

  • Shallow search over-penalizes: early passes can label moves too harshly.
  • Near-tied candidate lines: tiny MultiPV gaps mean the move may not be clearly worse.
  • Mate context breaks CPL intuition: mate lines should override normal centipawn logic.
  • Unstable evals reduce certainty: if evaluations are still moving significantly, severity should be compressed.

Confidence bands in practice

I do not treat every label as equally certain.

High-confidence situations include:

  • Mate-forced outcomes
  • Depth at or above 16
  • Clear MultiPV separation
  • Stable evaluations across passes

Lower-confidence situations include:

  • Depth below 16
  • Best-eval shifts above 0.30 pawns
  • MultiPV gaps below 0.15 pawns

In lower-confidence cases, the classifier softens severity instead of pretending certainty.
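Softening, in the simplest form I can sketch it, means stepping the label one notch toward mildness. This is a hypothetical helper, assuming the severity ordering of the baseline buckets:

```typescript
// Hypothetical softening step: low-confidence labels move one bucket milder
// instead of being trusted at full severity.
const SEVERITY = ["excellent", "good", "inaccuracy", "mistake", "blunder"] as const;
type MoveLabel = (typeof SEVERITY)[number];

function softenOneStep(label: MoveLabel): MoveLabel {
  const i = SEVERITY.indexOf(label);
  // "excellent" has nothing milder; everything else steps down one bucket.
  return SEVERITY[Math.max(0, i - 1)];
}
```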

Mate and blunder handling

Two rules matter a lot:

  • If the best line is mate and the played line is not, classification is forced to blunder.
  • If the played line finds mate where ordinary eval logic would miss it, classification can be forced to best.
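These two overrides can be sketched as a small pre-pass that runs before any centipawn logic. `EngineLine` and `mateIn` are assumed shapes for illustration, and "best" is the forced-best label the rules describe:

```typescript
// Hypothetical mate-override sketch: mate context trumps CPL intuition.
type Label = "best" | "excellent" | "good" | "inaccuracy" | "mistake" | "blunder";

interface EngineLine {
  mateIn?: number; // set when the line forces mate in N moves
}

function applyMateOverride(base: Label, bestLine: EngineLine, playedLine: EngineLine): Label {
  // Missing a forced mate is always a blunder, whatever the CPL says.
  if (bestLine.mateIn !== undefined && playedLine.mateIn === undefined) {
    return "blunder";
  }
  // Finding a mate that ordinary eval logic would miss is forced to best.
  if (playedLine.mateIn !== undefined && bestLine.mateIn === undefined) {
    return "best";
  }
  return base;
}
```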

At the same time, obvious shallow blunders are preserved:

  • If CPL is at least 500, or a forced mate was missed, shallow-search softening does not erase the blunder.

There is also an anti-washout rule for severe errors: a move that started as a blunder can be softened once, but not repeatedly downgraded into a mild label.
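The anti-washout rule amounts to a floor on how far softening can travel. A minimal sketch, assuming the severity ordering above; `softenWithFloor` and `startedAsBlunder` are illustrative names echoing the diagnostics field described later:

```typescript
// Hypothetical anti-washout sketch: a move that started as a blunder may be
// softened once, but repeated guards cannot wash it down to a mild label.
const SEVERITY = ["excellent", "good", "inaccuracy", "mistake", "blunder"] as const;
type Label = (typeof SEVERITY)[number];

function softenWithFloor(current: Label, startedAsBlunder: boolean): Label {
  const softened = SEVERITY[Math.max(0, SEVERITY.indexOf(current) - 1)];
  // A severe error may drop one step, but never below "mistake".
  if (startedAsBlunder && SEVERITY.indexOf(softened) < SEVERITY.indexOf("mistake")) {
    return "mistake";
  }
  return softened;
}
```

Calling this repeatedly on a move that began as a blunder converges on "mistake" instead of drifting toward "good".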

Diagnostics: why a move got its label

I needed the classifier to be auditable, not opaque.

So classification has a diagnostics path that records:

  • cpl
  • startedAsBlunder
  • rawClassification
  • finalClassification
  • steps

The steps trail captures reasoning such as:

  • base=blunder
  • mate-forced-blunder
  • shallow-blunder-soften
  • stability-guard
  • multipv-ambiguity
  • downgrade:mistake

This makes tuning safer and debugging far easier.
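The fields and steps trail above suggest a record shape like the following. This is a hypothetical sketch of how such diagnostics might be modeled, not ChessIQ's actual types:

```typescript
// Hypothetical diagnostics record; field names mirror the list above,
// the trace helper is illustrative.
interface ClassificationDiagnostics {
  cpl: number;
  startedAsBlunder: boolean;
  rawClassification: string;
  finalClassification: string;
  steps: string[]; // ordered reasoning trail, e.g. "base=blunder"
}

function trace(diag: ClassificationDiagnostics, step: string): void {
  diag.steps.push(step);
}

// Example: a shallow-search blunder softened one step.
const diag: ClassificationDiagnostics = {
  cpl: 310,
  startedAsBlunder: true,
  rawClassification: "blunder",
  finalClassification: "mistake",
  steps: [],
};
trace(diag, "base=blunder");
trace(diag, "shallow-blunder-soften");
trace(diag, "downgrade:mistake");
```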

Regression tests for fragile edges

Threshold systems drift unless edge cases are locked down.

That is why move classification includes targeted regression tests around:

  • Shallow blunder softening behavior
  • Mate overrides
  • Ambiguity and stability interactions
  • Obvious blunder preservation
  • Missing optional signals (no phantom downgrades)
  • Baseline CPL bucket mapping
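One of those edges, obvious-blunder preservation, can be locked down with assertions like these. The `finalLabel` helper is a deliberately simplified stand-in for the real pipeline, using the constants described earlier:

```typescript
// Hypothetical regression-style check: obvious blunders must survive
// shallow-search softening. finalLabel is a simplified stand-in.
const OBVIOUS_BLUNDER_CPL = 500;
const MIN_CLASSIFICATION_DEPTH = 16;

function finalLabel(cpl: number, depth: number): string {
  const base = cpl >= 260 ? "blunder" : "mistake"; // simplified baseline
  const shallow = depth < MIN_CLASSIFICATION_DEPTH;
  if (base === "blunder" && shallow && cpl < OBVIOUS_BLUNDER_CPL) {
    return "mistake"; // shallow softening applies
  }
  return base; // obvious blunders are preserved
}

// Regression assertions lock the edge behavior down.
console.assert(finalLabel(600, 10) === "blunder", "obvious blunder preserved at shallow depth");
console.assert(finalLabel(300, 10) === "mistake", "ordinary shallow blunder softened");
console.assert(finalLabel(300, 20) === "blunder", "deep blunder kept");
```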

I rely on these tests so rule changes do not quietly rewrite classification behavior.

What I do and do not claim

I do not claim a perfect, universal ground-truth label for every move.

I do claim that the system is explicit and defensible:

  • Deterministic baseline thresholds
  • Context-aware safeguards
  • Diagnostic reasoning trails
  • Regression coverage for known failure modes

For me, that is what useful move classification looks like: consistent rules, explicit guardrails, and the humility to soften judgment when evidence is shaky.