90+ submissions · 342 BLS occupations · 9 frontier models · 5 regimes

Don't let the LLM pick a number.

Ask an LLM to score something on a 0–10 scale and you'll get a 7. Ask again, you'll get a 7. Vary the input and you'll still get a 7. The model isn't grading — it's anchoring. This is a methodology and a working set of tools that fix the problem the same way every time: the LLM finds evidence; math computes the score.

v0.8 adds a 30-second calibration probe that tells you whether the methodology will help on your specific model before you invest in the full pipeline.


LLMs anchor to 7-out-of-10.

Across 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations, models picked numbers in a tight band centered on 7. Variance was high run-to-run, but the central tendency stayed locked. Calibration was decoration, not signal.

Asking nicely doesn't help.

"Be honest." "Use the full scale." "A 4 means genuinely excellent." We tried all of it. Public benchmarks like lechmazur/writing retired their absolute rubric scoring after 9,139 score rows showed it. The pattern is the model, not the prompt.

Don't ask for a number. Ask for evidence.

LLMs are reliable at one thing: finding bounded, signed observations with file:line or timestamp evidence. Collect those. Run them through a formula. The number you get out is stable across reruns, calibrated, and defensible.

Seven principles.

Each one rules out a way the LLM-pick-a-number failure mode sneaks back in. Together they define a scoring pipeline you can defend against an adversary who's read the prompt.

  1. Separate observation from scoring

    The LLM finds evidence. A formula, not the LLM, produces the score. The number is computed, not chosen.

  2. Discrete signed impact items

    Every piece of evidence gets one of {+5, +3, +2, +1, −1, −2, −3, −5}. Forces commitment. Removes the 7-out-of-10 anchor.

  3. Diminishing returns (sqrt)

    normalized = net_impact / sqrt(total_items). The 40th item adds less than the 4th. Evidence farming is punished.

  4. Density-weighted confidence

    Confidence = how much evidence the scorer found, not how sure the scorer feels. Sparse runs are visibly low-confidence.

  5. Anchored center

    Sparse-evidence runs regress toward 50. The multiplier never exceeds 1.0: high evidence confirms, never amplifies beyond raw.

  6. Bounded scale with self-check

    Final scores live in [0, 100]. Across criteria, the spread must be ≥ 20; otherwise the evaluator was not discriminating.

  7. Separation of LLM and deterministic computation

    Independent passes by different model families collect evidence. Math, not the LLM, combines them. Adversarial synthesis catches contradictions.
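Principles 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the skills' actual code: the `EvidenceItem` class, its field names, and the `marginal_gain` helper are hypothetical, and returning 0.0 for an empty item list is an assumption. Only the impact set and the sqrt normalization come from this page.

```python
from dataclasses import dataclass
from math import sqrt

# Discrete impact set from principle 2 (no zero, no +4 or -4).
ALLOWED_IMPACTS = frozenset({5, 3, 2, 1, -1, -2, -3, -5})


@dataclass(frozen=True)
class EvidenceItem:
    """One signed observation. Field names are illustrative assumptions."""
    impact: int      # must be a member of ALLOWED_IMPACTS
    location: str    # e.g. "src/app.py:42" or a video timestamp
    note: str = ""

    def __post_init__(self):
        if self.impact not in ALLOWED_IMPACTS:
            raise ValueError(f"impact {self.impact} is not in the discrete set")


def normalized_impact(items):
    """Principle 3: net impact divided by the square root of the item count."""
    if not items:
        return 0.0  # assumption: no evidence means no movement from center
    return sum(item.impact for item in items) / sqrt(len(items))


def marginal_gain(n):
    """Gain in normalized impact from adding the n-th +1 item to n-1 others."""
    before = [EvidenceItem(1, f"file.py:{i}") for i in range(n - 1)]
    after = before + [EvidenceItem(1, f"file.py:{n}")]
    return normalized_impact(after) - normalized_impact(before)


print(round(marginal_gain(4), 3))   # 0.268 (the 4th item)
print(round(marginal_gain(40), 3))  # 0.08  (the 40th item)
```

Rejecting off-scale values at construction time means a model that answers with a 4 or a 0 fails loudly instead of quietly reintroducing a free-form scale.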

Discrete impact → diminishing returns → confidence-weighted.

Two application patterns. Use the pooled variant for simple checkers (one bucket of evidence, one final score). Use the per-criterion variant for matrix benchmarks (formula runs once per criterion; weighted average produces the overall).

# Per criterion (sqrt is math.sqrt; clamp(x, lo, hi) = max(lo, min(hi, x)))
net_impact         = sum(item.impact for item in items)
total_items        = len(items)                           # assumes at least one item
normalized         = net_impact / sqrt(total_items)
raw                = clamp(50 + normalized * 8.0, 0, 100)
density            = total_items / 20
multiplier         = 0.75 + 0.25 * clamp(density, 0, 1)   # never > 1.0
final              = round(50 + (raw - 50) * multiplier)
confidence         = clamp(density, 0, 1)
# Across criteria (overall); weights sum to 1.0
overall_score       = round(sum(c.final * c.weight for c in criteria))
overall_confidence  = min(c.confidence for c in criteria)
self_check_span     = max(c.final for c in criteria) - min(c.final for c in criteria)
                      # must be >= 20

Discrete impact set: {+5, +3, +2, +1, −1, −2, −3, −5}. Hard cap of 5 items per perspective per criterion per pass. The multiplier never exceeds 1.0 (confirms, never amplifies). Sparse evidence is visibly low-confidence, not silently confident.
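The formula above can be turned into a small runnable sketch. The names `CriterionScore`, `score_criterion`, `score_overall`, and the `scale` / `full_density` parameters are hypothetical, chosen here for illustration. The zero-item branch is an assumption (the page does not say what happens with no evidence), and the weights are assumed to sum to 1.

```python
from dataclasses import dataclass
from math import sqrt


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


@dataclass
class CriterionScore:
    final: int
    confidence: float
    weight: float = 1.0  # hypothetical field; weights are assumed to sum to 1


def score_criterion(impacts, weight=1.0, scale=8.0, full_density=20):
    """Apply the per-criterion formula to a list of signed impact values."""
    total_items = len(impacts)
    if total_items == 0:
        # Assumption: with no evidence, stay at the center with zero confidence.
        return CriterionScore(final=50, confidence=0.0, weight=weight)
    normalized = sum(impacts) / sqrt(total_items)
    raw = clamp(50 + normalized * scale, 0, 100)
    density = total_items / full_density
    multiplier = 0.75 + 0.25 * clamp(density, 0, 1)  # never > 1.0
    final = round(50 + (raw - 50) * multiplier)
    return CriterionScore(final, float(clamp(density, 0, 1)), weight)


def score_overall(criteria, min_span=20):
    """Combine per-criterion results; returns (score, confidence, span_ok)."""
    overall = round(sum(c.final * c.weight for c in criteria))
    confidence = min(c.confidence for c in criteria)
    span = max(c.final for c in criteria) - min(c.final for c in criteria)
    return overall, confidence, span >= min_span


a = score_criterion([1] * 25)      # net +25 over 25 items
print(a.final, a.confidence)       # 90 1.0
```

One detail worth knowing if you port this: Python's built-in `round` rounds exact halves to the nearest even integer, so another language's rounding may differ by one point on ties.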

Calibration anchors.

The 8.0 scale factor was tuned to put scores at familiar landmarks. The numbers below are not opinions — they fall out of the formula given each normalized_impact.

| normalized_impact | raw_score | tier |
|---|---|---|
| +5.0 | ~90 | Exceptional (5–10% of submissions) |
| +2.5 | ~70 | Above average |
| 0.0 | ~50 | Average |
| −2.5 | ~30 | Below average |
| −5.0 | ~10 | Poor |
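As a worked equation, the landmarks follow directly from the raw-score line of the formula, with the density multiplier at 1.0. The symbol \hat{n} is introduced here only as shorthand for normalized_impact.

```latex
\mathrm{raw}(\hat{n}) = \operatorname{clamp}\bigl(50 + 8\,\hat{n},\; 0,\; 100\bigr),
\qquad
\mathrm{raw}(\pm 5) = 50 \pm 40,
\qquad
\mathrm{raw}(\pm 2.5) = 50 \pm 20 .
```

The clamp engages once |\hat{n}| reaches 50 / 8 = 6.25, so any normalized impact beyond that saturates at 0 or 100.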

Run the math by hand.

Four small examples + one real-world rescoring. The toy cases show how the pieces interact; the rescoring shows what changes when an existing benchmark drops the LLM-picks-a-number step.

  1. A: Strong, abundant evidence

    given: net_impact = +25, total_items = 25

    • normalized = 25 / sqrt(25) = 5.0
    • raw = 50 + (5.0 × 8.0) = 90
    • density = 25/20 = 1.25 → multiplier = 1.0
    • final = 90, confidence = 1.0
  2. B: Strong, sparse evidence

    given: net_impact = +20, total_items = 4 (four +5 items, the most that four items can carry)

    • normalized = 20 / sqrt(4) = 10.0
    • raw = 50 + (10.0 × 8.0) = 130 → clamped 100
    • density = 4/20 = 0.2 → multiplier = 0.80
    • final = round(50 + 50 × 0.80) = 90, confidence = 0.2
  3. C: Average

    given: net_impact = 0, total_items = 20

    • normalized = 0
    • raw = 50
    • density = 1.0 → multiplier = 1.0
    • final = 50, confidence = 1.0
  4. D: Weak, well-evidenced

    given: net_impact = −15, total_items = 25

    • normalized = −15 / sqrt(25) = −3.0
    • raw = 50 + (−3.0 × 8.0) = 26
    • density = 1.25 → multiplier = 1.0
    • final = 26, confidence = 1.0 (high confidence in a low score)
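The toy cases stop at the per-criterion result. To see the across-criteria step as well, one can treat the four results above as four criteria of a single hypothetical submission with equal weights. That framing is an assumption made purely for illustration; the page does not combine them.

```python
# Finals and confidences from examples A-D above.
finals = {"A": 90, "B": 90, "C": 50, "D": 26}
confidences = {"A": 1.0, "B": 0.2, "C": 1.0, "D": 1.0}

# Hypothetical equal weights; real weights come from your own matrix.
weights = {name: 1 / len(finals) for name in finals}

overall_score = round(sum(finals[n] * weights[n] for n in finals))
overall_confidence = min(confidences.values())
self_check_span = max(finals.values()) - min(finals.values())

print(overall_score)        # 64
print(overall_confidence)   # 0.2
print(self_check_span)      # 64, which passes the >= 20 self-check
```

Because overall confidence is a minimum rather than an average, the one sparse criterion (B) drags the whole result down to 0.2, which keeps thin evidence visible in the headline number.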

Rescoring case study: pbakaus/impeccable

full analysis →

Paul Bakaus' frontend-design skill bundle scored UIs on the 10 Nielsen heuristics, 0–4 each, summed to a 0–40 band. Two judges (an LLM pass + a deterministic detector with 24 antipattern rules) ran in isolation — exactly the right architecture. But the LLM still picked the numbers. We replaced that step with signed-evidence-item collection and ran the standard formula.

vanilla 76 · rescored 59 · Δ 17 pts

Stable across runs. The page wasn't bad — but it wasn't 76. The vanilla score was carrying ambient charity.

Companion analysis: trycua/cua-bench

full analysis →

Computer-use agent benchmark. Reward is a deterministic float — no LLM in the scoring path, so principle 7 is satisfied by construction. The remaining opportunity is principles 2–6: signed multi-signal evidence accumulation instead of single-signal pass/fail. The honest take: not every benchmark needs the full methodology. cua-bench is a partial fit, not a slam-dunk like impeccable.

v0.8 — calibration-conditional

Will it help on your model? Run the probe first.

The methodology is not a free lunch. We ran the v3 prompts across six model families (gpt-5.5, deepseek-flash, gemini-3-flash, gemma4, gpt-oss-20b, nemotron) on a held-out 77-submission hackathon dataset and observed a clean regime structure. Principled rescoring wins on every metric for the most-inflated model (gemini-3-flash). It can over-correct on already-calibrated models (gpt-5.5, deepseek-flash, gpt-oss-20b: the 3-of-6 over-correction pattern). And it cannot rescue intrinsically weak signal (nemotron, the textbook PICKS_A_NUMBER case).

The fix is to run a 30-second probe first — a 20-item synthetic rating test, no ground truth required — and let the regime label tell you how much methodology to apply.

| Regime | What it looks like | Recommendation |
|---|---|---|
| CALIBRATED | range ≥ 5, both 1s and 9s show up | Lighter touch: Principles 1, 4, 5 only. |
| INFLATION_LIKELY | lots of 9s/10s, almost no 1s/2s | Full pipeline. Methodology was designed for this. |
| DEFLATION_LIKELY | lots of 1s/2s, almost no 9s/10s | Full pipeline + reduced counter-bias. |
| PICKS_A_NUMBER | tight cluster around one score (range ≤ 2) | Switch model. The formula cannot rescue weak signal. |
| JITTERY | same item, very different scores run-to-run | Ensemble first. Reliability is the constraint. |

Order of checks: JITTERY first (high run-to-run variance makes the shape rules unreliable), then INFLATION_LIKELY / DEFLATION_LIKELY (extreme clusters), then PICKS_A_NUMBER (compressed middle), then CALIBRATED. Full classifier thresholds and the empirical six-model grid are in Sections 5.5, 5.7, and Appendix E of the paper.
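The order of checks can be sketched as a small classifier. This is not the probe's implementation. The function name, the input shape (a list of runs, each a list of 1–10 scores for the same items), and every numeric threshold (`jitter_sd`, `extreme_share`, `rare_share`) are placeholders invented for illustration; the real thresholds are in the paper and are not reproduced here. Only the range rules (≤ 2 and ≥ 5) and the check order come from this page, and the "both 1s and 9s show up" condition is left out for brevity.

```python
from statistics import mean, pstdev


def classify_regime(runs, jitter_sd=1.5, extreme_share=0.4, rare_share=0.05):
    """Label a probe result. All thresholds are hypothetical placeholders."""
    n_items = len(runs[0])

    # 1. JITTERY first: high run-to-run spread makes the shape rules unreliable.
    per_item_sd = [pstdev([run[i] for run in runs]) for i in range(n_items)]
    if mean(per_item_sd) > jitter_sd:
        return "JITTERY"

    # 2. Extreme clusters: share of very high and very low scores.
    all_scores = [score for run in runs for score in run]
    high = sum(s >= 9 for s in all_scores) / len(all_scores)
    low = sum(s <= 2 for s in all_scores) / len(all_scores)
    if high >= extreme_share and low <= rare_share:
        return "INFLATION_LIKELY"
    if low >= extreme_share and high <= rare_share:
        return "DEFLATION_LIKELY"

    # 3. Compressed middle, then 4. calibrated.
    score_range = max(all_scores) - min(all_scores)
    if score_range <= 2:
        return "PICKS_A_NUMBER"
    if score_range >= 5:
        return "CALIBRATED"
    return "UNCLASSIFIED"  # hypothetical fallback; the page lists no sixth label


stuck_on_seven = [[7] * 20 for _ in range(30)]
print(classify_regime(stuck_on_seven))  # PICKS_A_NUMBER
```

Checking the extreme clusters before the range rule matters: a model that answers 9 to everything has a range of 0, but under this ordering it is labeled INFLATION_LIKELY rather than PICKS_A_NUMBER.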

Run the probe.

Install calibration-probe, point it at your candidate scoring model, and read the regime label. If it lands CALIBRATED, use the lighter touch. If INFLATION_LIKELY, run the full pipeline; if DEFLATION_LIKELY, run it with reduced counter-bias. If JITTERY, ensemble first. If PICKS_A_NUMBER, switch models: the formula cannot rescue intrinsically weak signal.

In production

A whole product runs on this formula.

MyBench is a 45-minute interview that turns your real work into a private, saturation-resistant AI benchmark suite, then scores it across model × harness combinations. The scoring engine is the seven-principle methodology on this page, reused unchanged. Same discrete impact set. Same sqrt normalization. Same 5×5 perspective × criterion matrix. Different domain, same math.

MyBench
GitHub →

Stop reading model reviews. Build your own benchmark. Three to five tests tuned to your work, with planted traps and an evidence guide. Re-run weekly when a new model ships.

Scoring: 7-principle, this page
Interview: ~45 minutes
Output: 3–5 benchmarks + traps
Axis: model × harness
Visit mybench.codefiworks.com

Why this matters here.

The methodology works on the things it was calibrated against (hackathon code, BLS occupations) — and on a domain it had nothing to do with. Same formula, different surface. That's the test for whether a scoring approach is general or just overfit to its training anecdotes.

If you want to see the seven principles applied end-to-end on a fresh problem before you install one of the skills below — start there.

One preflight, three ways to score.

One repo, four skills. calibration-probe is the 30-second preflight — run it first to find out which regime your model is in. Then pick the scoring skill that matches your input shape. All four are skills.sh-installable into Claude Code, Cursor, Goose, OpenCode, and any other skills-aware agent.

// preflight calibration-probe first — 30 seconds, tells you whether the methodology will help on your model.
// not sure which? what-works-feedback-judge is the simplest. hackathon-judge if you have a code submission. evidence-scoring if you're bringing your own domain.

calibration-probe

A 30-second preflight. Will the methodology even help on your model?

SKILL.md →

Synthetic 20-item rating test, repeated 30 times, no ground truth required. Classifies your candidate model into one of five regimes (CALIBRATED, INFLATION_LIKELY, DEFLATION_LIKELY, PICKS_A_NUMBER, JITTERY) and tells you whether to run the full pipeline, use a lighter touch, or switch models. Run before any of the three scoring skills below.

evidence-scoring

The seven-principle methodology, generic.

SKILL.md →

Bring your own domain. Define your matrix. The skill walks you through cataloging signed evidence items and runs the formula. You get the methodology; nothing is prescribed about what you score.

what-works-feedback-judge

A 4-question feedback loop. Score any draft.

SKILL.md →

Pre-baked Working / Not working / Missing / Confusing application. Hand it any draft, spec, plan, or pitch and get a 0–100 readiness score plus four grouped action lists. Iterate; the v1 → v2 score delta tells you whether the revision actually moved the score.

hackathon-judge

Four-pass project judging — code, demo, math, mentoring.

SKILL.md →

Score any project submission with a codebase (and optional demo video) against a 5×5 evidence matrix. Independent passes prevent polish from masking thin code and vice versa. Math computes the scores; the team gets a grounded mentoring report. Reads a `calibration_regime` parameter from the probe to dial counter-bias up or down.

The paper.

Don't Let the LLM Pick a Number — methodology paper. Calibrated on 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations across 9 models. Includes the full derivation, ablations, and the impeccable rescoring case study.

Title: Don't Let the LLM Pick a Number
Status: v0.8.0 draft
Length: ~12k words + 5 appendices
Calibrated on: 90+ hackathon submissions (3 events), 342 BLS occupations
Models tested: 9 frontier models
License: MIT (markdown source)
Read the paper