LLMs anchor to 7-out-of-10.
Across 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations, models picked numbers in a tight band centered on 7. Variance was high run-to-run, but the central tendency stayed locked. Calibration was decoration, not signal.
Asking nicely doesn't help.
"Be honest." "Use the full scale." "A 4 means genuinely excellent." We tried all of it. Public benchmarks like lechmazur/writing retired their absolute rubric scoring after 9,139 score rows showed it. The pattern is the model, not the prompt.
Don't ask for a number. Ask for evidence.
LLMs are reliable at one thing: finding bounded, signed observations with file:line or timestamp evidence. Collect those. Run them through a formula. The number you get out is stable across reruns, calibrated, and defensible.
Seven principles.
Each one rules out a way the LLM-pick-a-number failure mode sneaks back in. Together they define a scoring pipeline you can defend against an adversary who's read the prompt.
- 01 Separate observation from scoring: The LLM finds evidence. A formula, not the LLM, produces the score. The number is computed, not chosen.
- 02 Discrete signed impact items: Every piece of evidence gets one of {+5, +3, +2, +1, −1, −2, −3, −5}. Forces commitment. Removes the 7-out-of-10 anchor.
- 03 Diminishing returns (sqrt): normalized = net_impact / sqrt(total_items). The 40th item adds less than the 4th. Evidence farming is punished.
- 04 Density-weighted confidence: Confidence = how much evidence the scorer found, not how sure the scorer feels. Sparse runs are visibly low-confidence.
- 05 Anchored center: Sparse-evidence runs regress toward 50. The multiplier never exceeds 1.0, so high evidence confirms and never amplifies beyond raw.
- 06 Bounded scale with self-check: Final scores live in [0, 100]. Across criteria, the spread must be ≥ 20; otherwise the evaluator was not discriminating.
- 07 Separation of LLM and deterministic computation: Independent passes by different model families collect evidence. Math, not the LLM, combines them. Adversarial synthesis catches contradictions.
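Principle 03's claim that the 40th item adds less than the 4th can be checked directly. The following is a minimal sketch, assuming every item carries the same +1 impact; `normalized` and `marginal_gain` are hypothetical helper names, not part of any published skill.

```python
from math import sqrt

def normalized(impacts):
    """net_impact / sqrt(total_items), as stated in principle 03."""
    return sum(impacts) / sqrt(len(impacts))

def marginal_gain(k, impact=1):
    """Change in normalized impact when the k-th identical item is added."""
    before = normalized([impact] * (k - 1)) if k > 1 else 0.0
    after = normalized([impact] * k)
    return after - before

fourth = marginal_gain(4)     # sqrt(4) - sqrt(3)
fortieth = marginal_gain(40)  # sqrt(40) - sqrt(39)
print(f"4th item adds {fourth:.3f}, 40th item adds {fortieth:.3f}")
```

With identical +1 items the normalized value is simply sqrt(n), so each extra item still moves the score, only by a shrinking amount.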
Discrete impact → diminishing returns → confidence-weighted.
Two application patterns. Use the pooled variant for simple checkers (one bucket of evidence, one final score). Use the per-criterion variant for matrix benchmarks (formula runs once per criterion; weighted average produces the overall).
# Per criterion
net_impact = sum(item.impact for item in items)
total_items = len(items)
normalized = net_impact / sqrt(total_items)
raw = clamp(50 + normalized * 8.0, 0, 100)
density = total_items / 20
multiplier = 0.75 + 0.25 * clamp(density, 0, 1) # never > 1.0
final = round(50 + (raw - 50) * multiplier)
confidence = clamp(density, 0, 1)
# Across criteria (overall)
overall_score = round(sum(c.final * c.weight for c in criteria))
overall_confidence = min(c.confidence for c in criteria)
self_check_span = max(c.final for c in criteria) - min(c.final for c in criteria)
# must be >= 20
Discrete impact set: {+5, +3, +2, +1, −1, −2, −3, −5}. Hard cap of 5 items per perspective per criterion per pass. The multiplier never exceeds 1.0 (confirms, never amplifies). Sparse evidence is visibly low-confidence, not silently confident.
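The formula above can be turned into a small runnable module. This is a sketch under stated assumptions rather than the skills' actual implementation: the names `CriterionScore`, `score_criterion`, and `score_overall` are hypothetical, an empty evidence list is assumed to score 50 with zero confidence (the source does not say), and weights are assumed to sum to 1.

```python
from dataclasses import dataclass
from math import sqrt

ALLOWED_IMPACTS = {5, 3, 2, 1, -1, -2, -3, -5}
SCALE = 8.0        # scale factor from the formula
FULL_DENSITY = 20  # item count at which density reaches 1.0

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

@dataclass
class CriterionScore:
    final: int
    confidence: float
    weight: float

def score_criterion(impacts, weight):
    """Apply the per-criterion formula to a list of signed impact values."""
    bad = [i for i in impacts if i not in ALLOWED_IMPACTS]
    if bad:
        raise ValueError(f"impacts outside the discrete set: {bad}")
    total_items = len(impacts)
    if total_items == 0:
        # Assumption: no evidence means the anchored center, zero confidence.
        return CriterionScore(final=50, confidence=0.0, weight=weight)
    normalized = sum(impacts) / sqrt(total_items)
    raw = clamp(50 + normalized * SCALE, 0, 100)
    confidence = clamp(total_items / FULL_DENSITY, 0, 1)
    multiplier = 0.75 + 0.25 * confidence  # never > 1.0
    final = round(50 + (raw - 50) * multiplier)
    return CriterionScore(final, confidence, weight)

def score_overall(criteria, min_span=20):
    """Combine criteria; weights are assumed to sum to 1."""
    if abs(sum(c.weight for c in criteria) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    finals = [c.final for c in criteria]
    span = max(finals) - min(finals)
    return {
        "overall_score": round(sum(c.final * c.weight for c in criteria)),
        "overall_confidence": min(c.confidence for c in criteria),
        "self_check_span": span,
        "self_check_passed": span >= min_span,
    }

# Usage: two equally weighted criteria with made-up evidence.
criteria = [
    score_criterion([5, 3, 3, 2, 1, -1], weight=0.5),
    score_criterion([-3, -2, -2, 1], weight=0.5),
]
result = score_overall(criteria)
print(criteria, result)
```

Python's built-in `round` rounds exact halves to the nearest even integer; the source does not specify a rounding mode, so a value landing exactly on .5 may differ from other implementations.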
Calibration anchors.
The 8.0 scale factor was tuned to put scores at familiar landmarks. The numbers below are not opinions — they fall out of the formula given each normalized_impact.
| normalized_impact | → raw_score | tier |
|---|---|---|
| +5.0 | ~90 | Exceptional (5–10% of submissions) |
| +2.5 | ~70 | Above average |
| 0.0 | ~50 | Average |
| −2.5 | ~30 | Below average |
| −5.0 | ~10 | Poor |
Run the math by hand.
Four small examples + one real-world rescoring. The toy cases show how the pieces interact; the rescoring shows what changes when an existing benchmark drops the LLM-picks-a-number step.
- A · Strong, abundant evidence
  - given: net_impact = +25, total_items = 25
  - normalized = 25 / sqrt(25) = 5.0
  - raw = 50 + (5.0 × 8.0) = 90
  - density = 25/20 = 1.25 → multiplier = 1.0
  - final = 90, confidence = 1.0
- B · Strong, sparse evidence
  - given: net_impact = +25, total_items = 4
  - normalized = 25 / sqrt(4) = 12.5
  - raw = 50 + (12.5 × 8.0) = 150 → clamped 100
  - density = 4/20 = 0.2 → multiplier = 0.80
  - final = round(50 + 50 × 0.80) = 90, confidence = 0.2
- C · Average
  - given: net_impact = 0, total_items = 20
  - normalized = 0
  - raw = 50
  - density = 1.0 → multiplier = 1.0
  - final = 50, confidence = 1.0
- D · Weak, well-evidenced
  - given: net_impact = −15, total_items = 25
  - normalized = −15 / sqrt(25) = −3.0
  - raw = 50 + (−3.0 × 8.0) = 26
  - density = 1.25 → multiplier = 1.0
  - final = 26, confidence = 1.0 (high confidence in a low score)
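The four toy cases can be reproduced from their aggregate inputs alone. The sketch below assumes a hypothetical `score` helper that takes net_impact and total_items directly instead of a list of items; the constants are the ones in the formula.

```python
from math import sqrt

def score(net_impact, total_items, scale=8.0, full_density=20):
    """Return (final, confidence) from aggregate counts, per the per-criterion formula."""
    normalized = net_impact / sqrt(total_items)
    raw = max(0.0, min(100.0, 50 + normalized * scale))
    confidence = max(0.0, min(1.0, total_items / full_density))
    multiplier = 0.75 + 0.25 * confidence  # never > 1.0
    return round(50 + (raw - 50) * multiplier), confidence

examples = {
    "A": (25, 25),   # strong, abundant evidence
    "B": (25, 4),    # strong, sparse evidence
    "C": (0, 20),    # average
    "D": (-15, 25),  # weak, well-evidenced
}
results = {name: score(*args) for name, args in examples.items()}
for name, (final, confidence) in results.items():
    print(f"{name}: final={final}, confidence={confidence}")
```

A and B land on the same final score of 90, but B reports confidence 0.2 against A's 1.0, which is the distinction principle 04 is meant to surface. Note that B's inputs are illustrative: four items drawn from the discrete set can sum to at most +20, which would still clamp raw to 100 and give the same final.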
Rescoring case study: pbakaus/impeccable
Paul Bakaus' frontend-design skill bundle scored UIs on the 10 Nielsen heuristics, 0–4 each, summed to a 0–40 band. Two judges (an LLM pass + a deterministic detector with 24 antipattern rules) ran in isolation, which is exactly the right architecture. But the LLM still picked the numbers. We replaced that step with signed-evidence-item collection and ran the standard formula.
Stable across runs. The page wasn't bad — but it wasn't 76. The vanilla score was carrying ambient charity.
Companion analysis: trycua/cua-bench
Computer-use agent benchmark. Reward is a deterministic float: there is no LLM in the scoring path, so principle 7 is satisfied by construction. The remaining opportunity is principles 2–6: signed multi-signal evidence accumulation instead of single-signal pass/fail. The honest take: not every benchmark needs the full methodology. cua-bench is a partial fit, not a slam-dunk like impeccable.
v0.8 — calibration-conditional
Will it help on your model? Run the probe first.
The methodology is not a free lunch. We ran the v3 prompts across six model families (gpt-5.5, deepseek-flash, gemini-3-flash, gemma4, gpt-oss-20b, nemotron) on a held-out 77-submission hackathon dataset and observed a clean regime structure. Principled rescoring wins on every metric for the most-inflated model (gemini-3-flash). It can over-correct on already-calibrated models (gpt-5.5, deepseek-flash, gpt-oss-20b: the 3-of-6 over-correction pattern). It cannot rescue intrinsically weak signal (nemotron, the textbook PICKS_A_NUMBER case).
The fix is to run a 30-second probe first — a 20-item synthetic rating test, no ground truth required — and let the regime label tell you how much methodology to apply.
| Regime | What it looks like | Recommendation |
|---|---|---|
| CALIBRATED | range ≥ 5, both 1s and 9s show up | Lighter touch — Principles 1, 4, 5 only. |
| INFLATION_LIKELY | lots of 9s/10s, almost no 1s/2s | Full pipeline. Methodology was designed for this. |
| DEFLATION_LIKELY | lots of 1s/2s, almost no 9s/10s | Full pipeline + reduced counter-bias. |
| PICKS_A_NUMBER | tight cluster around one score (range ≤ 2) | Switch model. The formula cannot rescue weak signal. |
| JITTERY | same item, very different scores run-to-run | Ensemble first. Reliability is the constraint. |
Order of checks: JITTERY first (high run-to-run variance makes the shape rules unreliable), then INFLATION_LIKELY / DEFLATION_LIKELY (extreme clusters), then PICKS_A_NUMBER (compressed middle), then CALIBRATED. Full classifier thresholds and the empirical six-model grid are in Sections 5.5, 5.7, and Appendix E of the paper.
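The order of checks can be sketched as a small classifier. Everything numeric here except the range ≤ 2 rule from the table is an assumption made for illustration: the jitter cutoff, the "lots of" and "almost none" shares, the choice to measure range over per-item mean scores, and the fallback to CALIBRATED for anything left over. The real thresholds are in the paper, and `classify_regime` is a hypothetical name, not the calibration-probe API.

```python
from statistics import mean, pstdev

# All thresholds below are illustrative assumptions, not the paper's values.
JITTER_STDEV = 2.0     # mean per-item run-to-run std dev that counts as jittery
EXTREME_SHARE = 0.40   # share of scores at one extreme that counts as "lots"
OPPOSITE_SHARE = 0.05  # share at the other extreme that counts as "almost none"
TIGHT_RANGE = 2        # range <= 2 is the table's PICKS_A_NUMBER rule

def classify_regime(runs):
    """runs: list of runs, each a list of 1-10 scores for the same items in the same order."""
    per_item = list(zip(*runs))  # scores grouped by item across runs
    # 1. JITTERY first: unstable runs make the shape rules unreliable.
    jitter = mean([pstdev(scores) for scores in per_item])
    if jitter >= JITTER_STDEV:
        return "JITTERY"
    # 2. Extreme clusters.
    flat = [s for run in runs for s in run]
    high = sum(s >= 9 for s in flat) / len(flat)
    low = sum(s <= 2 for s in flat) / len(flat)
    if high >= EXTREME_SHARE and low <= OPPOSITE_SHARE:
        return "INFLATION_LIKELY"
    if low >= EXTREME_SHARE and high <= OPPOSITE_SHARE:
        return "DEFLATION_LIKELY"
    # 3. Compressed middle (assumption: range is taken over per-item means).
    item_means = [mean(scores) for scores in per_item]
    if max(item_means) - min(item_means) <= TIGHT_RANGE:
        return "PICKS_A_NUMBER"
    # 4. Assumed fallback once the other regimes are ruled out.
    return "CALIBRATED"

# Usage with synthetic probes: 3 runs of 20 items each.
picks = [[7, 7, 6, 7, 8] * 4] * 3
inflated = [[9, 10, 9, 10, 8, 9, 10, 9, 7, 9] * 2] * 3
spread = [[1, 3, 5, 7, 9, 2, 4, 6, 8, 10] * 2] * 3
jittery = [[1] * 20, [10] * 20, [1] * 20]
for name, runs in [("picks", picks), ("inflated", inflated), ("spread", spread), ("jittery", jittery)]:
    print(name, classify_regime(runs))
```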
Run the probe.
Install calibration-probe, point it at your candidate scoring model, and read the regime label. If it lands CALIBRATED, use the lighter touch. If INFLATION_LIKELY, run the full pipeline; if DEFLATION_LIKELY, run it with reduced counter-bias. If JITTERY, ensemble first. If PICKS_A_NUMBER, switch models: the formula cannot rescue intrinsically weak signal.
In production
A whole product runs on this formula.
MyBench is a 45-minute interview that turns your real work into a private, saturate-resistant AI benchmark suite — then scores it across model × harness combinations. The scoring engine is the seven-principle methodology on this page, reused unchanged. Same discrete impact set. Same sqrt normalization. Same 5×5 perspective × criterion matrix. Different domain, same math.
Stop reading model reviews. Build your own benchmark. Three to five tests tuned to your work, with planted traps and an evidence guide. Re-run weekly when a new model ships.
- Scoring: 7-principle, this page
- Interview: ~45 minutes
- Output: 3–5 benchmarks + traps
- Axis: model × harness
Why this matters here.
The methodology works on the things it was calibrated against (hackathon code, BLS occupations) — and on a domain it had nothing to do with. Same formula, different surface. That's the test for whether a scoring approach is general or just overfit to its training anecdotes.
If you want to see the seven principles applied end-to-end on a fresh problem before you install one of the skills below — start there.
One preflight, three ways to score.
One repo, four skills. calibration-probe is the 30-second preflight — run it first to find out which regime your model is in. Then pick the scoring skill that matches your input shape. All four are skills.sh-installable into Claude Code, Cursor, Goose, OpenCode, and any other skills-aware agent.
// preflight calibration-probe first — 30 seconds, tells you whether the methodology will help on your model.
// not sure which? what-works-feedback-judge is the simplest. hackathon-judge if you have a code submission. evidence-scoring if you're bringing your own domain.
calibration-probe
A 30-second preflight. Will the methodology even help on your model?
Synthetic 20-item rating test, repeated 30 times, no ground truth required. Classifies your candidate model into one of five regimes (CALIBRATED, INFLATION_LIKELY, DEFLATION_LIKELY, PICKS_A_NUMBER, JITTERY) and tells you whether to run the full pipeline, use a lighter touch, or switch models. Run before any of the three scoring skills below.
evidence-scoring
The seven-principle methodology, generic.
Bring your own domain. Define your matrix. The skill walks you through cataloging signed evidence items and runs the formula. The methodology, nothing prescribed about WHAT you score.
what-works-feedback-judge
A 4-question feedback loop. Score any draft.
Pre-baked Working / Not working / Missing / Confusing application. Hand it any draft, spec, plan, or pitch and get a 0–100 readiness score plus four grouped action lists. Iterate; v1 → v2 score delta tells you whether the revision actually moved.
hackathon-judge
Four-pass project judging — code, demo, math, mentoring.
Score any project submission with a codebase (and optional demo video) against a 5×5 evidence matrix. Independent passes prevent polish from masking thin code and vice versa. Math computes the scores; the team gets a grounded mentoring report. Reads a `calibration_regime` parameter from the probe to dial counter-bias up or down.
The paper.
Don't Let the LLM Pick a Number — methodology paper. Calibrated on 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations across 9 models. Includes the full derivation, ablations, and the impeccable rescoring case study.
- Title: Don't Let the LLM Pick a Number
- Status: v0.8.0 draft
- Length: ~12k words + 5 appendices
- Calibrated on: 90+ hackathon submissions (3 events), 342 BLS occupations
- Models tested: 9 frontier models
- License: MIT (markdown source)