Don't Let the LLM Pick a Number

LLMs anchor to 7-out-of-10.

Across 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations, models picked numbers in a tight band centered on 7. Variance was high run-to-run, but the central tendency stayed locked. Calibration was decoration, not signal.

Asking nicely doesn't help.

"Be honest." "Use the full scale." "A 4 means genuinely excellent." We tried all of it. Public benchmarks like lechmazur/writing retired their absolute rubric scoring after 9,139 score rows showed the same pattern. The pattern is the model, not the prompt.

Don't ask for a number. Ask for evidence.

LLMs are reliable at one thing: finding bounded, signed observations with file:line or timestamp evidence. Collect those. Run them through a formula. The number you get out is stable across reruns, calibrated, and defensible.
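A minimal sketch of what one such observation could look like as a data structure. The class name `EvidenceItem` and the fields `locator` and `note` are hypothetical, chosen for illustration; only the signed impact values (the discrete set given under principle 02) and the file:line or timestamp requirement come from the text.

```python
from dataclasses import dataclass

# The discrete signed impact set from principle 02.
ALLOWED_IMPACTS = frozenset({5, 3, 2, 1, -1, -2, -3, -5})


@dataclass(frozen=True)
class EvidenceItem:
    """One bounded, signed observation reported by the LLM (hypothetical shape)."""
    impact: int    # must be one of ALLOWED_IMPACTS
    locator: str   # file:line or timestamp, e.g. "src/app.py:42" or "00:01:37"
    note: str      # short description of what was observed

    def __post_init__(self) -> None:
        # Reject anything the formula is not defined for.
        if self.impact not in ALLOWED_IMPACTS:
            raise ValueError(f"impact {self.impact} is not in the discrete set")
        if not self.locator.strip():
            raise ValueError("every item needs file:line or timestamp evidence")


def net_impact(items):
    """Sum of signed impacts; this is the only thing the formula reads."""
    return sum(item.impact for item in items)


items = [
    EvidenceItem(3, "src/app.py:42", "input validated before use"),
    EvidenceItem(-1, "00:01:37", "demo skips the error path"),
]
print(net_impact(items))
```

Rejecting malformed items at construction time keeps the model from smuggling an off-scale number or an unsupported claim into the computation.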

Seven principles.

Each one rules out a way the LLM-pick-a-number failure mode sneaks back in. Together they define a scoring pipeline you can defend against an adversary who's read the prompt.

01 · Separate observation from scoring

The LLM finds evidence. A formula, not the LLM, produces the score. The number is computed, not chosen.

02 · Discrete signed impact items

Every piece of evidence gets one of {+5, +3, +2, +1, −1, −2, −3, −5}. Forces commitment. Removes the 7-out-of-10 anchor.

03 · Diminishing returns (sqrt)

normalized = net_impact / sqrt(total_items). The 40th item adds less than the 4th. Evidence farming is punished.

04 · Density-weighted confidence

Confidence = how much evidence the scorer found, not how sure the scorer feels. Sparse runs are visibly low-confidence.

05 · Anchored center

Sparse-evidence runs regress toward 50. The multiplier never exceeds 1.0: high evidence confirms, never amplifies beyond raw.

06 · Bounded scale with self-check

Final scores live in [0, 100]. Across criteria, the spread must be ≥ 20; otherwise the evaluator was not discriminating.

07 · Separation of LLM and deterministic computation

Independent passes by different model families collect evidence. Math, not the LLM, combines them. Adversarial synthesis catches contradictions.
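The diminishing-returns claim in principle 03 can be checked numerically. This is a small sketch, not part of the published skills: the helper names `normalized` and `marginal_gain` are made up here, and returning 0.0 for an empty list is an assumption.

```python
from math import sqrt


def normalized(impacts):
    """net_impact / sqrt(total_items); 0.0 for an empty list (assumption)."""
    if not impacts:
        return 0.0
    return sum(impacts) / sqrt(len(impacts))


def marginal_gain(k, impact=1):
    """Change in the normalized value when the k-th identical item is added."""
    return normalized([impact] * k) - normalized([impact] * (k - 1))


# The 40th +1 item moves the normalized value less than the 4th does.
print(round(marginal_gain(4), 3), round(marginal_gain(40), 3))

# Padding a strong run with weak items pulls the normalized value down.
strong = [5, 5, 5, 5]
padded = strong + [1] * 6
print(normalized(strong), round(normalized(padded), 2))
```

The second comparison is the anti-farming property: six extra +1 items raise net_impact from 20 to 26, but the normalized value falls because the divisor grows faster than the sum.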

Discrete impact → diminishing returns → confidence-weighted.

Two application patterns. Use the pooled variant for simple checkers (one bucket of evidence, one final score). Use the per-criterion variant for matrix benchmarks (formula runs once per criterion; weighted average produces the overall).

# Per criterion (requires total_items >= 1)
net_impact         = sum(item.impact for item in items)
total_items        = len(items)
normalized         = net_impact / sqrt(total_items)
raw                = clamp(50 + normalized * 8.0, 0, 100)
density            = total_items / 20
multiplier         = 0.75 + 0.25 * clamp(density, 0, 1)   # 0.75..1.0, never > 1.0
final              = round(50 + (raw - 50) * multiplier)
confidence         = clamp(density, 0, 1)

# Across criteria (overall); weights are assumed to sum to 1
overall_score       = round(sum(c.final * c.weight for c in criteria))
overall_confidence  = min(c.confidence for c in criteria)
self_check_span     = max(c.final for c in criteria) - min(c.final for c in criteria)
                      # must be >= 20

Discrete impact set: {+5, +3, +2, +1, −1, −2, −3, −5}. Hard cap of 5 items per perspective per criterion per pass. The multiplier never exceeds 1.0 (confirms, never amplifies). Sparse evidence is visibly low-confidence, not silently confident.
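The formula above can be made runnable in a few lines. The following is a sketch under stated assumptions, not the skills' actual implementation: the names `CriterionScore`, `score_criterion`, and `score_overall` are hypothetical, a criterion with zero items is assumed to score 50 at confidence 0, and weights are renormalized so they need not sum to 1.

```python
from dataclasses import dataclass
from math import sqrt


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


@dataclass
class CriterionScore:
    final: int
    confidence: float
    weight: float


def score_criterion(impacts, weight=1.0, scale=8.0, full_density=20):
    """Apply the per-criterion formula to a list of signed impacts."""
    total_items = len(impacts)
    if total_items == 0:
        # Assumption: no evidence means the anchored center at zero confidence.
        return CriterionScore(final=50, confidence=0.0, weight=weight)
    normalized = sum(impacts) / sqrt(total_items)
    raw = clamp(50 + normalized * scale, 0, 100)
    density = clamp(total_items / full_density, 0, 1)
    multiplier = 0.75 + 0.25 * density  # stays within 0.75..1.0
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    final = round(50 + (raw - 50) * multiplier)
    return CriterionScore(final=final, confidence=density, weight=weight)


def score_overall(criteria, min_span=20):
    """Weighted average, weakest-link confidence, and the spread self-check."""
    total_weight = sum(c.weight for c in criteria)  # renormalize (assumption)
    overall = round(sum(c.final * c.weight for c in criteria) / total_weight)
    span = max(c.final for c in criteria) - min(c.final for c in criteria)
    return {
        "overall_score": overall,
        "overall_confidence": min(c.confidence for c in criteria),
        "self_check_span": span,
        "self_check_ok": span >= min_span,
    }


criteria = [
    score_criterion([5, 3, 3, 2, 2, 1, 1, -1], weight=0.6),
    score_criterion([-3, -2, -2, -1, 1], weight=0.4),
]
print(criteria)
print(score_overall(criteria))
```

Taking the minimum confidence across criteria means one thinly evidenced criterion marks the whole result as low-confidence, which matches the intent that sparse evidence stays visible.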

Calibration anchors.

The 8.0 scale factor was tuned to put scores at familiar landmarks. The numbers below are not opinions — they fall out of the formula given each normalized_impact.

normalized_impact	→ raw_score	tier
+5.0	~90	Exceptional (5–10% of submissions)
+2.5	~70	Above average
0.0	~50	Average
−2.5	~30	Below average
−5.0	~10	Poor
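The table can be regenerated from the raw-score line of the formula alone. A short sketch, with `raw_score` and `normalized_for` as illustrative helper names; it also shows where the clamp starts to bite, which the table does not list.

```python
SCALE = 8.0  # the scale factor from the formula


def raw_score(normalized_impact):
    """raw = clamp(50 + normalized * 8.0, 0, 100)."""
    return max(0.0, min(100.0, 50 + normalized_impact * SCALE))


def normalized_for(raw):
    """Inverse mapping inside the unclamped range."""
    return (raw - 50) / SCALE


for n in (5.0, 2.5, 0.0, -2.5, -5.0):
    print(f"{n:+.1f} -> {raw_score(n):.0f}")

# The raw score saturates once |normalized_impact| reaches this value.
print(normalized_for(100))
```

Anything beyond a normalized impact of ±6.25 is clamped, so very strong sparse runs all land on the same raw score before the multiplier is applied.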

Run the math by hand.

Four small examples + one real-world rescoring. The toy cases show how the pieces interact; the rescoring shows what changes when an existing benchmark drops the LLM-picks-a-number step.

A
Strong, abundant evidence

given · net_impact = +25, total_items = 25
- · normalized = 25 / sqrt(25) = 5.0
- · raw = 50 + (5.0 × 8.0) = 90
- · density = 25/20 = 1.25 → multiplier = 1.0
- · final = 90, confidence = 1.0
B
Strong, sparse evidence

given · net_impact = +20, total_items = 4 (four +5 items, the most four items can carry)
- · normalized = 20 / sqrt(4) = 10.0
- · raw = 50 + (10.0 × 8.0) = 130 → clamped 100
- · density = 4/20 = 0.2 → multiplier = 0.80
- · final = round(50 + 50 × 0.80) = 90, confidence = 0.2
C
Average

given · net_impact = 0, total_items = 20
- · normalized = 0
- · raw = 50
- · density = 1.0 → multiplier = 1.0
- · final = 50, confidence = 1.0
D
Weak, well-evidenced

given · net_impact = −15, total_items = 25
- · normalized = −15 / sqrt(25) = −3.0
- · raw = 50 + (−3.0 × 8.0) = 26
- · density = 1.25 → multiplier = 1.0
- · final = 26, confidence = 1.0 (high confidence in a low score)
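The hand calculations can be cross-checked with a few lines that take the totals directly. This is a sketch with a hypothetical helper name, `score_from_totals`. The sparse case here is built from four +5 items (net +20), the strongest four-item run the impact set permits.

```python
from math import sqrt


def score_from_totals(net_impact, total_items, scale=8.0, full_density=20):
    """Return (final, confidence) from aggregate totals; needs total_items >= 1."""
    normalized = net_impact / sqrt(total_items)
    raw = max(0.0, min(100.0, 50 + normalized * scale))
    confidence = min(1.0, total_items / full_density)
    multiplier = 0.75 + 0.25 * confidence
    return round(50 + (raw - 50) * multiplier), confidence


cases = {
    "strong, abundant": (25, 25),
    "strong, sparse (four +5 items)": (20, 4),
    "average": (0, 20),
    "weak, well-evidenced": (-15, 25),
}
results = {label: score_from_totals(net, n) for label, (net, n) in cases.items()}
for label, (final, confidence) in results.items():
    print(f"{label}: final={final}, confidence={confidence}")
```

The abundant and sparse strong cases reach the same final score but very different confidence, which is why the two numbers should always be reported together.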

Rescoring case study: pbakaus/impeccable

Paul Bakaus' frontend-design skill bundle scored UIs on the 10 Nielsen heuristics, 0–4 each, summed to a 0–40 band. Two judges (an LLM pass + a deterministic detector with 24 antipattern rules) ran in isolation — exactly the right architecture. But the LLM still picked the numbers. We replaced that step with signed-evidence-item collection and ran the standard formula.

vanilla: 76 · rescored: 59 · Δ: 17 pts

Stable across runs. The page wasn't bad — but it wasn't 76. The vanilla score was carrying ambient charity.

Companion analysis: trycua/cua-bench

Computer-use agent benchmark. Reward is a deterministic float with no LLM in the scoring path, so principles 1 and 7 are satisfied by construction. The remaining opportunity is principles 2–6: signed multi-signal evidence accumulation instead of single-signal pass/fail. The honest take: not every benchmark needs the full methodology. cua-bench is a partial fit, not a slam-dunk like impeccable.

Three ways to use it.

One repo, three skills. Pick the one that matches your need. All three are skills.sh-installable into Claude Code, Cursor, Goose, OpenCode, and any other skills-aware agent.

// not sure? what-works-feedback-judge is the simplest. hackathon-judge if you have a code submission. evidence-scoring if you're bringing your own domain.

evidence-scoring

The seven-principle methodology, generic.

Bring your own domain. Define your matrix. The skill walks you through cataloging signed evidence items and runs the formula. It supplies the methodology and prescribes nothing about what you score.

what-works-feedback-judge

A 4-question feedback loop. Score any draft.

Pre-baked Working / Not working / Missing / Confusing application. Hand it any draft, spec, plan, or pitch and get a 0–100 readiness score plus four grouped action lists. Iterate: the v1 → v2 score delta tells you whether the revision actually moved the score.

hackathon-judge

Four-pass project judging — code, demo, math, mentoring.

Score any project submission with a codebase (and optional demo video) against a 5×5 evidence matrix. Independent passes prevent polish from masking thin code and vice versa. Math computes the scores; the team gets a grounded mentoring report.

The paper.

Don't Let the LLM Pick a Number — methodology paper. Calibrated on 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations across 9 models. Includes the full derivation, ablations, and the impeccable rescoring case study.

Title: Don't Let the LLM Pick a Number
Status: v0.7 draft
Length: ~9k words + appendices
Calibrated on: 90+ hackathon submissions (3 events), 342 BLS occupations
Models tested: 9 frontier models
License: MIT (markdown source)
