pickanumber v0.7 · draft
90+ submissions · 342 BLS occupations · 9 frontier models

Don't let the LLM pick a number.

Ask an LLM to score something on a 0–10 scale and you'll get a 7. Ask again, you'll get a 7. Vary the input and you'll still get a 7. The model isn't grading — it's anchoring. This is a methodology and a working set of tools that fix the problem the same way every time: the LLM finds evidence; math computes the score.

Install the default skill

LLMs anchor to 7-out-of-10.

Across 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations, models picked numbers in a tight band centered on 7. Variance was high run-to-run, but the central tendency stayed locked. Calibration was decoration, not signal.

Asking nicely doesn't help.

"Be honest." "Use the full scale." "A 4 means genuinely excellent." We tried all of it. Public benchmarks like lechmazur/writing retired their absolute rubric scoring after 9,139 score rows showed it. The pattern is the model, not the prompt.

Don't ask for a number. Ask for evidence.

LLMs are reliable at one thing: finding bounded, signed observations with file:line or timestamp evidence. Collect those. Run them through a formula. The number you get out is stable across reruns, calibrated, and defensible.

Seven principles.

Each one rules out a way the LLM-pick-a-number failure mode sneaks back in. Together they define a scoring pipeline you can defend against an adversary who's read the prompt.

  1. Separate observation from scoring

     The LLM finds evidence. A formula, not the LLM, produces the score. The number is computed, not chosen.

  2. Discrete signed impact items

     Every piece of evidence gets one of {+5, +3, +2, +1, −1, −2, −3, −5}. Forces commitment. Removes the 7-out-of-10 anchor. (A data sketch follows this list.)

  3. Diminishing returns (sqrt)

     normalized = net_impact / sqrt(total_items). The 40th item adds less than the 4th. Evidence farming is punished.

  4. Density-weighted confidence

     Confidence = how much evidence the scorer found, not how sure the scorer feels. Sparse runs are visibly low-confidence.

  5. Anchored center

     Sparse-evidence runs regress toward 50. The multiplier never exceeds 1.0: high evidence confirms, never amplifies beyond raw.

  6. Bounded scale with self-check

     Final scores live in [0, 100]. Across criteria, the spread must be ≥ 20; otherwise the evaluator was not discriminating.

  7. Separation of LLM and deterministic computation

     Independent passes by different model families collect evidence. Math, not the LLM, combines them. Adversarial synthesis catches contradictions.
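
What a single evidence item could look like as data, in a minimal Python sketch. The class name, field names, and the example path are illustrative; the methodology itself only asks for a signed impact from the discrete set plus a file:line or timestamp pointer (principles 1–2).

from dataclasses import dataclass

ALLOWED_IMPACTS = {+5, +3, +2, +1, -1, -2, -3, -5}

@dataclass(frozen=True)
class EvidenceItem:
    """One bounded, signed observation found by the LLM pass."""
    impact: int     # exactly one value from ALLOWED_IMPACTS (principle 2)
    evidence: str   # file:line or timestamp pointer, e.g. "src/app.py:142" (illustrative)
    note: str = ""  # short description of what was observed

    def __post_init__(self):
        if self.impact not in ALLOWED_IMPACTS:
            raise ValueError(f"impact must be one of {sorted(ALLOWED_IMPACTS)}")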

Discrete impact → diminishing returns → confidence-weighted.

Two application patterns. Use the pooled variant for simple checkers (one bucket of evidence, one final score). Use the per-criterion variant for matrix benchmarks (formula runs once per criterion; weighted average produces the overall).

# Per criterion
net_impact         = sum(item.impact for item in items)
total_items        = len(items)
normalized         = net_impact / sqrt(total_items)
raw                = clamp(50 + normalized * 8.0, 0, 100)
density            = total_items / 20
multiplier         = 0.75 + 0.25 * clamp(density, 0, 1)   # never > 1.0
final              = round(50 + (raw - 50) * multiplier)
confidence         = clamp(density, 0, 1)
# Across criteria (overall)
overall_score       = round(sum(c.final * c.weight for c in criteria))
overall_confidence  = min(c.confidence for c in criteria)
self_check_span     = max(c.final for c in criteria) - min(c.final for c in criteria)
                      # must be >= 20

Discrete impact set: {+5, +3, +2, +1, −1, −2, −3, −5}. Hard cap of 5 items per perspective per criterion per pass. The multiplier never exceeds 1.0 (confirms, never amplifies). Sparse evidence is visibly low-confidence, not silently confident.
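
The same per-criterion formula as a minimal, runnable Python sketch. The function name score_criterion and the zero-evidence guard are illustrative choices, not part of the published methodology.

import math

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def score_criterion(impacts):
    """impacts: signed evidence values drawn from {+5, +3, +2, +1, -1, -2, -3, -5}."""
    total_items = len(impacts)
    if total_items == 0:
        # assumption: no evidence regresses to the anchored center with zero confidence
        return 50, 0.0
    net_impact = sum(impacts)
    normalized = net_impact / math.sqrt(total_items)
    raw        = clamp(50 + normalized * 8.0, 0, 100)
    density    = total_items / 20
    multiplier = 0.75 + 0.25 * clamp(density, 0, 1)   # never exceeds 1.0
    final      = round(50 + (raw - 50) * multiplier)
    confidence = clamp(density, 0, 1)
    return final, confidence

For the pooled variant, call it once on the single evidence bucket; for the matrix variant, call it per criterion and combine the results with the weighted average above.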

Calibration anchors.

The 8.0 scale factor was tuned to put scores at familiar landmarks. The numbers below are not opinions — they fall out of the formula given each normalized_impact.

normalized_impact    raw_score    tier
+5.0                 ~90          Exceptional (5–10% of submissions)
+2.5                 ~70          Above average
 0.0                 ~50          Average
−2.5                 ~30          Below average
−5.0                 ~10          Poor
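
A quick check, before the density multiplier and rounding: raw = 50 + normalized_impact × 8.0 reproduces every anchor in the table.

for normalized_impact in (5.0, 2.5, 0.0, -2.5, -5.0):
    print(f"{normalized_impact:+.1f} -> raw {50 + normalized_impact * 8.0:.0f}")
# +5.0 -> raw 90, +2.5 -> raw 70, +0.0 -> raw 50, -2.5 -> raw 30, -5.0 -> raw 10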

Run the math by hand.

Four small examples + one real-world rescoring. The toy cases show how the pieces interact; the rescoring shows what changes when an existing benchmark drops the LLM-picks-a-number step.

  A. Strong, abundant evidence

     given: net_impact = +25, total_items = 25

     • normalized = 25 / sqrt(25) = 5.0
     • raw = 50 + (5.0 × 8.0) = 90
     • density = 25/20 = 1.25 → multiplier = 1.0
     • final = 90, confidence = 1.0

  B. Strong, sparse evidence

     given: net_impact = +25, total_items = 4

     • normalized = 25 / sqrt(4) = 12.5
     • raw = 50 + (12.5 × 8.0) = 150 → clamped to 100
     • density = 4/20 = 0.2 → multiplier = 0.80
     • final = round(50 + 50 × 0.80) = 90, confidence = 0.2

  C. Average

     given: net_impact = 0, total_items = 20

     • normalized = 0
     • raw = 50
     • density = 1.0 → multiplier = 1.0
     • final = 50, confidence = 1.0

  D. Weak, well-evidenced

     given: net_impact = −15, total_items = 25

     • normalized = −15 / sqrt(25) = −3.0
     • raw = 50 + (−3.0 × 8.0) = 26
     • density = 1.25 → multiplier = 1.0
     • final = 26, confidence = 1.0 (high confidence in a low score)
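
The same four cases, replayed in a few lines of Python. score_from_totals is an illustrative helper that takes net_impact and total_items directly instead of a list of items.

import math

def score_from_totals(net_impact, total_items):
    normalized = net_impact / math.sqrt(total_items)
    raw        = max(0.0, min(100.0, 50 + normalized * 8.0))
    density    = total_items / 20
    multiplier = 0.75 + 0.25 * min(1.0, density)   # never exceeds 1.0
    final      = round(50 + (raw - 50) * multiplier)
    confidence = min(1.0, density)
    return final, confidence

for label, net, n in [("A", 25, 25), ("B", 25, 4), ("C", 0, 20), ("D", -15, 25)]:
    print(label, score_from_totals(net, n))
# A (90, 1.0)  B (90, 0.2)  C (50, 1.0)  D (26, 1.0)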

Rescoring case study: pbakaus/impeccable

full analysis →

Paul Bakaus' frontend-design skill bundle scored UIs on the 10 Nielsen heuristics, 0–4 each, summed to a 0–40 band. Two judges (an LLM pass + a deterministic detector with 24 antipattern rules) ran in isolation — exactly the right architecture. But the LLM still picked the numbers. We replaced that step with signed-evidence-item collection and ran the standard formula.

vanilla 76 → rescored 59 (Δ 17 pts)

Stable across runs. The page wasn't bad — but it wasn't 76. The vanilla score was carrying ambient charity.

Companion analysis: trycua/cua-bench

full analysis →

Computer-use agent benchmark. Reward is a deterministic float — no LLM in the scoring path, so principle 7 is satisfied by construction. The remaining opportunity is principles 2–6: signed multi-signal evidence accumulation instead of single-signal pass/fail. The honest take: not every benchmark needs the full methodology. cua-bench is a partial fit, not a slam-dunk like impeccable.

Three ways to use it.

One repo, three skills. Pick the one that matches your need. All three are skills.sh-installable into Claude Code, Cursor, Goose, OpenCode, and any other skills-aware agent.

// not sure? what-works-feedback-judge is the simplest. hackathon-judge if you have a code submission. evidence-scoring if you're bringing your own domain.

evidence-scoring

The seven-principle methodology, generic.

SKILL.md →

Bring your own domain. Define your matrix. The skill walks you through cataloging signed evidence items and runs the formula. It is the methodology only; nothing is prescribed about what you score.

what-works-feedback-judge

A 4-question feedback loop. Score any draft.

SKILL.md →

A pre-baked application of the Working / Not working / Missing / Confusing loop. Hand it any draft, spec, plan, or pitch and get a 0–100 readiness score plus four grouped action lists. Iterate; the v1 → v2 score delta tells you whether the revision actually moved the needle.

hackathon-judge

Four-pass project judging — code, demo, math, mentoring.

SKILL.md →

Score any project submission with a codebase (and optional demo video) against a 5×5 evidence matrix. Independent passes prevent polish from masking thin code and vice versa. Math computes the scores; the team gets a grounded mentoring report.

The paper.

Don't Let the LLM Pick a Number — methodology paper. Calibrated on 90+ hackathon submissions (codebases and demo videos, three events) and 342 BLS occupations across 9 models. Includes the full derivation, ablations, and the impeccable rescoring case study.

Title: Don't Let the LLM Pick a Number
Status: v0.7 draft
Length: ~9k words + appendices
Calibrated on: 90+ hackathon submissions (3 events), 342 BLS occupations
Models tested: 9 frontier models
License: MIT (markdown source)
Read the paper