Methodology

How a claim becomes a verdict.

Every claim on Whalespan passes through the same pipeline. There are no shortcuts and no exceptions.

§ 1 Extraction

We ingest the source content (podcast, video, post, paper) and transcribe it where needed. A claim-extraction model surfaces every concrete proposition made by every speaker, preserving hedge language and negations. A claim might be “rapamycin extends median lifespan in mice by 9–14%” or “blue light blocking glasses improve subjective sleep quality.” Identical claims that appear in more than one piece of content are deduplicated into a single canonical claim.
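As a rough illustration, deduplication into a canonical claim could be sketched as below. This assumes claims are matched on a normalized text key, which is a simplification; a real pipeline would more likely use semantic matching. The names `normalize` and `deduplicate` are hypothetical.

```python
import re
from collections import defaultdict


def normalize(claim: str) -> str:
    """Hypothetical canonical key: lowercase, drop punctuation, collapse spaces.

    Words such as "not" and "may" are kept, so negations and hedges survive.
    Percent signs, hyphens, and en dashes are kept so ranges like 9–14% stay intact.
    """
    text = claim.lower()
    text = re.sub(r"[^\w\s%\-–]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(claims):
    """Group (source_id, claim_text) pairs under one canonical key."""
    canonical = defaultdict(list)
    for source_id, text in claims:
        canonical[normalize(text)].append(source_id)
    return dict(canonical)
```

Because the key keeps every word, “extends lifespan” and “does not extend lifespan” stay separate claims.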

§ 2 Paper matching

Each claim is matched against the academic literature using the Semantic Scholar Graph API, with results ranked by relevance and citation-graph proximity. We require at least three candidate papers before a claim is gradable. If we find fewer than three, we mark the claim “Limited Research” and tell you so; we don’t guess.
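The three-paper gate can be sketched as follows. This is a minimal example that assumes candidates have already been retrieved and scored; the `Paper` type, the single `relevance` number, and the `top_k` parameter are hypothetical stand-ins rather than the actual ranking.

```python
from dataclasses import dataclass

MIN_CANDIDATE_PAPERS = 3  # threshold stated in the text


@dataclass
class Paper:
    paper_id: str
    relevance: float  # assumed combined relevance / citation-proximity score


def select_papers(candidates, top_k=3):
    """Return the top-ranked papers, or None when the claim is not gradable."""
    if len(candidates) < MIN_CANDIDATE_PAPERS:
        return None  # caller would mark the claim "Limited Research"
    ranked = sorted(candidates, key=lambda p: p.relevance, reverse=True)
    return ranked[:top_k]
```

Returning `None` rather than a short list makes the “we don’t guess” rule explicit: a claim with too few papers never reaches the graders.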

§ 3 Dual grader (N=2)

Two independent LLM graders — Claude and GPT — read the claim alongside its top-ranked papers. Each grader independently assigns an evidence grade and a risk level. A verdict is published only when both graders agree on both dimensions. Disagreement routes to a human review queue.
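The agreement rule can be expressed in a few lines. This sketch assumes each grader returns a pair of labels; the `Grade` type and the label strings used with it are hypothetical.

```python
from typing import NamedTuple, Optional


class Grade(NamedTuple):
    evidence: str  # evidence grade assigned by one grader
    risk: str      # risk level assigned by the same grader


def resolve(grade_a: Grade, grade_b: Grade) -> Optional[Grade]:
    """Return the shared grade, or None to signal the human review queue."""
    if grade_a == grade_b:  # tuple equality checks both dimensions
        return grade_a
    return None
```

Agreement on evidence alone is not enough: a mismatch on either dimension yields `None`.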

§ 4 Verdict

The two grades resolve to one of four verdicts:

WELL SUPPORTED — multi-study consensus, clean mechanism, low risk
PARTIALLY SUPPORTED — directional evidence, caveats apply
LIMITED RESEARCH — insufficient literature to grade
NOT SUPPORTED — contradicted by the literature

§ 5 Evidence Score

A per-claim 0–100 score derived from the strength and convergence of supporting papers, the presence of contradicting evidence, and the sample sizes involved. The score is a summary of the literature, not a vote.

§ 6 Evaluation gates

Before any change to the grader prompts ships, we re-run three gold sets:

Grading gold set — 100 claims hand-graded by the founder and a contract labeler. Target: ≥85% N=2 agreement.
Extraction gold set — 30–50 transcript segments. Tests whether hedge language and negations survive extraction.
Paper-match gold set — 50 claims with hand-curated top-3 papers. Target: ≥70% recall@3. If we can’t reach 70%, we show “papers being researched” rather than ship a wrong citation.
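The two numeric targets above could be checked along these lines. This is a sketch under assumptions: agreement is read here as the share of gold claims where the pipeline's label matches the hand label, and recall@3 is averaged per claim. The function names are hypothetical.

```python
GRADING_TARGET = 0.85  # ≥85% agreement, from the text
RECALL_TARGET = 0.70   # ≥70% recall@3, from the text


def agreement_rate(predicted, gold):
    """Share of gold claims whose predicted label equals the hand label."""
    if not gold:
        raise ValueError("gold set is empty")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def recall_at_k(retrieved, curated, k=3):
    """Mean fraction of curated papers found in the top-k retrieved, per claim."""
    per_claim = [
        len(set(found[:k]) & set(wanted)) / len(wanted)
        for found, wanted in zip(retrieved, curated)
    ]
    return sum(per_claim) / len(per_claim)


def gates_pass(agreement, recall):
    """A prompt change ships only if both targets are met."""
    return agreement >= GRADING_TARGET and recall >= RECALL_TARGET
```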

§ 7 Track records

A speaker’s Evidence Score is the rolling average of the evidence scores of every claim they have made on Whalespan-graded content. It is a function of how the claims survive the papers, not a function of the speaker. We do not adjust a track record based on personal characteristics, political alignment, employer, or anything else.
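A speaker-level score of this kind could be computed as below. The text does not specify the rolling window, so the `window` parameter is an assumption; leaving it unset averages every graded claim.

```python
from statistics import mean


def speaker_score(claim_scores, window=None):
    """Average per-claim evidence scores (0–100), oldest first.

    `window` is a hypothetical parameter: when set, only the most
    recent `window` claims are averaged.
    """
    if not claim_scores:
        return None  # no graded claims yet, so no track record
    recent = claim_scores[-window:] if window else claim_scores
    return mean(recent)
```

The function takes only claim scores as input, which mirrors the point that nothing about the speaker enters the calculation.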

§ 8 When a verdict changes

The literature evolves. A claim graded “Partially Supported” in 2026 may become “Well Supported” by 2028, or vice versa. When a verdict changes, we mark the change date and the paper(s) that triggered it, and we surface the change to anyone who has saved or read that claim.
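One way to represent such a change is a small immutable record. This is an illustrative sketch only; the `VerdictChange` type and its field names are hypothetical, chosen to hold the change date and the triggering papers the text mentions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class VerdictChange:
    claim_id: str
    old_verdict: str
    new_verdict: str
    changed_on: date
    trigger_paper_ids: Tuple[str, ...]  # paper(s) that triggered the change


def record_change(claim_id, old_verdict, new_verdict,
                  trigger_paper_ids, changed_on) -> Optional[VerdictChange]:
    """Return a change record, or None when the verdict did not move."""
    if old_verdict == new_verdict:
        return None
    return VerdictChange(claim_id, old_verdict, new_verdict,
                         changed_on, tuple(trigger_paper_ids))
```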

This is not a finished product. The pipeline above will get sharper. When we change a meaningful piece of it, we will tell you what changed and which claims were re-graded.