CapitalBench Methodology

What CapitalBench Measures

CapitalBench evaluates one narrow question: given the same frozen market context, which single market option does a model choose for the next month? It is a benchmark for reproducible model comparison, not a trading system, portfolio optimizer, or investment recommendation engine.

Models may use internal learned knowledge and general market priors. They do not have to behave like blank slates. The controlled part is the externally supplied information: every model receives the same prompt, briefing, option universe, and optional mechanical market-data table.

CapitalBench does not publish a combined mega-score. Official one-shot results, cumulative official results, and repeated-run stability results are separate views, and they are never blended into a weighted headline score.

Round Lifecycle

Create the round directory. Each round lives under rounds/<round_id>/ with a manifest, prompt, briefing, option universe, price files, hashes, and isolated run folders.
Prepare the model-facing material. Operators write or import the factual briefing, exact prompt, and frozen option list.
Validate the universe. Non-cash ETF tickers must pass Tiingo EOD validation before a public round is frozen.
Add optional trailing-return context. A mechanical full-universe return table may be generated from adjusted closes and sorted by option order.
Freeze the inputs. capitalbench hash-round writes SHA-256 hashes for the round inputs before submissions are collected.
Collect model calls before the deadline. Official runs use one call per model. Stability runs use repeated calls and a different run_id.
Validate submissions. Raw provider responses are preserved. Only schema-valid submissions move into parsed submissions and scoring.
Fetch prices and score after resolution. Entry prices can be collected while a round is pending; exit prices are collected only after the manifest horizon resolves.
Publish normalized artifacts. Public results, hashes, reports, and leaderboard rows are synced to the website read model when Supabase sync is configured.

Round Artifacts

The local round directory is the audit source of truth. Supabase stores normalized published copies for the website, but the canonical record remains the hashed round artifact set.

Artifact	Audience	Purpose
`manifest.yaml`	Public	Round metadata, decision deadline, entry rule, exit rule, horizon, and methodology version.
`prompt.md`	Model-facing	The exact task instruction sent to every model.
`briefing.md`	Model-facing	Neutral factual context available at decision time.
`options.yaml`	Model-facing	The only valid choices. Each public submission must select exactly one option id.
`market_data/universe_trailing_returns.*`	Model-facing when present	Mechanical 7-day, 30-day, 6-month, and 1-year trailing returns from adjusted close data.
`hashes.json`	Public audit	SHA-256 hashes proving the frozen input files used for the round.
`research/*`	Audit, except final briefing	Research manifest, hashes, source fact report, audit report, and final model-facing briefing.
`runs/<run_id>/*`	Audit and scoring	Raw responses, normalized raw payloads, parsed submissions, run logs, validation summaries, and results.
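The trailing-return artifact listed above is described as mechanical, so the arithmetic can be sketched in a few lines. This is only an illustration, not the CapitalBench implementation: the function name, the calendar-day window lengths, and the rule of using the latest available close on or before each window start are all assumptions.

```python
from datetime import date, timedelta
from typing import Dict, Optional

# Assumed lookback windows in calendar days. The source names the horizons
# (7-day, 30-day, 6-month, 1-year) but not how they are measured.
WINDOWS = {"7d": 7, "30d": 30, "6m": 182, "1y": 365}


def trailing_returns(adj_closes: Dict[date, float], as_of: date) -> Dict[str, Optional[float]]:
    """Return end_price / start_price - 1 for each window, or None if no data.

    The end price is the latest adjusted close on or before `as_of`; the start
    price is the latest adjusted close on or before the window start date.
    """
    dates = sorted(adj_closes)
    end_dates = [d for d in dates if d <= as_of]
    if not end_dates:
        return {label: None for label in WINDOWS}
    end_price = adj_closes[end_dates[-1]]

    result: Dict[str, Optional[float]] = {}
    for label, days in WINDOWS.items():
        window_start = as_of - timedelta(days=days)
        start_dates = [d for d in dates if d <= window_start]
        # No price old enough for this window: leave the cell empty.
        result[label] = end_price / adj_closes[start_dates[-1]] - 1 if start_dates else None
    return result
```

Because the calculation uses only prices dated on or before the as-of date, the table can be generated and hashed before any model call.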

Research And Briefing Rules

Deep research output is stored as audit material first. The only research artifact copied into the model prompt is research/final_briefing.md, which becomes round-level briefing.md. Market fact reports, source ledgers, and briefing audit reports remain audit-only.

The model-facing briefing should include facts, dates, values, forecasts labeled as forecasts, scheduled catalysts, and source-reported uncertainties. It should not include opinion, interpretation, scenario analysis, "why it matters" commentary, affected-market mapping, recommendations, or option rankings.

capitalbench import-research \
  --round rounds/<id> \
  --market-fact-report market_fact_report.md \
  --audit-report briefing_audit_report.md \
  --final-briefing final_briefing.md \
  --research-cutoff-utc "YYYY-MM-DDTHH:MM:SSZ"

Universe Policy

Public rounds use CapitalBench Universe v1.5: a fixed ETF universe plus CASH. The model sees readable option ids, names, public symbols, asset classes, categories, groups, risk buckets, and exposure descriptions. Internal fields and provider-specific data-fetching fields are kept out of the prompt.

All non-cash options are US-listed ETF tickers and must validate against Tiingo EOD data before the round is frozen. CASH has no ticker and is skipped during Tiingo validation.

capitalbench validate-universe \
  --round rounds/<id> \
  --start-date YYYY-MM-DD \
  --end-date YYYY-MM-DD

Submission Schema

Each model must return one JSON or YAML object. A submission with multiple selected assets is invalid. Invalid raw responses remain preserved, but they are not scored and cannot enter a public official leaderboard.

round_id
model_id
provider
mode: closed_capability
run_type
replicate_index
replicate_count
is_official_score
selected_option_id
confidence
rationale_summary
key_risks

Official runs require replicate_index: 1, replicate_count: 1, and is_official_score: true. Stability runs use repeated replicate indexes for each model and require is_official_score: false.
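The schema rules above could be checked with a small validator along these lines. It is a sketch under assumptions, not the project's actual validator: the function name is hypothetical, and the run_type value "stability" is assumed by analogy with the "official" value used on the command line.

```python
REQUIRED_FIELDS = (
    "round_id", "model_id", "provider", "mode", "run_type",
    "replicate_index", "replicate_count", "is_official_score",
    "selected_option_id", "confidence", "rationale_summary", "key_risks",
)


def validate_submission(sub: dict, valid_option_ids: set) -> list:
    """Return a list of error strings; an empty list means the checks passed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in sub]
    if errors:
        return errors

    # Exactly one option id from the frozen list; a list of ids is rejected.
    choice = sub["selected_option_id"]
    if not isinstance(choice, str) or choice not in valid_option_ids:
        errors.append("selected_option_id must be one id from the frozen option list")

    if sub["run_type"] == "official":
        if sub["replicate_index"] != 1 or sub["replicate_count"] != 1:
            errors.append("official runs require replicate_index 1 and replicate_count 1")
        if sub["is_official_score"] is not True:
            errors.append("official runs require is_official_score: true")
    elif sub["run_type"] == "stability":  # assumed value name
        if sub["is_official_score"] is not False:
            errors.append("stability runs require is_official_score: false")
    return errors
```

Returning every error rather than stopping at the first one makes the validation summary more useful when a raw response is preserved but not scored.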

Model Execution Policy

Official: one call per model. The official result is the selected asset from one valid provider call in the selected official run.

Stability: repeated calls. Stability runs ask the same model the same question multiple times to measure decision consistency.
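The source does not define how decision consistency is computed. One plausible reading, shown here purely as an assumption, is the share of replicates that chose the most common option. The function name and the option ids in the usage note are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def modal_share(selected_option_ids: List[str]) -> Tuple[str, float]:
    """Return the most frequently chosen option id and its share of the replicates."""
    if not selected_option_ids:
        raise ValueError("at least one valid replicate is required")
    counts = Counter(selected_option_ids)
    option_id, hits = counts.most_common(1)[0]
    return option_id, hits / len(selected_option_ids)
```

For example, four replicates that selected the hypothetical ids "us_large_cap", "us_large_cap", "cash", and "us_large_cap" would give a modal share of 0.75 for "us_large_cap".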

Every model receives the same prompt, briefing, option universe, and frozen market-data artifact.
Temperature is set to 0 where supported.
Tools, browsing, web search, code execution, and external retrieval are disabled at the API level where supported.
Live market data and intentional use of post-cutoff facts, prices, news, or events are not allowed.
Reasoning or thinking is set to the lowest provider-supported setting that still allows valid structured output.
Hidden reasoning tokens are recorded when exposed but are not treated as directly comparable across providers.
Provider token usage, latency, and cost are logged when available.

capitalbench run-round \
  --round rounds/<id> \
  --models configs/models.local.yaml \
  --run-id official-YYYYMMDD \
  --run-type official \
  --allow-real-api-calls

Retry Policy

An official retry is allowed only when no valid decision can be parsed because of infrastructure or format failure: malformed JSON, truncated response, provider transport or API failure, or schema output failure.

A retry is not allowed because of the selected asset, confidence value, or rationale quality. Failed raw responses must remain in the run artifacts and must stay ineligible for public official scoring.
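The retry rule can be expressed as a single predicate. The sketch below is illustrative only: the enum and function names are hypothetical, and the four failure kinds are taken directly from the list above.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    MALFORMED_JSON = "malformed_json"
    TRUNCATED_RESPONSE = "truncated_response"
    TRANSPORT_OR_API_FAILURE = "transport_or_api_failure"
    SCHEMA_OUTPUT_FAILURE = "schema_output_failure"


def official_retry_allowed(decision_parsed: bool, failure: Optional[FailureKind]) -> bool:
    """Allow a retry only when no valid decision was parsed and a listed failure occurred.

    The selected asset, confidence value, and rationale are deliberately not
    inputs, so they cannot influence the outcome.
    """
    return (not decision_parsed) and failure is not None
```

Keeping the decision's content out of the function signature is one way to make the "no retry because of the pick" rule hard to violate by accident.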

Pricing And Scoring

CapitalBench scores valid submissions against local price files. Adjusted close is preferred. If only close is supplied, scoring may continue but records a warning in the result artifacts. Tiingo fetching is strict about dates and requires rows matching the manifest entry and exit dates.

Selected return: exit_price / entry_price - 1

Alpha versus S&P 500: selected_return - sp500_return

Regret versus best option: best_option_return - selected_return

Alpha per dollar: alpha_vs_sp500 / cost_usd

Cash is treated as having zero return unless cash prices are explicitly supplied. The main official leaderboard is sorted by alpha versus S&P 500 in descending order. Ties are resolved by lower regret, then higher confidence, then model id.
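The scoring formulas and leaderboard ordering above can be worked through in a short sketch. The class and function names, the input shapes, the ascending model id tie-break, and the idea that one option id stands in for the S&P 500 are assumptions made for illustration, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Pick:
    model_id: str
    selected_option_id: str
    confidence: float
    cost_usd: Optional[float] = None  # provider cost, when it was logged


@dataclass
class Row:
    model_id: str
    selected_return: float
    alpha_vs_sp500: float
    regret_vs_best: float
    alpha_per_dollar: Optional[float]
    confidence: float


def score_round(
    picks: List[Pick],
    prices: Dict[str, Optional[Tuple[float, float]]],  # id -> (entry, exit), None for unpriced cash
    sp500_option_id: str,  # assumed: the option id used as the S&P 500 benchmark
) -> List[Row]:
    # Unpriced cash is scored as a zero return.
    returns = {oid: 0.0 if p is None else p[1] / p[0] - 1 for oid, p in prices.items()}
    sp500_return = returns[sp500_option_id]
    best_option_return = max(returns.values())

    rows = []
    for pick in picks:
        selected_return = returns[pick.selected_option_id]
        alpha = selected_return - sp500_return
        per_dollar = alpha / pick.cost_usd if pick.cost_usd else None
        rows.append(Row(pick.model_id, selected_return, alpha,
                        best_option_return - selected_return, per_dollar, pick.confidence))

    # Alpha descending, then lower regret, higher confidence, model id (ascending, assumed).
    rows.sort(key=lambda r: (-r.alpha_vs_sp500, r.regret_vs_best, -r.confidence, r.model_id))
    return rows
```

With an S&P 500 proxy moving from 100 to 102 and another option moving from 200 to 210, a model that picked the second option has a selected return of about 5%, alpha of about 3%, and zero regret, while a model that picked unpriced cash has alpha of about -2% and regret of about 5%.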

capitalbench fetch-prices \
  --round rounds/<id> \
  --run-id <run_id> \
  --entry-date YYYY-MM-DD \
  --exit-date YYYY-MM-DD \
  --full-universe

Leaderboard Definitions

Latest Round Leaderboard

Newest resolved official one-shot run only. If the latest round is pending, picks may be shown but performance is withheld.

Cumulative Official Leaderboard

Each model's average official alpha versus S&P 500 across the resolved rounds in which that model has an official result. New models are not backfilled into old rounds.

Cumulative Stability Leaderboard

Average repeated-run alpha and average consistency across resolved stability runs. This view stays separate from official scoring.
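The Cumulative Official Leaderboard described above reduces to a per-model average over the rounds a model actually took part in. The following sketch assumes a hypothetical input of (round_id, model_id, alpha_vs_sp500) tuples for resolved official results, and assumes the view is sorted by average alpha in descending order.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def cumulative_official(
    rows: Iterable[Tuple[str, str, float]],
) -> List[Tuple[str, float, int]]:
    """Return (model_id, average_alpha, rounds_counted), best average first."""
    alphas_by_model = defaultdict(list)
    for _round_id, model_id, alpha in rows:
        alphas_by_model[model_id].append(alpha)

    table = [
        (model_id, sum(alphas) / len(alphas), len(alphas))
        for model_id, alphas in alphas_by_model.items()
    ]
    # A model missing from older rounds is averaged over fewer rounds, not backfilled.
    return sorted(table, key=lambda r: (-r[1], r[0]))
```

Reporting the round count next to the average makes it visible when a newer model's figure rests on fewer rounds than the others.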

Fairness Controls

Freeze manifest.yaml, briefing.md, options.yaml, prompt.md, and any prompt-facing market_data/ artifact before collection.
Publish hashes.json so readers can verify that inputs did not change after model calls.
Do not give one model extra context or a different option universe.
Do not revise a completed round after seeing model choices just to create more varied picks.
Preserve invalid raw submissions instead of deleting them.
Use the same entry and exit price source for every option.
Identify exactly one official one-shot run for public reporting.
Exclude mock, provider-smoke, failed, incomplete, and retrospective runs from public official leaderboards.
Add newly released models only to future rounds. Do not rerun old rounds after outcomes may be knowable.
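A reader who wants to check the frozen inputs against published hashes could do something like the following. The layout of hashes.json is not specified in the source, so this sketch assumes a simple mapping from relative file path to hex SHA-256 digest; the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large price files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_round(round_dir: str, expected: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose hash does not match.

    `expected` maps relative path -> hex digest (assumed layout of hashes.json).
    """
    mismatches = []
    for rel_path, expected_digest in expected.items():
        target = Path(round_dir) / rel_path
        if not target.is_file() or sha256_file(target) != expected_digest:
            mismatches.append(rel_path)
    return mismatches
```

An empty result means every listed input still matches the digest recorded before model calls were collected.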

Auditability And Website Sync

The CLI stores exact provider text in local raw_responses/ sidecars, normalized payloads in submissions/raw/, validated submissions in submissions/parsed/, and SHA-256 hashes and file paths in run_log.jsonl. Raw provider text and private smoke-test output are excluded from the public repo; reports, validation summaries, result CSVs, and public hashes are generated from the sanitized artifacts.

When the Supabase URL and server-side service-role credentials are configured, publish and scoring commands can sync normalized public rows to Supabase for website rendering. The public frontend uses only the Supabase anon key. Service-role keys, provider keys, and market-data keys are never exposed in built assets.

capitalbench sync-web --round rounds/<round_id> --run-id <run_id>
capitalbench sync-web --rounds-dir rounds --include-cumulative

Limitations

CapitalBench measures one prompt, one option set, and one time window at a time. It does not model taxes, transaction costs, slippage, liquidity, dividends, position sizing, or portfolio construction. A one-month result can be dominated by noise, and a round where many models choose the same asset can be fair but low-discrimination as a ranking event.

The framework is useful for reproducible comparison, but it should not be read as proof that a model has durable investing skill. The benchmark output is research, not financial advice.
