Model task: Submit one portfolio

Each model submits one valid portfolio or single-allocation response, depending on the round rules.

Information rule: Same information

No browsing, tools, live retrieval, or extra market data during the model call.

Scoring rule: Best-asset scale

CapitalBench Score compares the portfolio return with the highest-returning scored option in that same window.

Allowed

Internal learned knowledge

Models may use general priors already inside the model. Every model receives the same external context.

Forbidden

Browsing and live retrieval

Tools, search, live prices, and intentional use of post-cutoff facts are disabled or disallowed.

Scored

One public portfolio

The benchmark result uses one valid frozen portfolio per model from the selected public run.

Not scored

Private and invalid attempts

Mock, provider-smoke, retrospective, failed, incomplete, and invalid submissions are excluded.

What CapitalBench Measures

CapitalBench evaluates one narrow question: given the same market information and asset list, what portfolio does each model choose for a one-week or one-month round? It is a benchmark for comparing model portfolios, not a trading system or investment recommendation engine.

Models may use internal learned knowledge and general market priors. They do not have to behave like blank slates. The controlled part is the externally supplied information: every model receives the same prompt, briefing, asset list, and optional mechanical market-data table.

There is no combined weekly-monthly score. One-week and one-month tests stay separate because they measure different time periods.

Test Steps

  1. Create the test directory. Each test lives under rounds/<round_id>/ with its rules, prompt, briefing, asset list, price files, hashes, and run folders.
  2. Prepare the model-facing material. Operators write or import the factual briefing, exact prompt, and saved asset list.
  3. Validate the asset list. Non-cash ETF tickers must pass Tiingo EOD validation before a public test is saved.
  4. Add optional return context. A mechanical full-asset-list return table may be generated from adjusted closes and sorted by option order.
  5. Save the inputs. capitalbench hash-round writes SHA-256 hashes for the test inputs before submissions are collected.
  6. Collect model calls before the deadline. Public runs use one call per model. Consistency checks use repeated calls and a different run_id.
  7. Validate submissions. Raw provider responses are preserved. Only schema-valid submissions move into parsed submissions and scoring.
  8. Fetch prices and score after the test ends. Starting prices can be collected while a test is pending; ending prices are collected only after the test period is over.
  9. Publish normalized artifacts. Public results, hashes, reports, and leaderboard rows are synced to the website read model when Supabase sync is configured.
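
The input-hashing step above can be illustrated with a minimal sketch. It is not the implementation of capitalbench hash-round: the file list in INPUT_FILES and the helper names are assumptions made for illustration, and the real hashes.json layout may differ.

```python
import hashlib
from pathlib import Path

# Hypothetical list of frozen inputs; the real tool may hash a different set.
INPUT_FILES = ["manifest.yaml", "prompt.md", "briefing.md", "options.yaml"]


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_round_inputs(round_dir: Path) -> dict[str, str]:
    """Map each input file that exists in round_dir to its SHA-256 digest."""
    return {
        name: sha256_file(round_dir / name)
        for name in INPUT_FILES
        if (round_dir / name).is_file()
    }


def changed_inputs(round_dir: Path, recorded: dict[str, str]) -> list[str]:
    """List recorded files whose current digest no longer matches."""
    current = hash_round_inputs(round_dir)
    return sorted(
        name for name, digest in recorded.items() if current.get(name) != digest
    )
```

Recording the digests before collection and calling changed_inputs afterwards gives a simple way to confirm that no input file was edited once model calls began.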

Test Files

The local test directory is the source of truth for proof material. Supabase stores normalized published copies for the website, but the canonical record remains the hashed test files.

| Artifact | Audience | Purpose |
| --- | --- | --- |
| manifest.yaml | Public | Round metadata, allocation deadline, start rule, end rule, time period, methodology version, submission format, and portfolio constraints. |
| prompt.md | Model-facing | The exact task instruction sent to every model. |
| briefing.md | Model-facing | Neutral factual context available at decision time. |
| options.yaml | Model-facing | The only valid choices. Each public submission must use option ids from this saved asset list. |
| submission_schema.json | Model-facing when present | The machine-readable response contract for the round's declared submission format. |
| market_data/universe_trailing_returns.* | Model-facing when present | Mechanical 7-day, 30-day, 6-month, and 1-year trailing returns from adjusted close data. |
| hashes.json | Public audit | SHA-256 hashes proving the saved input files used for the round. |
| research/* | Audit, except final briefing | Research manifest, hashes, source fact report, audit report, and final model-facing briefing. |
| runs/<run_id>/* | Audit and scoring | Raw responses, normalized raw payloads, parsed submissions, run logs, validation summaries, and results. |

Research And Briefing Rules

Deep research output is stored as proof material first. The only research artifact copied into the model prompt is research/final_briefing.md, which becomes test-level briefing.md. Market fact reports, source ledgers, and briefing audit reports remain proof-only.

The model-facing briefing should include facts, dates, values, forecasts labeled as forecasts, scheduled catalysts, and source-reported uncertainties. It should not include opinion, interpretation, scenario analysis, "why it matters" commentary, affected-market mapping, recommendations, or option rankings.

capitalbench import-research \
  --round rounds/<id> \
  --market-fact-report market_fact_report.md \
  --audit-report briefing_audit_report.md \
  --final-briefing final_briefing.md \
  --research-cutoff-utc "YYYY-MM-DDTHH:MM:SSZ"

Asset List Policy

Public tests declare an asset-list version in the test manifest and save that exact option file before model calls. New tests use the current configured universe shown on the Asset List page; as of Universe v2.1, that list has 70 choices. Earlier tests stay tied to the exact list the models saw, including v1.5 with 40 choices and v2.0 with 65 choices.

The model sees readable option ids, names, public symbols, asset classes, categories, groups, risk buckets, and exposure descriptions. Internal fields and provider-specific data-fetching fields are kept out of the prompt.

All non-cash options are public tickers that must validate against Tiingo EOD data before the test is saved. CASH has no ticker and is skipped during Tiingo validation. Because prompt context may include 7-day, 30-day, 6-month, and 1-year trailing returns, the pre-test data check should cover the full lookback window needed for the test.

capitalbench validate-universe \
  --round rounds/<id> \
  --start-date YYYY-MM-DD \
  --end-date YYYY-MM-DD
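
To show why the validation window has to reach back a full year, here is a rough sketch of trailing returns computed from adjusted closes. The source does not define the lookback arithmetic, so the calendar-day approximations in LOOKBACK_DAYS, the "last close on or before the date" rule, and the function name are all assumptions.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Assumed calendar-day approximations of the four trailing windows.
LOOKBACK_DAYS = {"7d": 7, "30d": 30, "6m": 182, "1y": 365}


def trailing_returns(closes, as_of):
    """Compute trailing returns for one option.

    closes: list of (date, adjusted_close) tuples sorted by date ascending.
    Returns a dict of window label -> return, or None when history is too short.
    """
    dates = [day for day, _ in closes]

    def price_on_or_before(day):
        # Index of the last row dated on or before `day`, if any.
        idx = bisect_right(dates, day) - 1
        return closes[idx][1] if idx >= 0 else None

    end_price = price_on_or_before(as_of)
    result = {}
    for label, days in LOOKBACK_DAYS.items():
        start_price = price_on_or_before(as_of - timedelta(days=days))
        if end_price is None or start_price is None:
            result[label] = None  # not enough history for this window
        else:
            result[label] = end_price / start_price - 1
    return result
```

A window comes back as None whenever the price history starts too late, which is the situation a full-lookback validation run is meant to catch before the test is saved.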

Submission Format

Each model must return one JSON or YAML object matching the round's declared submission format. Invalid raw responses remain preserved, but they are not scored and cannot enter a public benchmark result.

Fields: round_id, model_id, provider, mode: closed_capability, run_type, replicate_index, replicate_count, is_official_score, confidence, rationale_summary, key_risks

Single Allocation

Requires one selected_option_id from options.yaml. Legacy multi-select fields are invalid.

Portfolio

Requires a portfolio array plus portfolio_rationale. The default protocol allows 1 to 5 holdings, 5% increments, and exactly 100% total allocation.

Round Constraints

manifest.yaml freezes the exact submission format, holding limits, allocation increment, and cash or benchmark allowance before model calls begin.

Public runs require replicate_index: 1, replicate_count: 1, and is_official_score: true. Consistency runs use repeated replicate indexes for each model and require is_official_score: false.
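
The default portfolio constraints and the replicate flags can be checked with a few lines of code. The sketch below is illustrative only: the holding keys option_id and weight_pct and both function names are hypothetical, and the real contract is whatever submission_schema.json and manifest.yaml declare for the round.

```python
def validate_portfolio(portfolio, valid_option_ids, max_holdings=5, increment=5):
    """Return a list of error strings; an empty list means the portfolio passes.

    Each holding is assumed to look like {"option_id": str, "weight_pct": int}.
    """
    errors = []
    if not 1 <= len(portfolio) <= max_holdings:
        errors.append(f"expected 1 to {max_holdings} holdings, got {len(portfolio)}")
    total = 0
    for holding in portfolio:
        option_id = holding.get("option_id")
        weight = holding.get("weight_pct")
        if option_id not in valid_option_ids:
            errors.append(f"unknown option id: {option_id!r}")
        is_int = isinstance(weight, int) and not isinstance(weight, bool)
        if not is_int or weight <= 0 or weight % increment != 0:
            errors.append(f"weight for {option_id!r} is not a positive multiple of {increment}")
            continue
        total += weight
    if total != 100:
        errors.append(f"weights sum to {total}, expected 100")
    return errors


def replicate_flags_ok(submission, public_run):
    """Check the replicate fields against the public or consistency rule."""
    if public_run:
        return (
            submission.get("replicate_index") == 1
            and submission.get("replicate_count") == 1
            and submission.get("is_official_score") is True
        )
    return submission.get("is_official_score") is False
```

For example, a 60/40 split between two listed options passes, while a 63/37 split fails the 5% increment rule even though it sums to 100.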

Model Execution Policy

Public run

One call per model

The public result is one valid provider call in the selected public run, scored under the round's declared submission format.

Consistency

Repeated calls

Consistency runs ask the same model the same question multiple times to measure allocation consistency.

  • Every model receives the same prompt, briefing, asset list, and saved market-data artifact.
  • Temperature is set to 0 where supported.
  • Tools, browsing, web search, code execution, and external retrieval are disabled at the API level where supported.
  • Live market data and intentional use of post-cutoff facts, prices, news, or events are not allowed.
  • Reasoning or thinking is set to the lowest provider-supported setting that still allows valid structured output.
  • Hidden reasoning tokens are recorded when exposed but are not treated as directly comparable across providers.
  • Provider token usage, latency, and cost are logged when available.

capitalbench run-round \
  --round rounds/<id> \
  --models configs/models.local.yaml \
  --run-id official-YYYYMMDD \
  --run-type official \
  --allow-real-api-calls

Retry Policy

A public retry is allowed only when no valid allocation can be parsed because of infrastructure or format failure: malformed JSON, truncated response, provider transport or API failure, or schema output failure.

A retry is not allowed because of the selected asset, confidence value, or rationale quality. Failed raw responses must remain in the run artifacts and must stay ineligible for public scoring.
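
The retry rule reduces to a small decision function. This is a sketch under assumptions, not the CLI's actual logic: the FailureKind enum and the retry_allowed name are invented here to mirror the four failure causes listed above.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    """Failure causes the retry policy names as retry-eligible."""

    MALFORMED_JSON = "malformed_json"
    TRUNCATED_RESPONSE = "truncated_response"
    PROVIDER_FAILURE = "provider_failure"  # transport or API failure
    SCHEMA_FAILURE = "schema_failure"


def retry_allowed(has_valid_allocation: bool, failure: Optional[FailureKind]) -> bool:
    """Return True only when nothing valid was parsed and the cause is eligible."""
    if has_valid_allocation:
        # A parsed, valid allocation stands; the selected asset, confidence
        # value, and rationale quality are never grounds for a retry.
        return False
    return isinstance(failure, FailureKind)
```

Because the function never inspects the allocation itself, it cannot be used to reroll a result that an operator simply dislikes.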

Pricing And Scoring

CapitalBench scores valid submissions against local price files. Adjusted close is preferred. If only close is supplied, scoring may continue but records a warning in the result artifacts. Tiingo fetching is strict about dates and requires rows matching the manifest start and end dates. Final automated scoring refreshes both start and end adjusted closes together after the round window ends, so ETF distribution adjustments are on one post-window price basis.

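| Metric | Formula |
| --- | --- |
| Single-allocation return | ending_price / starting_price - 1 |
| Portfolio return | sum(weight * option_return) |
| Return versus S&P 500 | portfolio_return - sp500_return |
| Maximum possible return | max(scored_universe_returns) |
| CapitalBench Score | 100 * portfolio_return / oracle_return |
| Regret versus max possible | max_possible_return - portfolio_return |
| S&P difference per dollar | alpha_vs_sp500 / cost_usd |

In these formulas, oracle_return and max_possible_return both refer to the maximum possible return, and alpha_vs_sp500 is the return versus S&P 500.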

Cash is treated as a zero return unless cash prices are explicitly supplied. Portfolio tests use the weighted realized return of the submitted portfolio. Latest-test tables show raw portfolio return, S&P 500 return, Portfolio Minus S&P 500, and regret. Combined track scorecards use CapitalBench Score, where 100 means matching the highest-returning scored option, 0 means no net return, and negative values preserve losses. Overall scores divide summed model returns by summed oracle returns.
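
A short worked sketch of the scoring arithmetic follows. The function and key names are illustrative rather than taken from the CapitalBench code, weights are assumed to be fractions summing to 1, and returning None when the maximum possible return is zero is an assumption, since the source does not say how that case is handled.

```python
def option_return(starting_price, ending_price):
    """Simple return of one option over the round window."""
    return ending_price / starting_price - 1


def score_portfolio(weights, returns, sp500_return, cost_usd=None):
    """Score one portfolio.

    weights: option id -> fraction of the portfolio (fractions sum to 1).
    returns: option id -> realized return for every scored option.
    """
    portfolio_return = sum(w * returns[oid] for oid, w in weights.items())
    max_possible = max(returns.values())
    alpha = portfolio_return - sp500_return
    result = {
        "portfolio_return": portfolio_return,
        "return_vs_sp500": alpha,
        "max_possible_return": max_possible,
        "regret": max_possible - portfolio_return,
        # Assumption: the score is left undefined when the best option returned 0.
        "capitalbench_score": (
            100 * portfolio_return / max_possible if max_possible != 0 else None
        ),
    }
    if cost_usd:
        result["sp500_difference_per_dollar"] = alpha / cost_usd
    return result


# Cash carries a zero return unless cash prices are supplied.
returns = {"A": option_return(100.0, 110.0), "B": option_return(50.0, 49.0), "CASH": 0.0}
result = score_portfolio({"A": 0.5, "CASH": 0.5}, returns, sp500_return=0.03)
```

In this example half the portfolio sits in the best option and half in cash, so the portfolio earns about half of the maximum possible return and the score comes out at roughly 50.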

Benchmark Comparison Sets are the fair headline ranking system. A set has a fixed model roster and includes only resolved rounds where every model in that roster has an official result. If one model misses a round, that round is excluded from the set for everyone.

All-available history remains useful context, but it is not the headline fair ranking when model histories differ. Weekly comparison sets become the Current Benchmark at 6 shared resolved rounds. Monthly comparison sets become the Current Benchmark at 3 shared resolved rounds.
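
The shared-round rule can be sketched as a filter over resolved rounds. The data shapes and function names below are assumptions for illustration, and reading "at 6 shared resolved rounds" as "6 or more" is an interpretation of the text above.

```python
# Shared resolved rounds needed before a set can become the Current Benchmark.
QUALIFYING_ROUNDS = {"weekly": 6, "monthly": 3}


def shared_rounds(roster, official_results, resolved_rounds):
    """Keep only the resolved rounds where every roster model has an official result.

    official_results: model id -> set of round ids with an official result.
    """
    return [
        round_id
        for round_id in resolved_rounds
        if all(round_id in official_results.get(model, set()) for model in roster)
    ]


def qualifies_as_current(track, shared):
    """Check whether a comparison set has enough shared rounds for its track."""
    return len(shared) >= QUALIFYING_ROUNDS[track]
```

If one roster model is missing from a round, that round drops out of the list for every model, which keeps each model in the set scored on exactly the same windows.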

Interim tracking can be generated from reusable price snapshots after the start date and before the final end date. These rows are shown on round pages as provisional progress versus S&P 500 only; they do not finish a round, populate the final benchmark result, or create rank/regret metrics.

capitalbench fetch-prices \
  --round rounds/<id> \
  --run-id <run_id> \
  --entry-date YYYY-MM-DD \
  --exit-date YYYY-MM-DD \
  --full-universe

Benchmark Comparison Sets

Comparison Set

A fixed model roster scored only on shared rounds completed by every model in the set.

Current Benchmark

The newest qualified comparison set for a track. Weekly qualifies at 6 shared rounds; monthly qualifies at 3.

Excluded For Fairness

If a set model misses a resolved round, that round is excluded from that set for every model.

Benchmark Result Definitions

Latest Round Result

Newest completed public run only. If the latest round is pending, model portfolios may be shown but final performance is withheld.

All Available History

Context across all resolved rounds in a track. This view can include unequal model histories.

Consistency Result

Average repeated-run score and average consistency across completed consistency runs. This view stays separate from public scoring.

Fairness Controls

  • Freeze manifest.yaml, briefing.md, options.yaml, prompt.md, and any prompt-facing market_data/ artifact before collection.
  • Publish hashes.json so readers can verify that inputs did not change after model calls.
  • Do not give one model extra context or a different asset list.
  • Do not revise a completed round after seeing model choices just to create more varied allocations.
  • Preserve invalid raw submissions instead of deleting them.
  • Use the same starting and ending price source for every option.
  • Identify exactly one public model-call run for each round's public benchmark result.
  • Exclude mock, provider-smoke, failed, incomplete, and retrospective runs from public benchmark results.
  • Add newly released models only to future rounds. Do not rerun old rounds after outcomes may be knowable.

Audit Packets And Website Sync

The CLI stores exact provider text in local raw_responses/ sidecars, normalized payloads in submissions/raw/, validated submissions in submissions/parsed/, and SHA-256 paths in run_log.jsonl. Raw provider text and private smoke-test output are excluded from the public repo; reports, validation summaries, result CSVs, and public hashes are generated from the sanitized artifacts.

When the Supabase URL and server-side service-role credentials are configured, publish and scoring commands can sync normalized public rows to Supabase for website rendering. The public frontend uses only the Supabase anon key. Service-role keys, provider keys, and market-data keys are never exposed in built assets.

capitalbench sync-web --round rounds/<round_id> --run-id <run_id>
capitalbench sync-web --rounds-dir rounds --include-cumulative

Limitations

CapitalBench measures one prompt, one asset list, and one time window at a time. It does not model taxes, transaction costs, slippage, liquidity, dividends, or real-world execution constraints. A one-week or one-month result can be dominated by noise, and a test where many models choose the same asset can be fair but low-discrimination as a ranking event.

The framework is useful for reproducible comparison, but it should not be read as proof that a model has durable investing skill. The benchmark output is research, not financial advice.