Internal learned knowledge
Models may use general priors already inside the model. Every model receives the same external context.
Methodology
Every model gets the same market brief and asset list. Each model submits one portfolio. Market prices decide the score after one week or one month.
Each model submits one valid portfolio or single-allocation response, depending on the round rules.
No browsing, tools, live retrieval, or extra market data during the model call.
CapitalBench Score compares the portfolio return with the highest-returning scored option in that same window.
Tools, search, live prices, and intentional use of post-cutoff facts are disabled or disallowed.
The benchmark result uses one valid frozen portfolio per model from the selected public run.
Mock, provider-smoke, retrospective, failed, incomplete, and invalid submissions are excluded.
CapitalBench evaluates one narrow question: given the same market information and asset list, what portfolio does each model choose for a one-week or one-month round? It is a benchmark for comparing model portfolios, not a trading system or investment recommendation engine.
Models may use internal learned knowledge and general market priors. They do not have to behave like blank slates. The controlled part is the externally supplied information: every model receives the same prompt, briefing, asset list, and optional mechanical market-data table.
rounds/<round_id>/ with its rules, prompt, briefing, asset list, price files, hashes, and run folders.
capitalbench hash-round writes SHA-256 hashes for the test inputs before submissions are collected.
run_id.
The local test directory is the proof source of truth. Supabase stores normalized published copies for the website, but the canonical record remains the hashed test files.
| Artifact | Audience | Purpose |
|---|---|---|
| manifest.yaml | Public | Round metadata, allocation deadline, start rule, end rule, time period, methodology version, submission format, and portfolio constraints. |
| prompt.md | Model-facing | The exact task instruction sent to every model. |
| briefing.md | Model-facing | Neutral factual context available at decision time. |
| options.yaml | Model-facing | The only valid choices. Each public submission must use option ids from this saved asset list. |
| submission_schema.json | Model-facing when present | The machine-readable response contract for the round's declared submission format. |
| market_data/universe_trailing_returns.* | Model-facing when present | Mechanical 7-day, 30-day, 6-month, and 1-year trailing returns from adjusted close data. |
| hashes.json | Public audit | SHA-256 hashes proving the saved input files used for the round. |
| research/* | Audit, except final briefing | Research manifest, hashes, source fact report, audit report, and final model-facing briefing. |
| runs/<run_id>/* | Audit and scoring | Raw responses, normalized raw payloads, parsed submissions, run logs, validation summaries, and results. |
Deep research output is stored as proof material first. The only research artifact copied into the model prompt is research/final_briefing.md, which becomes the test-level briefing.md.
Market fact reports, source ledgers, and briefing audit reports remain proof-only.
The model-facing briefing should include facts, dates, values, forecasts labeled as forecasts, scheduled catalysts, and source-reported uncertainties. It should not include opinion, interpretation, scenario analysis, "why it matters" commentary, affected-market mapping, recommendations, or option rankings.

    capitalbench import-research \
      --round rounds/<id> \
      --market-fact-report market_fact_report.md \
      --audit-report briefing_audit_report.md \
      --final-briefing final_briefing.md \
      --research-cutoff-utc "YYYY-MM-DDTHH:MM:SSZ"

Public tests declare an asset-list version in the test manifest and save that exact option file before model calls. New tests use the current configured universe shown on the Asset List page; as of Universe v2.1, that list has 70 choices. Earlier tests stay tied to the exact list the models saw, including v1.5 with 40 choices and v2.0 with 65 choices.
The model sees readable option ids, names, public symbols, asset classes, categories, groups, risk buckets, and exposure descriptions. Internal fields and provider-specific data-fetching fields are kept out of the prompt.
All non-cash options are public tickers that must validate against Tiingo EOD data before the test is saved. CASH has no ticker and is skipped during Tiingo validation. Because prompt context may include 7-day, 30-day, 6-month, and 1-year trailing returns, the pre-test data check should cover the full lookback window needed for the test.

    capitalbench validate-universe \
      --round rounds/<id> \
      --start-date YYYY-MM-DD \
      --end-date YYYY-MM-DD

Each model must return one JSON or YAML object matching the round's declared submission format. Invalid raw responses remain preserved, but they are not scored and cannot enter a public benchmark result.
Submission fields: round_id, model_id, provider, mode: closed_capability, run_type, replicate_index, replicate_count, is_official_score, confidence, rationale_summary, and key_risks.
A single-allocation submission requires one selected_option_id from options.yaml. Legacy multi-select fields are invalid.
A portfolio submission requires a portfolio array plus portfolio_rationale. The default protocol allows 1 to 5 holdings, 5% increments, and exactly 100% total allocation.
manifest.yaml freezes the exact submission format, holding limits, allocation increment, and cash or benchmark allowance before model calls begin.
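The default portfolio constraints can be checked mechanically. The following is a minimal sketch, not the project's validator: the function name, the rules dictionary, and the `option_id` and `weight_pct` keys are all hypothetical, and a real round would take its limits from manifest.yaml and its field names from the round's schema.

```python
from decimal import Decimal

# Default limits quoted in the text. A real round would take these from
# manifest.yaml; this dict layout is a hypothetical stand-in.
DEFAULT_RULES = {"min_holdings": 1, "max_holdings": 5, "increment_pct": 5, "total_pct": 100}


def portfolio_problems(portfolio, valid_option_ids, rules=DEFAULT_RULES):
    """Return a list of rule violations; an empty list means the portfolio passes.

    `portfolio` is assumed to be a list of {"option_id", "weight_pct"} dicts.
    Those key names are illustrative and may differ from the real schema.
    """
    problems = []
    if not rules["min_holdings"] <= len(portfolio) <= rules["max_holdings"]:
        problems.append(
            f"expected {rules['min_holdings']} to {rules['max_holdings']} holdings, "
            f"got {len(portfolio)}"
        )
    total = Decimal(0)
    for holding in portfolio:
        option_id = holding["option_id"]
        # Decimal avoids float rounding when checking increments and the total.
        weight = Decimal(str(holding["weight_pct"]))
        if option_id not in valid_option_ids:
            problems.append(f"{option_id}: not in the saved asset list")
        # Treating non-positive weights as invalid is an assumption of this sketch.
        if weight <= 0 or weight % rules["increment_pct"] != 0:
            problems.append(
                f"{option_id}: weight {weight}% is not a positive multiple "
                f"of {rules['increment_pct']}%"
            )
        total += weight
    if total != rules["total_pct"]:
        problems.append(f"weights sum to {total}%, expected {rules['total_pct']}%")
    return problems
```

Returning a list of problems rather than raising on the first one keeps the full validation summary available for the run artifacts.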
Public runs require replicate_index: 1, replicate_count: 1, and is_official_score: true. Consistency runs use repeated replicate indexes for each model and require is_official_score: false.
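The replicate rules above amount to a small consistency check on each submission. A minimal sketch, assuming the submission has already been parsed into a dictionary; the function name and the "consistency" run-type label are hypothetical ("official" appears in the run-round command):

```python
def replicate_flag_problems(run_type, submission):
    """Check replicate fields against the run type; an empty list means consistent."""
    problems = []
    index = submission.get("replicate_index")
    count = submission.get("replicate_count")
    official = submission.get("is_official_score")

    if run_type == "official":
        if index != 1 or count != 1:
            problems.append("public runs require replicate_index 1 and replicate_count 1")
        if official is not True:
            problems.append("public runs require is_official_score: true")
    elif run_type == "consistency":  # hypothetical label for consistency runs
        if official is not False:
            problems.append("consistency runs require is_official_score: false")
    else:
        problems.append(f"unknown run_type: {run_type}")
    return problems
```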
The public result is one valid provider call in the selected public run, scored under the round's declared submission format.
Consistency runs ask the same model the same question multiple times to measure allocation consistency.
0 where supported.

    capitalbench run-round \
      --round rounds/<id> \
      --models configs/models.local.yaml \
      --run-id official-YYYYMMDD \
      --run-type official \
      --allow-real-api-calls

A public retry is allowed only when no valid allocation can be parsed because of infrastructure or format failure: malformed JSON, truncated response, provider transport or API failure, or schema output failure.
A retry is not allowed because of the selected asset, confidence value, or rationale quality. Failed raw responses must remain in the run artifacts and must stay ineligible for public scoring.
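The retry rule reduces to one question: was a valid allocation parsed? A minimal sketch of that decision, where the failure labels and the function name are hypothetical names for the cases listed above:

```python
# Hypothetical labels for the failure cases the text lists as retry-eligible.
RETRYABLE_FAILURES = {
    "malformed_json",
    "truncated_response",
    "provider_failure",  # transport or API failure
    "schema_failure",
}


def retry_allowed(has_valid_allocation, failure_kind=None):
    """Return True only for infrastructure or format failures with no valid allocation."""
    if has_valid_allocation:
        # The chosen asset, confidence value, or rationale quality never justifies a retry.
        return False
    return failure_kind in RETRYABLE_FAILURES
```

Under this rule a valid but disappointing answer is always final, which keeps retries from becoming a way to select for better-looking allocations.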
CapitalBench scores valid submissions against local price files. Adjusted close is preferred. If only close is supplied, scoring may continue but records a warning in the result artifacts. Tiingo fetching is strict about dates and requires rows matching the manifest start and end dates. Final automated scoring refreshes both start and end adjusted closes together after the round window ends, so ETF distribution adjustments are on one post-window price basis.
| Metric | Formula |
|---|---|
| Option return | ending_price / starting_price - 1 |
| Portfolio return | sum(weight * option_return) |
| Portfolio Minus S&P 500 (alpha_vs_sp500) | portfolio_return - sp500_return |
| Oracle return (maximum possible return) | max(scored_universe_returns) |
| CapitalBench Score | 100 * portfolio_return / oracle_return |
| Regret | max_possible_return - portfolio_return |
| Alpha per dollar of cost | alpha_vs_sp500 / cost_usd |

Cash is treated as a zero return unless cash prices are explicitly supplied. Portfolio tests use the weighted realized return of the submitted portfolio. Latest-test tables show raw portfolio return, S&P 500 return, Portfolio Minus S&P 500, and regret. Combined track scorecards use CapitalBench Score, where 100 means matching the highest-returning scored option, 0 means no net return, and negative values preserve losses. Overall scores divide summed model returns by summed oracle returns.
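The scoring formulas above can be worked through in a few lines. This is a minimal sketch rather than the project's scorer: the function names and dictionary layout are hypothetical, weights are fractions that sum to 1, and returning None when the oracle return is not positive is an assumption, since the text does not say how that case is handled. The factor of 100 in the overall score is also assumed, to keep it on the same scale as the per-round score.

```python
def option_return(starting_price, ending_price):
    """Simple return of one option over the round window."""
    return ending_price / starting_price - 1


def score_portfolio(weights, option_returns, sp500_return):
    """Score one portfolio.

    weights:        option id -> fraction of the portfolio (fractions sum to 1)
    option_returns: option id -> realized return for every scored option
    """
    portfolio_return = sum(w * option_returns[oid] for oid, w in weights.items())
    oracle_return = max(option_returns.values())  # highest-returning scored option
    # Assumption: the score is left undefined when the oracle return is not positive.
    score = 100 * portfolio_return / oracle_return if oracle_return > 0 else None
    return {
        "portfolio_return": portfolio_return,
        "alpha_vs_sp500": portfolio_return - sp500_return,
        "oracle_return": oracle_return,
        "capitalbench_score": score,
        "regret": oracle_return - portfolio_return,
    }


def overall_score(portfolio_returns, oracle_returns):
    """Summed model returns divided by summed oracle returns, on a 0-100 scale."""
    total_oracle = sum(oracle_returns)
    if total_oracle <= 0:
        return None
    return 100 * sum(portfolio_returns) / total_oracle


# Illustrative prices only. Cash is scored as a zero return.
returns = {
    "A": option_return(100.0, 110.0),  # about +10%
    "B": option_return(50.0, 49.0),    # about -2%
    "CASH": 0.0,
}
result = score_portfolio({"A": 0.5, "CASH": 0.5}, returns, sp500_return=0.02)
```

With these illustrative numbers, half the portfolio earns about 10% and half earns nothing, so the portfolio return is about 5%, the score is about 50, and the regret is about 5 percentage points.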
Benchmark Comparison Sets are the fair headline ranking system. A set has a fixed model roster and includes only resolved rounds where every model in that roster has an official result. If one model misses a round, that round is excluded from the set for everyone.
All-available history remains useful context, but it is not the headline fair ranking when model histories differ. Weekly comparison sets become the Current Benchmark at 6 shared resolved rounds. Monthly comparison sets become the Current Benchmark at 3 shared resolved rounds.
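The shared-round rule and the qualification thresholds can be expressed as a set-inclusion filter. A minimal sketch, assuming official results for resolved rounds are available as a mapping from round id to the set of model ids with an official result; the function names and that data layout are hypothetical:

```python
# Shared resolved rounds needed before a set can become the Current Benchmark.
ROUND_THRESHOLDS = {"weekly": 6, "monthly": 3}


def shared_rounds(roster, official_results):
    """Return the round ids where every model in the roster has an official result.

    official_results: round id -> set of model ids with an official result,
    covering resolved rounds only.
    """
    required = set(roster)
    return [rid for rid, models in official_results.items() if required <= models]


def meets_round_threshold(track, roster, official_results):
    """True when the roster has enough shared resolved rounds for its track."""
    return len(shared_rounds(roster, official_results)) >= ROUND_THRESHOLDS[track]
```

Because a round counts only when the whole roster is present, adding a model with a short history shrinks the set's shared rounds for every model in it.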
Interim tracking can be generated from reusable price snapshots after the start date and before the final end date. These rows are shown on round pages as provisional progress versus S&P 500 only; they do not finish a round, populate the final benchmark result, or create rank/regret metrics.

    capitalbench fetch-prices \
      --round rounds/<id> \
      --run-id <run_id> \
      --entry-date YYYY-MM-DD \
      --exit-date YYYY-MM-DD \
      --full-universe

A Benchmark Comparison Set is a fixed model roster scored only on shared rounds completed by every model in the set.
The Current Benchmark is the newest qualified comparison set for a track. Weekly qualifies at 6 shared rounds; monthly qualifies at 3.
If a set model misses a resolved round, that round is excluded from that set for every model.
Latest-test tables cover the newest completed public run only. If the latest round is pending, model portfolios may be shown but final performance is withheld.
All-available history gives context across all resolved rounds in a track. This view can include unequal model histories.
The consistency view reports the average repeated-run score and average consistency across completed consistency runs. This view stays separate from public scoring.
manifest.yaml, briefing.md, options.yaml, prompt.md, and any prompt-facing market_data/ artifact before collection.
hashes.json so readers can verify that inputs did not change after model calls.
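A reader could check saved inputs against recorded hashes roughly as follows. This is a minimal sketch using Python's standard hashlib; it assumes the recorded hashes are available as a flat mapping from relative file path to SHA-256 hex digest, which may not match the actual layout of hashes.json, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_inputs(round_dir, recorded_hashes):
    """Return the relative paths that are missing or whose digest no longer matches.

    recorded_hashes: relative path -> expected SHA-256 hex digest (assumed layout).
    """
    changed = []
    for relative_path, expected in recorded_hashes.items():
        target = Path(round_dir) / relative_path
        if not target.is_file() or sha256_file(target) != expected:
            changed.append(relative_path)
    return changed
```

An empty result means every listed input still matches its recorded digest; any returned path points at a file that was edited, replaced, or removed after hashing.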
The CLI stores exact provider text in local raw_responses/ sidecars, normalized payloads in submissions/raw/, validated submissions in submissions/parsed/, and SHA-256 paths in run_log.jsonl. Raw provider text and private smoke-test output are excluded from the public repo; reports, validation summaries, result CSVs, and public hashes are generated from the sanitized artifacts.
When the Supabase URL and server-side service-role credentials are configured, publish and scoring commands can sync normalized public rows to Supabase for website rendering. The public frontend uses only the Supabase anon key. Service-role keys, provider keys, and market-data keys are never exposed in built assets.

    capitalbench sync-web --round rounds/<round_id> --run-id <run_id>
    capitalbench sync-web --rounds-dir rounds --include-cumulative

CapitalBench measures one prompt, one asset list, and one time window at a time. It does not model taxes, transaction costs, slippage, liquidity, dividends, or real-world execution constraints. A one-week or one-month result can be dominated by noise, and a test where many models choose the same asset can be fair but low-discrimination as a ranking event.
The framework is useful for reproducible comparison, but it should not be read as proof that a model has durable investing skill. The benchmark output is research, not financial advice.