PRIVATE CAPITALBENCH EVALS

Benchmark your AI investment system against frontier models.

CapitalBench independently evaluates your model, agent, or workflow using the same frozen market brief, portfolio constraints, and outcome window as leading AI systems.

You receive a private comparative scorecard, consistency analysis, failure-mode review, machine-readable results, and a complete audit packet.

Private by default · No source code or model weights required · No real capital traded

Request a Private Eval · View Sample Report

Launch price: $3,000 USD · Typical completion: 10-14 calendar days · Fixed scope

Same frozen inputs

Every system receives the same approved evaluation packet.

Current comparators

Compare against up to four supported frontier models.

Real outcomes

Official performance resolves using actual market prices.

Consistency testing

Repeated runs reveal unstable or highly variable decisions.

Auditable files

Inputs, outputs, timestamps, hashes, prices, and calculations are preserved.

Decision evidence

Make a consequential model decision with evidence.

A model can produce an impressive investment thesis and still make unstable, overconcentrated, expensive, or poorly calibrated allocation decisions.

A Private CapitalBench Eval turns one specific model configuration into a documented decision record. It shows what the system selected, how consistently it selected it, what risk it took, how it compared with frontier alternatives, and how the allocation performed after the market window closed.

Choose the right model

Compare your current system with frontier alternatives under identical conditions.

Validate a release

Test the exact model, prompt, workflow, and configuration you are considering shipping.

Diagnose behavior

Identify concentration, instability, invalid outputs, excessive risk-taking, weak diversification, or poor cost-performance trade-offs.

Support internal claims

Give product leaders, investors, research teams, and customers evidence that is more defensible than screenshots or selected examples.

Standard sprint scope

One defined system. One controlled evaluation. No moving goalposts.

Scope item	Standard sprint
Client system	One model, agent, workflow, or configuration
Comparator roster	Up to four currently supported frontier models
Official evaluation	One live weekly CapitalBench allocation round
Consistency evaluation	Five separate non-official repeated runs per system
Asset universe	Current CapitalBench public ETF and cash universe
Portfolio format	One to five holdings, fixed allocation increments, 100% total
Information access	Same frozen brief, prompt, universe, and market-data packet
Base evaluation mode	Browsing, tools, and live retrieval disabled
Performance window	Approximately one week
Final output	Report, scorecard, data export, audit bundle, and readout
Confidentiality	Private unless the client approves publication in writing
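
The portfolio format row above can be checked mechanically before a response is accepted. The following is a minimal sketch, not CapitalBench's validator: the function name, the 5% increment, and the tickers in the usage note are hypothetical placeholders, since this page does not state the actual increment size or list the asset universe.

```python
def validate_allocation(weights, universe, max_holdings=5, increment_pct=5):
    """Return a list of rule violations; an empty list means the allocation passes.

    weights: dict mapping ticker -> weight in whole percent.
    increment_pct: hypothetical step size (an assumption, not a published value).
    """
    problems = []
    # Rule: one to five holdings.
    if not 1 <= len(weights) <= max_holdings:
        problems.append(f"expected 1 to {max_holdings} holdings, got {len(weights)}")
    for ticker, pct in weights.items():
        # Rule: every holding must come from the approved universe.
        if ticker not in universe:
            problems.append(f"{ticker} is not in the asset universe")
        # Rule: fixed allocation increments.
        if pct <= 0 or pct % increment_pct != 0:
            problems.append(f"{ticker} weight {pct}% is not a positive multiple of {increment_pct}%")
    # Rule: weights must total exactly 100%.
    total = sum(weights.values())
    if total != 100:
        problems.append(f"weights total {total}%, expected 100%")
    return problems
```

For example, with a placeholder universe of {"SPY", "QQQ", "TLT", "GLD", "CASH"}, the allocation {"SPY": 60, "TLT": 40} returns an empty list, while {"SPY": 55, "XYZ": 42} is flagged for an unknown ticker, an off-increment weight, and a total below 100%.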

Evaluation modes

Controlled mode versus production mode

A sprint may include both modes, but their scores are never mixed into one misleading comparison.

Included in the $3,000 sprint

Controlled Benchmark Mode

The system receives the same frozen external information as every comparator. Browsing, search, code execution, live prices, and external retrieval are disabled where technically possible.

This mode isolates allocation judgment and creates the fairest comparison with the public CapitalBench benchmark.

Optional add-on

Production Workflow Mode

The system runs using its normal retrieval, tools, prompts, or internal workflow.

Production-mode results are reported separately because a tool-enabled agent should not be placed on the same headline ranking as a closed-capability model receiving only frozen inputs.

Evaluation Plan

The rules are approved and frozen before any official run.

Every client receives a short Evaluation Plan before execution.

Field	What is frozen
System under test	Model name, version, endpoint, prompt configuration, and relevant settings
Primary business question	The decision the company wants the evaluation to support
Comparator roster	Exact comparator models and versions
Evaluation mode	Controlled or production workflow
Input packet	Prompt, briefing, market data, asset universe, and schema
Portfolio constraints	Holdings limit, allocation increments, cash rules, and total weight
Official-run policy	Number of official calls and allowed retry conditions
Consistency protocol	Number of repeats and settings
Market window	Entry rule, exit rule, benchmark, and date fallback
Metrics	Performance, consistency, validity, behavior, latency, and cost fields
Success thresholds	Any agreed internal release or acceptance thresholds
Confidentiality	Private, anonymized, or approved for public publication
Retention	Credential, raw-output, and report retention periods

Nothing material is changed after the official outputs have been collected. If a rule must change because of a technical issue, the change is documented and approved before the market outcome is known.

Process

From system access to final report

Step 1 · Before purchase

Fit review

You submit your system type, business question, preferred access method, and desired timing. CapitalBench confirms whether the system can be evaluated fairly.

Client time: approximately 15-20 minutes
Output: fit confirmation and proposed scope

Step 2 · Day 1

Evaluation design

A kickoff call defines the tested configuration, comparators, success criteria, security requirements, and evaluation mode.

Client time: approximately 45 minutes
Output: draft Evaluation Plan

Step 3 · Day 1-2

Technical connection

The system is connected through a customer-hosted endpoint, temporary API credential, or customer-executed runner. A non-scored validation call confirms schema compatibility and access.

Client time: approximately 30-60 minutes with a technical contact
Output: verified connection without consuming the official run

Step 4 · Day 2-3

Freeze and official execution

The prompt, briefing, asset universe, constraints, model roster, and scoring rules are finalized and hashed. Every system receives the approved packet. One official response is collected from each system. Invalid and failed responses remain part of the audit record.

Client time: no live meeting required after approval
Output: frozen allocations and audit hashes
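
As an illustration of how a frozen packet can be hashed so that later changes are detectable, here is a minimal sketch. It assumes SHA-256 over a canonical JSON serialization; this page does not specify the hash algorithm or packet layout CapitalBench uses, and the field names in the usage note are hypothetical.

```python
import hashlib
import json


def hash_packet(packet: dict) -> str:
    """Return a hex SHA-256 digest of a packet, independent of key order."""
    # Sorting keys and fixing separators gives the same bytes for the same content.
    canonical = json.dumps(
        packet, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Hashing a dictionary such as {"prompt": "...", "universe": ["SPY", "CASH"]} before the official run and again at audit time yields the same digest only if nothing in the packet changed in between.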

Step 5 · Day 2-3

Consistency runs

The same task is repeated under the pre-approved consistency protocol. These runs measure whether each system maintains a similar allocation or changes materially from one attempt to the next.

Client time: no live meeting required
Output: consistency and allocation-dispersion dataset
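
One simple way to quantify whether repeated runs "maintain a similar allocation" is a weighted overlap between pairs of portfolios. The sketch below is an assumption for illustration: it defines overlap as the sum of the smaller weight per ticker, which is one common choice and not necessarily the measure CapitalBench applies.

```python
from itertools import combinations


def weighted_overlap(a, b):
    """Percent of capital allocated identically in two portfolios (0 to 100).

    a, b: dicts mapping ticker -> weight in percent, each totalling 100.
    """
    return sum(min(a.get(t, 0), b.get(t, 0)) for t in set(a) | set(b))


def mean_pairwise_overlap(runs):
    """Average weighted overlap across every pair of repeated runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure consistency")
    return sum(weighted_overlap(a, b) for a, b in pairs) / len(pairs)
```

Two identical portfolios score 100, two with no shared holdings score 0, and a set of runs can be compared against whatever threshold the Evaluation Plan approves.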

Step 6 · Approximately seven calendar days

Live outcome window

The official portfolios remain frozen while market prices resolve. Interim mark-to-market data may be shown privately, but no final ranking is issued before the scheduled exit rule.

Client time: none unless a data exception needs approval
Output: final entry and exit price records

Step 7 · One to two business days after the exit price

Scoring and analysis

CapitalBench calculates returns, benchmark differences, comparison with the hindsight-best eligible asset, consistency, concentration, risk, validity, cost, latency, and performance attribution.

Client time: review questions only
Output: final scorecard and report

Step 8 · Final delivery

Executive readout

CapitalBench walks the team through the findings, limitations, failure modes, and recommended next actions.

Client time: one 60-minute readout
Output: readout and seven calendar days of written follow-up

Metrics and scoring

Performance is only one part of the scorecard.

Metric	What it answers
Portfolio return	How did the frozen allocation perform during the defined outcome window?
S&P 500 difference	Did it outperform or underperform the benchmark, and by how much?
CapitalBench Score	How did the return compare with the highest-returning eligible option in the same window?
Regret versus maximum	How much return separated the portfolio from the hindsight-best eligible asset?
Valid-output rate	Did the system reliably return a complete, schema-valid allocation?
Allocation consistency	Did repeated runs produce similar holdings and weights?
Concentration	How much capital was placed in the largest holding or narrowest theme?
Risk appetite	How strongly did the portfolio lean toward growth, momentum, cyclicality, defense, cash, or other risk profiles?
Peer similarity	Did the system follow the frontier-model consensus or take a distinctive position?
Explanation quality	Did its stated rationale match its allocation, identify relevant risks, and avoid unsupported claims?
Latency	How long did the system take to return a valid response?
Cost per valid run	What did each completed response cost when provider usage information was available?
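
The valid-output rate and cost per valid run rows can be sketched as simple ratios over a run log. The record fields ("valid", "cost_usd") are hypothetical, and dividing total spend across all attempts by the number of valid responses is one possible reading of "cost per valid run", not a confirmed CapitalBench definition.

```python
def valid_output_rate(runs):
    """Share of runs that returned a complete, schema-valid allocation."""
    if not runs:
        raise ValueError("no runs recorded")
    return sum(1 for r in runs if r["valid"]) / len(runs)


def cost_per_valid_run(runs):
    """Total spend divided by the number of valid runs.

    Returns None when there are no valid runs or any cost is missing,
    mirroring the "where available" qualifier in the metrics table.
    """
    valid = sum(1 for r in runs if r["valid"])
    costs = [r.get("cost_usd") for r in runs]
    if valid == 0 or any(c is None for c in costs):
        return None
    return sum(costs) / valid
```

Under this reading, failed attempts still count toward cost, so an unreliable system looks more expensive per usable answer.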

Calculation details

Portfolio return = sum of each holding's weight multiplied by its realized return.

S&P 500 difference = portfolio return minus S&P 500 return.

CapitalBench Score = portfolio return relative to the hindsight-best eligible asset return in that same window.

Regret versus maximum = hindsight-best eligible asset return minus portfolio return.

Allocation consistency = similarity of holdings and weights across repeated identical tasks.

Performance, consistency, cost, and latency remain visible as separate fields. CapitalBench does not hide materially different trade-offs inside one unexplained composite grade.
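
The return, benchmark-difference, and regret definitions above translate directly into a few lines of arithmetic. This is a minimal sketch under stated assumptions: weights are fractions summing to 1, the function names are illustrative, and the CapitalBench Score is left out because its exact normalization is not given on this page.

```python
def portfolio_return(weights, returns):
    """Sum of each holding's weight multiplied by its realized return."""
    return sum(w * returns[ticker] for ticker, w in weights.items())


def sp500_difference(port_ret, sp500_ret):
    """Portfolio return minus S&P 500 return over the same window."""
    return port_ret - sp500_ret


def regret_versus_maximum(port_ret, returns):
    """Hindsight-best eligible asset return minus portfolio return."""
    return max(returns.values()) - port_ret
```

With hypothetical realized returns of 2% for one holding at a 60% weight and -1% for another at a 40% weight, the portfolio return is 0.8%. If the best eligible asset returned 3% in that window, the regret is 2.2 percentage points.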

Deliverables

Everything you receive

Approved Evaluation Plan

The exact system, comparator roster, inputs, constraints, retry policy, metrics, timeline, confidentiality setting, and scoring rules.

PDF or Markdown

Executive Evaluation Report

A decision-focused report explaining the result, what it means, major risks, limitations, and recommended next actions.

12-20 page PDF

Comparative Scorecard

Side-by-side results for the client system and every comparator, including performance, consistency, validity, behavior, latency, and cost where available.

PDF table, CSV, and JSON

Performance Attribution

Holding-level explanation of which allocations helped or hurt the official result, with entry and exit prices and comparator context.

Report section and CSV

Consistency and Reliability Analysis

Repeated-run evidence covering portfolio overlap, weight dispersion, frequent holdings, invalid responses, confidence dispersion, and official-run representativeness.

Report section and data export

Behavior Profile

A concise behavior description generated from measurable allocation evidence. Weak samples are labeled "early sample" rather than overclassified.

Report section

Failure-Mode Register

A prioritized record of observed weaknesses, severity, evidence, business impact, recommended action, and retest condition.

Report table

Audit Packet

Evaluation manifest, exact prompt, frozen briefing, asset universe, response schema, hashes, raw response references, parsed submissions, run log, price files, calculations, metadata, retry record, methodology version, and known limitations.

Downloadable ZIP

Machine-Readable Export

Structured outputs for internal dashboards, notebooks, product documentation, or future regression tests.

CSV and JSON

Executive Readout

One 60-minute session plus seven calendar days of email-based clarification. Calculation or transcription errors are corrected, and the report is reissued at no additional charge.

Video call and written follow-up

Failure-mode register example

Each finding includes a decision-useful record, not just a narrative observation.

Finding	Allocation changed materially across identical runs
Severity	High
Evidence	Weighted overlap fell below the approved threshold
Business impact	User experience and output reliability may vary
Recommended action	Add portfolio constraints or revise system prompt
Retest condition	Repeat after configuration update

Sample report preview

See the evidence before you buy

The ungated sample report uses public CapitalBench round CB-2026-05-28-1W and presents it in the private-report format.