PRIVATE CAPITALBENCH EVALS

Benchmark your AI investment system against frontier models.

CapitalBench independently evaluates your model, agent, or workflow using the same frozen market brief, portfolio constraints, and outcome window as leading AI systems.

You receive a private comparative scorecard, consistency analysis, failure-mode review, machine-readable results, and a complete audit packet.

Private by default · No source code or model weights required · No real capital traded

Launch price: $3,000 USD · Typical completion: 10-14 calendar days · Fixed scope

Proof points

Same frozen inputs

Every system receives the same approved evaluation packet.

Current comparators

Compare against up to four supported frontier models.

Real outcomes

Official performance resolves using actual market prices.

Consistency testing

Repeated runs reveal unstable or highly variable decisions.

Auditable files

Inputs, outputs, timestamps, hashes, prices, and calculations are preserved.

Decision evidence

Make a consequential model decision with evidence.

A model can produce an impressive investment thesis and still make unstable, overconcentrated, expensive, or poorly calibrated allocation decisions.

A Private CapitalBench Eval turns one specific model configuration into a documented decision record. It shows what the system selected, how consistently it selected it, what risk it took, how it compared with frontier alternatives, and how the allocation performed after the market window closed.

Choose the right model

Compare your current system with frontier alternatives under identical conditions.

Validate a release

Test the exact model, prompt, workflow, and configuration you are considering shipping.

Diagnose behavior

Identify concentration, instability, invalid outputs, excessive risk-taking, weak diversification, or poor cost-performance trade-offs.

Support internal claims

Give product leaders, investors, research teams, and customers evidence that is more defensible than screenshots or selected examples.

Standard sprint scope

One defined system. One controlled evaluation. No moving goalposts.

Scope item | Standard sprint
Client system | One model, agent, workflow, or configuration
Comparator roster | Up to four currently supported frontier models
Official evaluation | One live weekly CapitalBench allocation round
Consistency evaluation | Five separate non-official repeated runs per system
Asset universe | Current CapitalBench public ETF and cash universe
Portfolio format | One to five holdings, fixed allocation increments, 100% total
Information access | Same frozen brief, prompt, universe, and market-data packet
Base evaluation mode | Browsing, tools, and live retrieval disabled
Performance window | Approximately one week
Final output | Report, scorecard, data export, audit bundle, and readout
Confidentiality | Private unless the client approves publication in writing

Evaluation modes

Controlled mode versus production mode

A sprint may include both modes, but their scores are never mixed into one misleading comparison.

Included in the $3,000 sprint

Controlled Benchmark Mode

The system receives the same frozen external information as every comparator. Browsing, search, code execution, live prices, and external retrieval are disabled where technically possible.

This mode isolates allocation judgment and creates the fairest comparison with the public CapitalBench benchmark.

Optional add-on

Production Workflow Mode

The system runs using its normal retrieval, tools, prompts, or internal workflow.

Production-mode results are reported separately because a tool-enabled agent should not be placed on the same headline ranking as a closed-capability model receiving only frozen inputs.

Evaluation Plan

The rules are approved and frozen before any official run.

Every buyer receives a short Evaluation Plan before execution.

Field | What is frozen
System under test | Model name, version, endpoint, prompt configuration, and relevant settings
Primary business question | The decision the company wants the evaluation to support
Comparator roster | Exact comparator models and versions
Evaluation mode | Controlled or production workflow
Input packet | Prompt, briefing, market data, asset universe, and schema
Portfolio constraints | Holdings limit, allocation increments, cash rules, and total weight
Official-run policy | Number of official calls and allowed retry conditions
Consistency protocol | Number of repeats and settings
Market window | Entry rule, exit rule, benchmark, and date fallback
Metrics | Performance, consistency, validity, behavior, latency, and cost fields
Success thresholds | Any agreed internal release or acceptance thresholds
Confidentiality | Private, anonymized, or approved for public publication
Retention | Credential, raw-output, and report retention periods

Nothing material is changed after the official outputs have been collected. If a rule must change because of a technical issue, the change is documented and approved before the market outcome is known.

Process

From system access to final report

1 Before purchase

Fit review

You submit your system type, business question, preferred access method, and desired timing. CapitalBench confirms whether the system can be evaluated fairly.

Client time: approximately 15-20 minutes
Output: fit confirmation and proposed scope

2 Day 1

Evaluation design

A kickoff call defines the tested configuration, comparators, success criteria, security requirements, and evaluation mode.

Client time: approximately 45 minutes
Output: draft Evaluation Plan

3 Day 1-2

Technical connection

The system is connected through a customer-hosted endpoint, temporary API credential, or customer-executed runner. A non-scored validation call confirms schema compatibility and access.

Client time: approximately 30-60 minutes with a technical contact
Output: verified connection without consuming the official run

4 Day 2-3

Freeze and official execution

The prompt, briefing, asset universe, constraints, model roster, and scoring rules are finalized and hashed. Every system receives the approved packet. One official response is collected from each system. Invalid and failed responses remain part of the audit record.
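As an illustration of how a frozen packet can be hashed so later tampering is detectable, here is a minimal sketch. It assumes SHA-256 and a JSON manifest; the file names, packet contents, and the `build_manifest` helper are hypothetical and are not taken from the CapitalBench audit format.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(packet: Dict[str, bytes]) -> Dict[str, str]:
    """Hash each packet item, then hash the sorted digest map as a whole."""
    digests = {name: sha256_hex(content) for name, content in sorted(packet.items())}
    canonical = json.dumps(digests, sort_keys=True).encode("utf-8")
    return {**digests, "_packet": sha256_hex(canonical)}


# Hypothetical packet contents, for illustration only.
packet = {
    "prompt.txt": b"Allocate 100% across the universe.",
    "universe.json": b'["ETF_A", "ETF_B", "CASH"]',
}
manifest = build_manifest(packet)

# Changing a single byte in any item changes the packet-level digest.
tampered = dict(packet, **{"prompt.txt": b"Allocate 100% across the universe!"})
tampered_manifest = build_manifest(tampered)
```

Recording the packet-level digest in the Evaluation Plan before execution would let any party confirm afterwards that every system received the same inputs.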

Client time: no live meeting required after approval
Output: frozen allocations and audit hashes

5 Day 2-3

Consistency runs

The same task is repeated under the pre-approved consistency protocol. These runs measure whether each system maintains a similar allocation or changes materially from one attempt to the next.
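A rough sketch of how repeated runs could be compared. The page does not specify the similarity measure, so two stand-ins are assumed here: average pairwise Jaccard overlap of the held assets, and the per-asset standard deviation of weights. The asset names and run data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, List

Allocation = Dict[str, float]  # asset -> weight as a fraction of the portfolio


def mean_holdings_overlap(runs: List[Allocation]) -> float:
    """Average pairwise Jaccard similarity of held-asset sets (needs 2+ runs)."""
    pairs = list(combinations(runs, 2))
    return mean(len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs)


def weight_dispersion(runs: List[Allocation]) -> Dict[str, float]:
    """Population std dev of each asset's weight across runs (0 when not held)."""
    assets = sorted({asset for run in runs for asset in run})
    return {asset: pstdev([run.get(asset, 0.0) for run in runs]) for asset in assets}


# Three hypothetical repeated runs of the same frozen task.
runs = [
    {"ETF_A": 0.6, "ETF_B": 0.4},
    {"ETF_A": 0.6, "ETF_B": 0.4},
    {"ETF_A": 0.5, "ETF_C": 0.5},
]
overlap = mean_holdings_overlap(runs)
dispersion = weight_dispersion(runs)
```

An overlap near 1 with low dispersion suggests a stable allocation; a low overlap means the system picks materially different portfolios from one attempt to the next.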

Client time: no live meeting required
Output: consistency and allocation-dispersion dataset

6 Approximately seven calendar days

Live outcome window

The official portfolios remain frozen while market prices resolve. Interim mark-to-market data may be shown privately, but no final ranking is issued before the scheduled exit rule.

Client time: none unless a data exception needs approval
Output: final entry and exit price records

7 One to two business days after the exit price

Scoring and analysis

CapitalBench calculates returns, benchmark differences, regret versus the hindsight-best eligible asset, consistency, concentration, risk, validity, cost, latency, and performance attribution.

Client time: review questions only
Output: final scorecard and report

8 Final delivery

Executive readout

CapitalBench walks the team through the findings, limitations, failure modes, and recommended next actions.

Client time: one 60-minute readout
Output: readout and seven days of written follow-up questions

Metrics and scoring

Performance is only one part of the scorecard.

Metric | What it answers
Portfolio return | How did the frozen allocation perform during the defined outcome window?
S&P 500 difference | Did it outperform or underperform the benchmark, and by how much?
CapitalBench Score | How did the return compare with the highest-returning eligible option in the same window?
Regret versus maximum | How much return separated the portfolio from the hindsight-best eligible asset?
Valid-output rate | Did the system reliably return a complete, schema-valid allocation?
Allocation consistency | Did repeated runs produce similar holdings and weights?
Concentration | How much capital was placed in the largest holding or narrowest theme?
Risk appetite | How strongly did the portfolio lean toward growth, momentum, cyclicality, defense, cash, or other risk profiles?
Peer similarity | Did the system follow the frontier-model consensus or take a distinctive position?
Explanation quality | Did its stated rationale match its allocation, identify relevant risks, and avoid unsupported claims?
Latency | How long did the system take to return a valid response?
Cost per valid run | What did each completed response cost when provider usage information was available?

Calculation details

Portfolio return = sum of each holding's weight multiplied by its realized return.

S&P 500 difference = portfolio return minus S&P 500 return.

CapitalBench Score = portfolio return relative to the hindsight-best eligible asset return in that same window.

Regret = hindsight-best eligible asset return minus portfolio return.

Consistency = similarity of holdings and weights across repeated identical tasks.
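The return-based definitions above can be sketched in a few lines. All asset names and return figures below are hypothetical. The CapitalBench Score is left out because its exact formula is not stated here.

```python
from typing import Dict


def portfolio_return(weights: Dict[str, float], realized: Dict[str, float]) -> float:
    """Sum of each holding's weight multiplied by its realized return."""
    return sum(weight * realized[asset] for asset, weight in weights.items())


def regret(portfolio_ret: float, eligible_returns: Dict[str, float]) -> float:
    """Hindsight-best eligible asset return minus the portfolio return."""
    return max(eligible_returns.values()) - portfolio_ret


# Hypothetical realized returns over the outcome window (fractions, not percent).
realized = {"ETF_A": 0.020, "ETF_B": 0.010, "ETF_C": -0.005, "CASH": 0.000}
sp500_return = 0.010  # hypothetical benchmark return for the same window

# Hypothetical frozen allocation; weights sum to 1.
weights = {"ETF_A": 0.50, "ETF_B": 0.30, "CASH": 0.20}

port = portfolio_return(weights, realized)  # 0.5*0.020 + 0.3*0.010 + 0.2*0.000
sp500_difference = port - sp500_return
portfolio_regret = regret(port, realized)
```

With these illustrative numbers the portfolio returns 1.3%, which is 0.3 percentage points above the benchmark and 0.7 points below the hindsight-best asset.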

Performance, consistency, cost, and latency remain visible as separate fields. CapitalBench does not hide materially different trade-offs inside one unexplained composite grade.

Deliverables

Everything you receive

1

Approved Evaluation Plan

The exact system, comparator roster, inputs, constraints, retry policy, metrics, timeline, confidentiality setting, and scoring rules.

PDF or Markdown

2

Executive Evaluation Report

A decision-focused report explaining the result, what it means, major risks, limitations, and recommended next actions.

12-20 page PDF

3

Comparative Scorecard

Side-by-side results for the client system and every comparator, including performance, consistency, validity, behavior, latency, and cost where available.

PDF table, CSV, and JSON

4

Performance Attribution

Holding-level explanation of which allocations helped or hurt the official result, with entry and exit prices and comparator context.

Report section and CSV

5

Consistency and Reliability Analysis

Repeated-run evidence covering portfolio overlap, weight dispersion, frequent holdings, invalid responses, confidence dispersion, and official-run representativeness.

Report section and data export

6

Behavior Profile

A concise behavior description generated from measurable allocation evidence. Weak samples are labelled "early sample" rather than overclassified.

Report section

7

Failure-Mode Register

A prioritized record of observed weaknesses, severity, evidence, business impact, recommended action, and retest condition.

Report table

8

Audit Packet

Evaluation manifest, exact prompt, frozen briefing, asset universe, response schema, hashes, raw response references, parsed submissions, run log, price files, calculations, metadata, retry record, methodology version, and known limitations.

Downloadable ZIP

9

Machine-readable export

Structured outputs for internal dashboards, notebooks, product documentation, or future regression tests.

CSV and JSON

10

Executive readout

One 60-minute session plus seven calendar days of email-based clarification. Calculation or transcription errors are corrected and the report reissued at no additional charge.

Video call and written follow-up

Sample report preview

See the evidence before you buy

The ungated sample report uses public CapitalBench round CB-2026-05-28-1W and presents it in the private-report format.

Each section below is a preview of the corresponding private report section, using public benchmark evidence and marked scope limits.

01 Executive summary
02 Comparative scorecard
03 Allocation attribution
04 Consistency analysis
05 Risk and concentration
06 Audit-file index

Security and data handling

Designed to evaluate the system without taking control of it.

The standard sprint does not require source code, model weights, production databases, customer records, brokerage credentials, fund holdings, or access to real capital.

Access method | How it works | Best for
Customer-hosted endpoint | The client creates a temporary, restricted credential for a non-production or isolated endpoint. | Proprietary models and agents
Provider API access | CapitalBench runs the named model through an approved provider account or temporary client-provided key. | Commercial foundation models
Customer-executed runner | CapitalBench supplies the frozen packet and execution instructions. The client returns timestamped raw outputs and logs. | Strict internal security policies

Customer-executed results are clearly identified because CapitalBench did not directly control execution. They may still be useful internally, but they provide weaker independent execution assurance.

Private by default

Client identity, system details, prompts, outputs, scores, and reports are not published without written approval.

Purpose-limited access

Credentials are used only to execute the agreed evaluation.

No training use

CapitalBench does not use private client inputs or outputs to train its own models.

Credential deletion

Temporary credentials are removed after execution is complete.

Defined retention

Unless the SOW states otherwise, private raw outputs are retained for 30 days after final delivery and then deleted. Final report retention is governed by the SOW.

No sensitive production data required

The standard sprint uses a CapitalBench market packet and does not require customer PII or live customer records.

Provider transparency

The Evaluation Plan identifies any third-party model providers used. Their handling of API data remains governed by their applicable terms.

NDA available

A mutual NDA may be completed before private technical details or credentials are shared.

Independence policy

Payment buys the evaluation, not a favorable result.

CapitalBench is paid to design, execute, and document the agreed evaluation. Payment does not determine the score, model ranking, interpretation of valid results, or whether failures appear in the final report.

Inputs and rules are frozen before official execution. Failed and invalid attempts remain part of the record. Results are not replaced because a model selected an undesirable asset or performed poorly.

The client controls confidentiality and publication. It does not control the underlying calculation after the Evaluation Plan has been approved.

No pay-to-rank · No private answer shopping · No retroactive scoring changes · No publication without client approval

Pricing

Fixed scope. Transparent price.

Private CapitalBench Eval Sprint

$3,000 USD

Launch pricing for the first five completed private engagements.

Request a Founding Sprint

Included

  • One client system or configuration
  • Up to four frontier comparator models
  • One official live weekly evaluation
  • Five consistency runs per system
  • Current CapitalBench asset universe
  • Performance, risk, concentration, and consistency scoring
  • Cost and latency analysis where available
  • 12-20 page private report
  • CSV and JSON exports
  • Complete audit packet
  • 60-minute executive readout
  • Seven days of written follow-up

Payment terms

50% after the Evaluation Plan and SOW are approved. 50% when the final report and audit packet are delivered.

Cost details

Comparator model API costs within the agreed scope are included. Usage charges incurred through the client's own endpoint or provider account remain the client's responsibility. Applicable taxes are additional.

Future standard price

Standard price after the first five completed engagements: $5,000 USD

No payment is required to submit a request. CapitalBench confirms fit, access method, comparator roster, timeline, and fixed scope before issuing an invoice.

Add-ons

Add-on | Price
Additional client model, prompt, or configuration | $750
Additional live weekly round | $1,250
One-month outcome track | $2,000
Custom asset universe or scoring rubric | From $1,500
Production workflow mode with tools or retrieval | From $2,500
Additional executive or technical workshop | $500
Ongoing private monitoring | From $2,500/month

Every add-on is agreed in writing before execution. CapitalBench does not add unapproved hourly charges.

Fit

Built for teams putting AI near investment decisions

AI finance and research products

Evaluate the model behind an AI analyst, investment copilot, portfolio assistant, or research workflow.

Fintech and wealthtech companies

Compare vendors or configurations before committing engineering effort to a production integration.

Funds and investment research teams

Evaluate an internal model, agent, or research workflow without connecting CapitalBench to brokerage systems or real capital.

Model providers and evaluation teams

Measure domain-specific allocation behavior that generic knowledge and reasoning benchmarks do not directly reveal.

Investors and diligence teams

Independently inspect performance claims made by a company building an AI investment product.

Limitations

What a responsible evaluation can and cannot prove

A one-week market result can be affected by noise, unusual events, and the chosen asset universe. It is a real outcome, but it is not by itself proof of persistent investment skill.

The result applies only to the tested model version, configuration, input packet, constraints, and outcome window. A model update or workflow change may produce different behavior.

CapitalBench therefore reports performance alongside consistency, validity, concentration, risk, cost, latency, and explicit limitations. Companies seeking stronger evidence should run multiple weekly and monthly rounds.

FAQ

Complete FAQ

Scope and fit

What is a Private CapitalBench Eval Sprint?

It is a fixed-scope evaluation of one AI model, agent, workflow, or configuration on a controlled capital-allocation task. The system is compared with frontier models using frozen inputs and an agreed scoring protocol.

What business decision is it designed to support?

Common decisions include selecting a model vendor, validating a new release, comparing a fine-tune with its base model, diagnosing unstable behavior, establishing a regression baseline, or substantiating a limited performance claim.

What kinds of systems can be evaluated?

API-accessible foundation models, fine-tuned models, proprietary models, RAG systems, investment research agents, prompt workflows, and third-party AI products may be evaluated when a stable execution method can be established.

Can you evaluate two versions of our system?

The standard sprint includes one client configuration. A second model, prompt, tool setup, or version can be added for $750.

Can you evaluate a system that does not produce portfolios today?

Usually, provided it can return the required structured output. CapitalBench supplies the schema and runs a non-scored validation test before the official evaluation.
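For a sense of what a structured-output check might involve, here is a minimal sketch based on the portfolio format described in the sprint scope (one to five holdings, fixed increments, 100% total). The increment of 5 percentage points, the asset names, and the `validate_allocation` helper are assumptions. The actual schema is supplied by CapitalBench.

```python
from typing import Dict, List, Set


def validate_allocation(
    weights_pct: Dict[str, int],
    universe: Set[str],
    max_holdings: int = 5,  # sprint scope: one to five holdings
    increment: int = 5,     # hypothetical; the real increment is set in the Evaluation Plan
) -> List[str]:
    """Return a list of rule violations; an empty list means the allocation passes."""
    errors: List[str] = []
    if not 1 <= len(weights_pct) <= max_holdings:
        errors.append(f"expected 1-{max_holdings} holdings, got {len(weights_pct)}")
    for asset, weight in weights_pct.items():
        if asset not in universe:
            errors.append(f"{asset} is not in the asset universe")
        if weight <= 0 or weight % increment != 0:
            errors.append(f"{asset} weight {weight} is not a positive multiple of {increment}")
    total = sum(weights_pct.values())
    if total != 100:
        errors.append(f"weights total {total}, expected 100")
    return errors


universe = {"ETF_A", "ETF_B", "ETF_C", "CASH"}  # hypothetical universe
ok = validate_allocation({"ETF_A": 60, "ETF_B": 40}, universe)
bad = validate_allocation({"ETF_A": 62, "ETF_Z": 30}, universe)
```

Returning every violation, instead of stopping at the first, makes a non-scored validation call more useful when adapting a system that does not produce portfolios today.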

Do you need access to our source code or model weights?

No. The standard evaluation requires only a working execution method, such as an API endpoint, temporary provider credential, or customer-executed runner.

Do you need our brokerage account or real capital?

No. CapitalBench evaluates simulated allocations and market outcomes. It does not place trades or connect to brokerage accounts.

Method and fairness

How are comparator models selected?

The buyer and CapitalBench agree on up to four currently supported comparators before execution. The exact provider, model identifier, and version information available at the time are recorded in the Evaluation Plan.

Does every model receive the same information?

In Controlled Benchmark Mode, every system receives the same approved prompt, factual briefing, asset universe, market-data packet, portfolio constraints, and response schema.

Can a model browse the web or use tools?

Not in the standard Controlled Benchmark Mode. Tools and live retrieval are disabled where technically possible. A tool-enabled Production Workflow Mode is available as an add-on and is reported separately.

What happens if a model returns invalid output?

The raw attempt is retained and marked invalid. A retry is permitted only under the pre-approved policy for transport, truncation, malformed structured output, or similar infrastructure failure. A model is not retried merely because its allocation appears weak.
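The retry rule can be pictured as a small decision function. This is only a sketch: the failure categories mirror the examples in the answer above, while the `FailureKind` names and the `max_attempts` cap are hypothetical. The real conditions are fixed in the Evaluation Plan.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSPORT = "transport"
    TRUNCATION = "truncation"
    MALFORMED_OUTPUT = "malformed_structured_output"
    NONE = "none"  # a complete, schema-valid response, however weak it looks


# Only infrastructure-style failures qualify for a retry.
RETRYABLE = {FailureKind.TRANSPORT, FailureKind.TRUNCATION, FailureKind.MALFORMED_OUTPUT}


def may_retry(kind: FailureKind, attempts_used: int, max_attempts: int = 2) -> bool:
    """Allow a retry only for a retryable failure and while attempts remain."""
    return kind in RETRYABLE and attempts_used < max_attempts


after_timeout = may_retry(FailureKind.TRANSPORT, attempts_used=1)
after_weak_answer = may_retry(FailureKind.NONE, attempts_used=1)
after_cap_reached = may_retry(FailureKind.TRUNCATION, attempts_used=2)
```

A valid but unattractive allocation maps to `FailureKind.NONE` here, so it can never trigger a retry.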

Do you choose the best answer from repeated runs?

No. The live outcome score uses one official frozen response. Repeated runs are used only for separately reported consistency analysis.

Can the rules change after the market result is known?

No material scoring or input rule is changed after the official responses are collected. Any necessary pre-outcome amendment is documented and approved.

How are prices selected?

The Evaluation Plan states the market-data source, entry rule, exit rule, benchmark, adjusted-price treatment, and fallback for non-trading days or unavailable data before execution.

Why use a live market window instead of only historical data?

Live execution reduces the risk that models already contain or infer the outcome. Historical replay may be useful for diagnostics, but it is labelled separately and is not presented as equivalent to a prospective live result.

Is one week enough to determine which model is best?

No. One round is a defined decision event, not proof of persistent skill. It can still expose differences in allocation, consistency, concentration, validity, latency, and behavior. Multiple rounds provide stronger evidence.

Will we receive a pass or fail?

Only when measurable thresholds are agreed before execution. Otherwise, the report provides a comparative scorecard and a recommendation such as ship, iterate, switch, or gather more evidence.

Data and confidentiality

Are the evaluation and results private?

Yes. Private is the default. The client's identity, system, prompts, outputs, scores, and report are not published without written approval.

Will CapitalBench sign an NDA?

A mutual NDA can be completed before confidential technical details or access credentials are exchanged.

Will our data be used for training?

CapitalBench does not use private client inputs or outputs to train its own models. Any third-party API handling remains subject to the named provider's terms.

How long is private data retained?

Raw private artifacts are retained for 30 days after final delivery and then deleted, unless the SOW specifies another period. Temporary credentials are removed after execution is complete.

Are you SOC 2 or ISO 27001 certified?

CapitalBench does not currently represent itself as SOC 2 or ISO 27001 certified. The standard sprint is designed not to require PII, production customer data, source code, or brokerage access. Customer-hosted execution is available where internal policy requires it.

Can the evaluation run inside our environment?

A customer-executed runner or customer-hosted endpoint can be used. The report clearly identifies who controlled execution and any resulting assurance limitations.

Results and publication

What happens if our system performs poorly?

The result remains part of the private report. CapitalBench does not replace it with a more favorable run. The report focuses on why the system behaved as it did and what should change before a retest.

Can we publish the result?

Yes, with written agreement on the exact system version, evaluation date, scope, limitations, and wording. Publication is optional.

Can we use the CapitalBench name or logo in marketing?

Only with written approval. Any public statement must identify the tested version and evaluation window and must not imply certification, durable outperformance, or a broader endorsement than the evidence supports.

Does paying for the sprint guarantee a favorable result?

No. Payment covers the evaluation work, report, and delivery. It does not purchase a ranking, score, testimonial, badge, or endorsement.

Does CapitalBench publish negative private evaluations?

Not without client approval. Private results remain private, but they are not removed from the report delivered to the client.

Pricing and delivery

What is included in the $3,000 price?

One client configuration, up to four comparators, one live weekly round, five consistency runs per system, a 12-20 page report, CSV and JSON exports, an audit packet, a 60-minute readout, and seven days of written follow-up.

Are model API costs included?

Comparator model usage within the agreed scope is included. Charges incurred through the client's own endpoint or provider account remain the client's responsibility.

How long does the sprint take?

A standard weekly sprint typically takes 10-14 calendar days after access and scope are approved. Market holidays, data exceptions, or client delays may extend delivery.

How much time does our team need to provide?

Usually two to three hours in total: one kickoff, technical connection support, Evaluation Plan approval, and the final readout.

When is payment due?

Fifty percent is due after the SOW and Evaluation Plan are approved. The remaining fifty percent is due when the report and audit packet are delivered.

What is the cancellation policy?

Before technical setup begins, the deposit is refundable. After the Evaluation Plan is approved and official execution begins, the deposit is non-refundable. If CapitalBench cannot complete the agreed scope, fees for undelivered work are refunded.

Is there an ongoing subscription?

No subscription is required for the sprint. Additional rounds or ongoing monitoring are optional.

What happens after the sprint?

The client can implement the recommendations, test a revised configuration, add monthly rounds, or establish ongoing private monitoring. Any next engagement is separately scoped.

Is this investment advice?

No. CapitalBench evaluates AI-system behavior using simulated allocations and market data. It does not recommend that the client or any other person trade the evaluated portfolio.

Is this a regulatory or model-risk certification?

No. The report is research and evaluation evidence, not regulatory approval, legal advice, fiduciary advice, a security audit, or a guarantee of production suitability.

Request scope review

Your AI may sound convincing. Find out how it allocates.

Submit the system you want evaluated and the decision you need to make. CapitalBench will confirm fit, access method, comparator roster, timing, and fixed scope before you pay.

No payment at submission. Private by default. NDA available before technical access.

Intake form

Request Scope Review

The next step is a fit review followed by a proposed Evaluation Plan and fixed SOW. No payment or system access is requested until scope is agreed.

Procurement

Procurement and billing

Service type | Fixed-scope professional model evaluation
Contract | Mutual NDA and SOW available
Billing currency | USD
Payment | Invoice and supported electronic payment methods
Primary contact | [email protected]
Security inquiries | [email protected]
Contracting legal entity | Specified in the SOW before invoice issuance
Business jurisdiction | Specified in the SOW before invoice issuance
Billing address | Provided on the invoice and SOW
Tax information | Provided on the invoice where applicable
Legal contact | [email protected]

CapitalBench provides AI-model evaluation and research services. Results apply only to the tested system, version, configuration, inputs, constraints, and evaluation window. Results do not constitute investment advice, a recommendation to trade, certification, regulatory approval, endorsement, or a guarantee of future performance. Short market windows may be affected by noise and unusual events. The applicable SOW governs confidentiality, data handling, permitted use, payment, and delivery.

Request a Private Eval