PRIVATE CAPITALBENCH EVALS
CapitalBench independently evaluates your model, agent, or workflow using the same frozen market brief, portfolio constraints, and outcome window as leading AI systems.
You receive a private comparative scorecard, consistency analysis, failure-mode review, machine-readable results, and a complete audit packet.
Private by default · No source code or model weights required · No real capital traded
Launch price: $3,000 USD · Typical completion: 10-14 calendar days · Fixed scope
Every system receives the same approved evaluation packet.
Compare against up to four supported frontier models.
Official performance resolves using actual market prices.
Repeated runs reveal unstable or highly variable decisions.
Inputs, outputs, timestamps, hashes, prices, and calculations are preserved.
A model can produce an impressive investment thesis and still make unstable, overconcentrated, expensive, or poorly calibrated allocation decisions.
A Private CapitalBench Eval turns one specific model configuration into a documented decision record. It shows what the system selected, how consistently it selected it, what risk it took, how it compared with frontier alternatives, and how the allocation performed after the market window closed.
Compare your current system with frontier alternatives under identical conditions.
Test the exact model, prompt, workflow, and configuration you are considering shipping.
Identify concentration, instability, invalid outputs, excessive risk-taking, weak diversification, or poor cost-performance trade-offs.
Give product leaders, investors, research teams, and customers evidence that is more defensible than screenshots or selected examples.
| Scope item | Standard sprint |
|---|---|
| Client system | One model, agent, workflow, or configuration |
| Comparator roster | Up to four currently supported frontier models |
| Official evaluation | One live weekly CapitalBench allocation round |
| Consistency evaluation | Five separate non-official repeated runs per system |
| Asset universe | Current CapitalBench public ETF and cash universe |
| Portfolio format | One to five holdings, fixed allocation increments, 100% total |
| Information access | Same frozen brief, prompt, universe, and market-data packet |
| Base evaluation mode | Browsing, tools, and live retrieval disabled |
| Performance window | Approximately one week |
| Final output | Report, scorecard, data export, audit bundle, and readout |
| Confidentiality | Private unless the client approves publication in writing |
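The portfolio format in the table above (one to five holdings, fixed allocation increments, 100% total) can be checked mechanically. The following is a minimal sketch, not CapitalBench's actual validator. It assumes an allocation arrives as a mapping from ticker to integer percentage weight. The function name, the placeholder tickers, and the 5% increment are illustrative assumptions, since the page does not state the increment size, and cash is treated here as an ordinary holding.

```python
def validate_allocation(weights, increment=5, max_holdings=5):
    """Return a list of rule violations; an empty list means the allocation passes.

    `weights` maps a ticker (or "CASH") to an integer percentage weight.
    `increment` is an assumed value; the real increment and cash rules
    would come from the approved Evaluation Plan.
    """
    errors = []
    # Ignore zero-weight entries so they do not count as holdings.
    holdings = {ticker: w for ticker, w in weights.items() if w != 0}

    if not 1 <= len(holdings) <= max_holdings:
        errors.append(f"expected 1-{max_holdings} holdings, got {len(holdings)}")

    for ticker, w in holdings.items():
        if not isinstance(w, int) or w < 0:
            errors.append(f"{ticker}: weight must be a positive integer percent")
        elif w % increment != 0:
            errors.append(f"{ticker}: {w}% is not a multiple of {increment}%")

    total = sum(holdings.values())
    if total != 100:
        errors.append(f"weights sum to {total}%, expected 100%")
    return errors


# Placeholder tickers, not taken from the CapitalBench universe.
print(validate_allocation({"ETF_A": 60, "ETF_B": 40}))   # passes: no violations
print(validate_allocation({"ETF_A": 62, "ETF_B": 38}))   # fails the increment rule
```

Returning every violation, rather than stopping at the first, keeps the full reason for an invalid response available for the audit record.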
A sprint may include both evaluation modes, but their scores are never mixed into one misleading comparison.
In Controlled Benchmark Mode, the system receives the same frozen external information as every comparator. Browsing, search, code execution, live prices, and external retrieval are disabled where technically possible.
This mode isolates allocation judgment and creates the fairest comparison with the public CapitalBench benchmark.
In Production Workflow Mode, the system runs using its normal retrieval, tools, prompts, or internal workflow.
Production-mode results are reported separately because a tool-enabled agent should not be placed on the same headline ranking as a closed-capability model receiving only frozen inputs.
Every buyer receives a short Evaluation Plan before execution.
| Field | What is frozen |
|---|---|
| System under test | Model name, version, endpoint, prompt configuration, and relevant settings |
| Primary business question | The decision the company wants the evaluation to support |
| Comparator roster | Exact comparator models and versions |
| Evaluation mode | Controlled or production workflow |
| Input packet | Prompt, briefing, market data, asset universe, and schema |
| Portfolio constraints | Holdings limit, allocation increments, cash rules, and total weight |
| Official-run policy | Number of official calls and allowed retry conditions |
| Consistency protocol | Number of repeats and settings |
| Market window | Entry rule, exit rule, benchmark, and date fallback |
| Metrics | Performance, consistency, validity, behavior, latency, and cost fields |
| Success thresholds | Any agreed internal release or acceptance thresholds |
| Confidentiality | Private, anonymized, or approved for public publication |
| Retention | Credential, raw-output, and report retention periods |
Nothing material is changed after the official outputs have been collected. If a rule must change because of a technical issue, the change is documented and approved before the market outcome is known.
You submit your system type, business question, preferred access method, and desired timing. CapitalBench confirms whether the system can be evaluated fairly.
A kickoff call defines the tested configuration, comparators, success criteria, security requirements, and evaluation mode.
The system is connected through a customer-hosted endpoint, temporary API credential, or customer-executed runner. A non-scored validation call confirms schema compatibility and access.
The prompt, briefing, asset universe, constraints, model roster, and scoring rules are finalized and hashed. Every system receives the approved packet. One official response is collected from each system. Invalid and failed responses remain part of the audit record.
The same task is repeated under the pre-approved consistency protocol. These runs measure whether each system maintains a similar allocation or changes materially from one attempt to the next.
The official portfolios remain frozen while market prices resolve. Interim mark-to-market data may be shown privately, but no final ranking is issued before the scheduled exit rule.
CapitalBench calculates returns, benchmark differences, comparison with the hindsight-best eligible asset, consistency, concentration, risk, validity, cost, latency, and performance attribution.
CapitalBench walks the team through the findings, limitations, failure modes, and recommended next actions.
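The freeze step above says the packet is "finalized and hashed" without naming an algorithm. As one way to picture it, the sketch below assumes the packet is serialized as canonical JSON and digested with SHA-256. Both choices, the `packet_hash` name, and the placeholder field values are assumptions for illustration, not a description of CapitalBench's tooling.

```python
import hashlib
import json


def packet_hash(packet):
    """Return the SHA-256 hex digest of a packet serialized as canonical JSON.

    Sorting keys and fixing separators makes the digest independent of
    dictionary ordering and whitespace, so equal packets hash equally.
    """
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder content; a real packet would hold the approved prompt,
# briefing, asset universe, constraints, roster, and scoring rules.
packet = {
    "prompt": "placeholder prompt text",
    "asset_universe": ["ETF_A", "ETF_B", "CASH"],
    "constraints": {"max_holdings": 5, "total_weight": 100},
}
frozen_digest = packet_hash(packet)

# Any later edit to the packet produces a different digest.
edited = dict(packet, prompt="edited prompt text")
print(packet_hash(edited) == frozen_digest)  # False
```

Recording the digest in the Evaluation Plan before official execution would let anyone holding the audit bundle confirm that the packet they received matches what every system was given.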
| Metric | What it answers |
|---|---|
| Portfolio return | How did the frozen allocation perform during the defined outcome window? |
| S&P 500 difference | Did it outperform or underperform the benchmark, and by how much? |
| CapitalBench Score | How did the return compare with the highest-returning eligible option in the same window? |
| Regret versus maximum | How much return separated the portfolio from the hindsight-best eligible asset? |
| Valid-output rate | Did the system reliably return a complete, schema-valid allocation? |
| Allocation consistency | Did repeated runs produce similar holdings and weights? |
| Concentration | How much capital was placed in the largest holding or narrowest theme? |
| Risk appetite | How strongly did the portfolio lean toward growth, momentum, cyclicality, defense, cash, or other risk profiles? |
| Peer similarity | Did the system follow the frontier-model consensus or take a distinctive position? |
| Explanation quality | Did its stated rationale match its allocation, identify relevant risks, and avoid unsupported claims? |
| Latency | How long did the system take to return a valid response? |
| Cost per valid run | What did each completed response cost when provider usage information was available? |
Portfolio return = sum of each holding's weight multiplied by its realized return.
S&P 500 difference = portfolio return minus S&P 500 return.
CapitalBench Score = portfolio return relative to the hindsight-best eligible asset return in that same window.
Regret = hindsight-best eligible asset return minus portfolio return.
Consistency = similarity of holdings and weights across repeated identical tasks.
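The return, benchmark-difference, and regret definitions above translate directly into a few lines of arithmetic. The sketch below uses made-up tickers and returns purely for illustration, with weights expressed as integer percentages. The CapitalBench Score is omitted deliberately because the page describes it only as "relative to" the hindsight-best return and does not give the exact formula.

```python
def portfolio_return(weights_pct, asset_returns):
    """Sum of each holding's weight (in percent) times its realized return."""
    return sum(w / 100 * asset_returns[ticker] for ticker, w in weights_pct.items())


def benchmark_difference(port_ret, benchmark_ret):
    """Portfolio return minus the benchmark (S&P 500) return."""
    return port_ret - benchmark_ret


def regret(port_ret, eligible_returns):
    """Hindsight-best eligible asset return minus the portfolio return."""
    return max(eligible_returns.values()) - port_ret


# Hypothetical one-week figures, expressed as fractions (0.02 means 2%).
weights = {"ETF_A": 50, "ETF_B": 30, "CASH": 20}
realized = {"ETF_A": 0.02, "ETF_B": -0.01, "ETF_C": 0.04, "CASH": 0.0}
sp500_return = 0.01

r = portfolio_return(weights, realized)
print(f"portfolio return:   {r:.4f}")
print(f"S&P 500 difference: {benchmark_difference(r, sp500_return):.4f}")
print(f"regret:             {regret(r, realized):.4f}")
```

In this made-up example the portfolio gains 0.7%, trails the benchmark by 0.3 percentage points, and sits 3.3 percentage points below the best single asset in the universe.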
Performance, consistency, cost, and latency remain visible as separate fields. CapitalBench does not hide materially different trade-offs inside one unexplained composite grade.
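Consistency is defined above only as "similarity of holdings and weights", so the exact measure is left open. One simple candidate, shown here as an assumption rather than CapitalBench's method, is the shared weight between two allocations (the sum of the smaller weight for each ticker), averaged over every pair of repeated runs. The function names and tickers are hypothetical.

```python
from itertools import combinations


def weight_overlap(a, b):
    """Percentage of capital two allocations place in the same assets.

    Both arguments map ticker -> integer percent weight. The result is 100
    for identical allocations and 0 when no holding is shared.
    """
    tickers = set(a) | set(b)
    return sum(min(a.get(t, 0), b.get(t, 0)) for t in tickers)


def mean_pairwise_overlap(runs):
    """Average weight overlap across every pair of repeated runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure consistency")
    return sum(weight_overlap(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical repeated runs of the same task.
runs = [
    {"ETF_A": 60, "ETF_B": 40},
    {"ETF_A": 60, "ETF_B": 40},
    {"ETF_A": 40, "ETF_C": 60},
]
print(mean_pairwise_overlap(runs))  # 60.0
```

A single number like this hides which holdings moved, so it would sit alongside the holding-level evidence rather than replace it.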
The exact system, comparator roster, inputs, constraints, retry policy, metrics, timeline, confidentiality setting, and scoring rules. Format: PDF or Markdown.
A decision-focused report explaining the result, what it means, major risks, limitations, and recommended next actions. Format: 12-20 page PDF.
Side-by-side results for the client system and every comparator, including performance, consistency, validity, behavior, latency, and cost where available. Format: PDF table, CSV, and JSON.
Holding-level explanation of which allocations helped or hurt the official result, with entry and exit prices and comparator context. Format: report section and CSV.
Repeated-run evidence covering portfolio overlap, weight dispersion, frequent holdings, invalid responses, confidence dispersion, and official-run representativeness. Format: report section and data export.
A concise behavior description generated from measurable allocation evidence. Weak samples are labelled as early sample rather than overclassified. Format: report section.
A prioritized record of observed weaknesses, severity, evidence, business impact, recommended action, and retest condition. Format: report table.
Evaluation manifest, exact prompt, frozen briefing, asset universe, response schema, hashes, raw response references, parsed submissions, run log, price files, calculations, metadata, retry record, methodology version, and known limitations. Format: downloadable ZIP.
Structured outputs for internal dashboards, notebooks, product documentation, or future regression tests. Format: CSV and JSON.
One 60-minute session plus seven calendar days of email-based clarification. Calculation or transcription errors are corrected and the report reissued at no additional charge. Format: video call and written follow-up.
The ungated sample report uses public CapitalBench round CB-2026-05-28-1W and presents it in the private-report format.
Preview of the private report section, using public benchmark evidence and marked scope limits.
The standard sprint does not require source code, model weights, production databases, customer records, brokerage credentials, fund holdings, or access to real capital.
| Access method | How it works | Best for |
|---|---|---|
| Customer-hosted endpoint | The client creates a temporary, restricted credential for a non-production or isolated endpoint. | Proprietary models and agents |
| Provider API access | CapitalBench runs the named model through an approved provider account or temporary client-provided key. | Commercial foundation models |
| Customer-executed runner | CapitalBench supplies the frozen packet and execution instructions. The client returns timestamped raw outputs and logs. | Strict internal security policies |
Customer-executed results are clearly identified because CapitalBench did not directly control execution. They may still be useful internally, but they provide weaker independent execution assurance.
Client identity, system details, prompts, outputs, scores, and reports are not published without written approval.
Credentials are used only to execute the agreed evaluation.
CapitalBench does not use private client inputs or outputs to train its own models.
Temporary credentials are removed after execution is complete.
Unless the SOW states otherwise, private raw outputs are retained for 30 days after final delivery and then deleted. Final report retention is governed by the SOW.
The standard sprint uses a CapitalBench market packet and does not require customer PII or live customer records.
The Evaluation Plan identifies any third-party model providers used. Their handling of API data remains governed by their applicable terms.
A mutual NDA may be completed before private technical details or credentials are shared.
CapitalBench is paid to design, execute, and document the agreed evaluation. Payment does not determine the score, model ranking, interpretation of valid results, or whether failures appear in the final report.
Inputs and rules are frozen before official execution. Failed and invalid attempts remain part of the record. Results are not replaced because a model selected an undesirable asset or performed poorly.
The client controls confidentiality and publication. It does not control the underlying calculation after the Evaluation Plan has been approved.
Launch pricing for the first five completed private engagements.
50% is due after the Evaluation Plan and SOW are approved. The remaining 50% is due when the final report and audit packet are delivered.
Comparator model API costs within the agreed scope are included. Usage charges incurred through the client's own endpoint or provider account remain the client's responsibility. Applicable taxes are additional.
Standard price after the first five completed engagements: $5,000 USD
No payment is required to submit a request. CapitalBench confirms fit, access method, comparator roster, timeline, and fixed scope before issuing an invoice.
| Add-on | Price |
|---|---|
| Additional client model, prompt, or configuration | $750 |
| Additional live weekly round | $1,250 |
| One-month outcome track | $2,000 |
| Custom asset universe or scoring rubric | From $1,500 |
| Production workflow mode with tools or retrieval | From $2,500 |
| Additional executive or technical workshop | $500 |
| Ongoing private monitoring | From $2,500/month |
Every add-on is agreed in writing before execution. CapitalBench does not add unapproved hourly charges.
Evaluate the model behind an AI analyst, investment copilot, portfolio assistant, or research workflow.
Compare vendors or configurations before committing engineering effort to a production integration.
Evaluate an internal model, agent, or research workflow without connecting CapitalBench to brokerage systems or real capital.
Measure domain-specific allocation behavior that generic knowledge and reasoning benchmarks do not directly reveal.
Independently inspect performance claims made by a company building an AI investment product.
A one-week market result can be affected by noise, unusual events, and the chosen asset universe. It is a real outcome, but it is not by itself proof of persistent investment skill.
The result applies only to the tested model version, configuration, input packet, constraints, and outcome window. A model update or workflow change may produce different behavior.
CapitalBench therefore reports performance alongside consistency, validity, concentration, risk, cost, latency, and explicit limitations. Companies seeking stronger evidence should run multiple weekly and monthly rounds.
It is a fixed-scope evaluation of one AI model, agent, workflow, or configuration on a controlled capital-allocation task. The system is compared with frontier models using frozen inputs and an agreed scoring protocol.
Common decisions include selecting a model vendor, validating a new release, comparing a fine-tune with its base model, diagnosing unstable behavior, establishing a regression baseline, or substantiating a limited performance claim.
API-accessible foundation models, fine-tuned models, proprietary models, RAG systems, investment research agents, prompt workflows, and third-party AI products may be evaluated when a stable execution method can be established.
The standard sprint includes one client configuration. A second model, prompt, tool setup, or version can be added for $750.
Usually, provided it can return the required structured output. CapitalBench supplies the schema and runs a non-scored validation test before the official evaluation.
No. The standard evaluation requires only a working execution method, such as an API endpoint, temporary provider credential, or customer-executed runner.
No. CapitalBench evaluates simulated allocations and market outcomes. It does not place trades or connect to brokerage accounts.
The buyer and CapitalBench agree on up to four currently supported comparators before execution. The exact provider, model identifier, and version information available at the time are recorded in the Evaluation Plan.
In Controlled Benchmark Mode, every system receives the same approved prompt, factual briefing, asset universe, market-data packet, portfolio constraints, and response schema.
Not in the standard Controlled Benchmark Mode. Tools and live retrieval are disabled where technically possible. A tool-enabled Production Workflow Mode is available as an add-on and is reported separately.
The raw attempt is retained and marked invalid. A retry is permitted only under the pre-approved policy for transport, truncation, malformed structured output, or similar infrastructure failure. A model is not retried merely because its allocation appears weak.
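The retry rule above can be read as a small decision function whose only inputs are the failure category and the retry budget. The sketch below is one hypothetical encoding. The category names follow the answer above, while `max_retries` stands in for whatever limit the Evaluation Plan's official-run policy sets.

```python
from enum import Enum


class FailureKind(Enum):
    NONE = "none"                        # a complete, schema-valid response
    TRANSPORT = "transport"              # network or provider error
    TRUNCATION = "truncation"            # response cut off before completion
    MALFORMED_OUTPUT = "malformed_output"  # structured output failed to parse


# Only infrastructure-style failures are eligible for a retry.
RETRYABLE = frozenset(
    {FailureKind.TRANSPORT, FailureKind.TRUNCATION, FailureKind.MALFORMED_OUTPUT}
)


def may_retry(failure, retries_used, max_retries):
    """Return True if the pre-approved policy allows another attempt.

    Allocation quality is deliberately not an input: a valid response is
    never retried, however weak the portfolio looks.
    """
    return failure in RETRYABLE and retries_used < max_retries


print(may_retry(FailureKind.TRANSPORT, retries_used=0, max_retries=1))  # True
print(may_retry(FailureKind.NONE, retries_used=0, max_retries=1))       # False
```

Every attempt, retried or not, would still be written to the run log so that the invalid raw output stays in the audit record.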
No. The live outcome score uses one official frozen response. Repeated runs are used only for separately reported consistency analysis.
No material scoring or input rule is changed after the official responses are collected. Any necessary pre-outcome amendment is documented and approved.
The Evaluation Plan states the market-data source, entry rule, exit rule, benchmark, adjusted-price treatment, and fallback for non-trading days or unavailable data before execution.
Live execution reduces the risk that models already contain or infer the outcome. Historical replay may be useful for diagnostics, but it is labelled separately and is not presented as equivalent to a prospective live result.
No. One round is a defined decision event, not proof of persistent skill. It can still expose differences in allocation, consistency, concentration, validity, latency, and behavior. Multiple rounds provide stronger evidence.
Only when measurable thresholds are agreed before execution. Otherwise, the report provides a comparative scorecard and a recommendation such as ship, iterate, switch, or gather more evidence.
Yes. Private is the default. The client's identity, system, prompts, outputs, scores, and report are not published without written approval.
A mutual NDA can be completed before confidential technical details or access credentials are exchanged.
CapitalBench does not use private client inputs or outputs to train its own models. Any third-party API handling remains subject to the named provider's terms.
Raw private artifacts are retained for 30 days after final delivery by default, unless the SOW specifies another period. Temporary credentials are removed after execution is complete.
CapitalBench does not currently represent itself as SOC 2 or ISO 27001 certified. The standard sprint is designed not to require PII, production customer data, source code, or brokerage access. Customer-hosted execution is available where internal policy requires it.
A customer-executed runner or customer-hosted endpoint can be used. The report clearly identifies who controlled execution and any resulting assurance limitations.
The result remains part of the private report. CapitalBench does not replace it with a more favorable run. The report focuses on why the system behaved as it did and what should change before a retest.
Yes, with written agreement on the exact system version, evaluation date, scope, limitations, and wording. Publication is optional.
Only with written approval. Any public statement must identify the tested version and evaluation window and must not imply certification, durable outperformance, or a broader endorsement than the evidence supports.
No. Payment covers the evaluation work, report, and delivery. It does not purchase a ranking, score, testimonial, badge, or endorsement.
Not without client approval. Private results remain private, but they are not removed from the report delivered to the client.
One client configuration, up to four comparators, one live weekly round, consistency testing, a report, CSV and JSON exports, an audit packet, a 60-minute readout, and seven days of written follow-up.
Comparator model usage within the agreed scope is included. Charges incurred through the client's own endpoint or provider account remain the client's responsibility.
A standard weekly sprint typically takes 10-14 calendar days after access and scope are approved. Market holidays, data exceptions, or client delays may extend delivery.
Usually two to three hours in total: one kickoff, technical connection support, Evaluation Plan approval, and the final readout.
Fifty percent is due after the SOW and Evaluation Plan are approved. The remaining fifty percent is due when the report and audit packet are delivered.
Before technical setup begins, the deposit is refundable. After the Evaluation Plan is approved and official execution begins, the deposit is non-refundable. If CapitalBench cannot complete the agreed scope, fees for undelivered work are refunded.
No subscription is required for the sprint. Additional rounds or ongoing monitoring are optional.
The client can implement the recommendations, test a revised configuration, add monthly rounds, or establish ongoing private monitoring. Any next engagement is separately scoped.
No. CapitalBench evaluates AI-system behavior using simulated allocations and market data. It does not recommend that the client or any other person trade the evaluated portfolio.
No. The report is research and evaluation evidence, not regulatory approval, legal advice, fiduciary advice, a security audit, or a guarantee of production suitability.
Submit the system you want evaluated and the decision you need to make. CapitalBench will confirm fit, access method, comparator roster, timing, and fixed scope before you pay.
No payment at submission. Private by default. NDA available before technical access.
The next step is a fit review followed by a proposed Evaluation Plan and fixed SOW. No payment or system access is requested until scope is agreed.
CapitalBench provides AI-model evaluation and research services. Results apply only to the tested system, version, configuration, inputs, constraints, and evaluation window. Results do not constitute investment advice, a recommendation to trade, certification, regulatory approval, endorsement, or a guarantee of future performance. Short market windows may be affected by noise and unusual events. The applicable SOW governs confidentiality, data handling, permitted use, payment, and delivery.