# CapitalBench Private Eval Sprint - Evaluation Plan Template

This template is completed and approved before official execution. Material inputs and scoring rules are frozen before official outputs are collected.

## 1. System Under Test

- Client:
- System name:
- Model/provider/version:
- Endpoint or execution method:
- Prompt/configuration identifier:
- Tooling and browsing setting:
- Temperature or sampling settings:
- Other relevant settings:

## 2. Primary Business Question

- Decision supported by this evaluation:
- Internal success threshold, if any:
- Intended use of results:

## 3. Comparator Roster

- Comparator 1:
- Comparator 2:
- Comparator 3:
- Comparator 4:

Comparator provider identifiers and available version metadata are recorded before execution.

## 4. Evaluation Mode

- Controlled Benchmark Mode:
- Production Workflow Mode:
- Customer-executed runner:

Production workflow results are reported separately and are not mixed into a closed-capability headline comparison.

## 5. Frozen Input Packet

- Prompt:
- Briefing:
- Market data:
- Asset universe:
- Response schema:
- Portfolio constraints:
- Hash manifest:

## 6. Official-Run Policy

- Number of official calls:
- Retry conditions:
- Invalid-output handling:
- Transport-failure handling:
- Truncation handling:

One official run determines the live outcome score. Consistency runs are separately reported evidence.

## 7. Consistency Protocol

- Number of repeats:
- Sampling settings:
- Execution mode:
- Similarity metrics:
- Thresholds, if any:

## 8. Market Window

- Entry rule:
- Exit rule:
- Benchmark:
- Price source:
- Adjusted-price treatment:
- Non-trading-day fallback:
- Missing-price fallback:

## 9. Metrics

- Portfolio return
- Portfolio minus S&P 500
- CapitalBench Score
- Regret versus maximum
- Valid-output rate
- Allocation consistency
- Concentration
- Risk appetite
- Peer similarity
- Explanation quality
- Latency
- Cost per valid run

## 10. Confidentiality And Retention

- Confidentiality setting:
- Publication approval required:
- Credential deletion date:
- Raw-output retention:
- Final-report retention:
- Third-party provider terms noted:

## Approval

- Client approver:
- CapitalBench approver:
- Approval date:
- Scope version:
