Question 1

How is the readiness score calculated?

Accepted Answer

The readiness score (0-100) is a weighted average of four dimensions: Conversation Quality (40%), Tool Usage Accuracy (20%), Output Quality (20%), and Diagnostic Accuracy (20%). Each dimension is scored per-scenario based on predefined success criteria and failure indicators.
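The weighting can be illustrated with a short sketch. The four weights come from the answer above. The function name, the dictionary keys, and the assumption that each dimension has already been aggregated across scenarios into a single 0-100 value are illustrative and not taken from the product.

```python
# Weights in percent, as stated in the answer above. Key names are hypothetical.
WEIGHTS = {
    "conversation_quality": 40,
    "tool_usage_accuracy": 20,
    "output_quality": 20,
    "diagnostic_accuracy": 20,
}


def readiness_score(dimension_scores):
    """Combine four 0-100 dimension scores into one 0-100 weighted average."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    # Integer percent weights sum to 100, so divide once at the end.
    weighted_sum = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return weighted_sum / 100


# Assumed example values, not real results.
example_scores = {
    "conversation_quality": 90,
    "tool_usage_accuracy": 70,
    "output_quality": 80,
    "diagnostic_accuracy": 60,
}
print(readiness_score(example_scores))  # 78.0
```

Because Conversation Quality carries twice the weight of any other dimension, a 10-point change there moves the overall score by 4 points, while the same change in any other dimension moves it by 2.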

Question 2

What is a good readiness score?

Accepted Answer

A score below 50 indicates critical failures; the agent is not safe for production. 50-69 indicates significant gaps that require targeted fixes. 70-84 is production-capable with known limitations. 85-94 is a strong performer that handles edge cases well. 95-100 is exceptional and rare, indicating thorough training and robust guardrails.
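The bands above map directly to a lookup. The following is a minimal sketch. The function name and label strings are hypothetical, and it assumes a fractional score such as 69.5 falls into the lower band, which the answer does not specify.

```python
def readiness_band(score):
    """Return the band description for a 0-100 readiness score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 50:
        return "critical failures, not safe for production"
    if score < 70:
        return "significant gaps requiring targeted fixes"
    if score < 85:
        return "production-capable with known limitations"
    if score < 95:
        return "strong performer"
    return "exceptional"


print(readiness_band(78))  # production-capable with known limitations
```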

Question 3

Can I see why each score was given?

Accepted Answer

Yes. The rubric is fully transparent. For every scenario, you see which success criteria were met, which failure indicators were triggered, and the exact conversation turn where issues occurred. You also see the ground-truth response showing what the agent should have said.

Question 4

How does scoring work for agent-type scenarios?

Accepted Answer

Agent scenarios (tool usage, API calls, data retrieval) are scored on whether the agent took the correct actions in the correct order. Did it call the right API? Did it retrieve the right data? Did it create the correct record? The Tool Usage Accuracy dimension captures these checks.
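One way to picture a "correct actions in the correct order" check is as an ordered-subsequence test. The sketch below is an illustration only, not the product's actual scoring logic: the function name and the tool names are hypothetical, and it assumes that extra calls between the expected steps are tolerated.

```python
def actions_in_order(expected, actual):
    """Return True if every expected action appears in `actual`, in order.

    Extra actions in `actual` are allowed (an assumption of this sketch).
    """
    remaining = iter(actual)
    # `step in remaining` advances the iterator past the first match,
    # so later expected steps can only match later actual steps.
    return all(step in remaining for step in expected)


# Hypothetical tool names for illustration.
expected_calls = ["lookup_customer", "fetch_orders", "create_ticket"]

good_run = ["lookup_customer", "fetch_orders", "check_inventory", "create_ticket"]
bad_run = ["fetch_orders", "lookup_customer", "create_ticket"]

print(actions_in_order(expected_calls, good_run))  # True
print(actions_in_order(expected_calls, bad_run))   # False
```

A stricter variant would require `actual == expected` exactly. Which behavior fits depends on whether harmless extra tool calls should count against the agent.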

Question 5

Can I export the readiness report?

Accepted Answer

Yes. The readiness report can be exported as a PDF for sharing with your team, stakeholders, or compliance reviewers. Deep Eval also lets you export training assets as a ZIP file.
