Plans

Pricing

One-time payment per evaluation. No subscriptions. No seat fees.

Free Discovery

Connect your first agent
File analysis — we extract capabilities, limitations, and tools from your agent’s files
20 discovery probes — we test what your agent can and can’t do
Readiness score
Industry auto-detection
Browse matched scenarios

No credit card required.

Request a Demo

Standard Eval

$149 one-time

Everything in Free, plus:

30 scenario simulations — real conversations with angry customers, edge cases, and compliance traps
Every response scored on accuracy, tone, policy compliance, and task completion
Failure analysis — exact turns where your agent broke and why
Readiness report (PDF) — share with your team or stakeholders

Deep Eval

$349 one-time

Everything in Standard, plus:

100 scenario simulations across your full scope
Mock CRM, file systems, and API data — test how your agent handles real workflows
Training assets (ZIP) — skill files, guardrails, and routing rules to fix the gaps we found
Custom scenarios generated for YOUR agent’s specific capabilities
Re-evaluate at 90 days — verify your fixes worked

Frequently Asked Questions

What types of agents can I test?

Any AI agent with an API endpoint or configuration files. We support Claude Code agents with skill files, Custom GPTs with system prompts, and any agent that responds to HTTP requests. For skill-file agents, upload your .md, .txt, .json, or .yaml files directly — no API endpoint needed. For API agents, we support OpenAI, Anthropic, and custom request formats.

How long does an evaluation take?

Discovery takes about 30 seconds for skill-file agents and 2-3 minutes for API agents. A 30-scenario Standard Evaluation takes 15-20 minutes. A 100-scenario Deep Evaluation takes about 45 minutes. Results are available as soon as the run completes.

Do I need to give you access to my systems?

No, but you should connect a test environment, not production. For API agents, we send realistic messages to your endpoint and score the responses. This includes scenarios that ask your agent to create records, send emails, process refunds, and update data. If your agent performs real actions, those will execute against whatever system it's connected to.

We send an X-Test-Mode: true header with every request. If your agent supports it, use this header to disable side effects during evaluation. If it doesn't, connect a staging endpoint instead. We never access your databases, CRM, or internal systems directly. We interact only through the same interface your users would.

For skill-file agents, there's no risk: we simulate the entire environment using mock infrastructure, and no real systems are touched.
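As a rough illustration of honoring the X-Test-Mode header, the sketch below shows one way an agent's backend might skip side effects for evaluation traffic. Only the header name and its `true` value come from the answer above. The `process_refund` handler, its return shape, and the "simulated" status are hypothetical names chosen for this example, not part of any real API.

```python
from typing import Mapping

TEST_MODE_HEADER = "X-Test-Mode"  # header name taken from the FAQ answer


def is_test_mode(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries X-Test-Mode: true.

    HTTP header names are case-insensitive, so names are compared lowercased.
    """
    for name, value in headers.items():
        if name.lower() == TEST_MODE_HEADER.lower():
            return value.strip().lower() == "true"
    return False


def process_refund(order_id: str, headers: Mapping[str, str]) -> dict:
    """Hypothetical refund handler that skips the real side effect in test mode."""
    if is_test_mode(headers):
        # Evaluation traffic: describe what would happen, change nothing.
        return {"order_id": order_id, "status": "simulated"}
    # Real traffic: the call to your payment system would go here.
    return {"order_id": order_id, "status": "refunded"}
```

The same check can guard any action with real consequences, such as sending email or writing to a CRM. If your agent cannot be changed this way, a staging endpoint remains the safer option.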

What industries do you support?

We have scenarios for 15 industries: E-commerce, Customer Support, SaaS, GTM/Sales, RevOps, HR/Recruiting, Insurance, Marketing, Field Service/Construction, Logistics, Real Estate, Government/Public Safety, Education, Hospitality, and Finance/Accounting. We also generate custom scenarios based on your agent's specific capabilities — if your industry isn't listed, we'll create scenarios for it.

Is my agent's data secure?

Yes. Skill files and simulation data are encrypted at rest and in transit. Each account's data is isolated. We do not train on your agent's responses or skill files. Simulation transcripts are stored for your review and can be deleted at any time. We are SOC 2 Type II compliant.

What is mock infrastructure?

Agents built in Claude Code or similar environments may need access to file systems, persistent memory, CRM data, or API endpoints to demonstrate their full capabilities. Mock infrastructure simulates these systems during evaluation. We provide pre-built mock data packs for Salesforce, HubSpot, Shopify, and other platforms, complete with realistic customer records, open invoices, and support tickets.

What is AI agent evaluation?

AI agent evaluation is the process of systematically testing an AI agent against realistic scenarios before deploying it to production. It's like QA testing for traditional software, but adapted for conversational AI. Instead of unit tests, we simulate real conversations — angry customers, compliance traps, edge cases, adversarial prompts — and score every response on accuracy, tone, policy compliance, and task completion.

How do you test a Claude Code agent?

Upload your agent's skill files (.md files), CLAUDE.md configuration, and any supporting files. We analyze them to extract capabilities, persona rules, tool definitions, and workflow steps. Then we run scenarios that test those specific capabilities. We simulate the Claude Code environment — file system access, persistent memory, tool calls — so your agent can demonstrate its full workflow without needing an API endpoint.

How do you test a Custom GPT?

You can either upload your Custom GPT's system prompt and configuration as text files, or connect via the OpenAI API endpoint. We detect the response format automatically and run scenarios targeting the capabilities described in your system prompt. Custom GPTs with Actions (API calls) are tested with mock API responses so the agent can demonstrate its full workflow.

What's the difference between red-teaming and agent evaluation?

Red-teaming focuses on finding security vulnerabilities — prompt injection, jailbreaks, data leakage. Agent evaluation is broader: it tests whether your agent actually does its job correctly in realistic scenarios. We include adversarial scenarios (our version of red-teaming) but also test normal workflows, edge cases, compliance requirements, and multi-step processes. Think of red-teaming as testing the locks on the doors; agent evaluation tests whether the entire house works.

How is the readiness score calculated?

The readiness score (0-100) is a weighted average of four dimensions:

Conversation Quality (40%): does the agent communicate clearly, empathetically, and professionally?
Tool Usage Accuracy (20%): does it use its tools correctly and efficiently?
Output Quality (20%): are the outputs accurate, complete, and well-formatted?
Diagnostic Accuracy (20%): does it identify the right problem and recommend the right solution?
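To make the weighting concrete, here is a minimal sketch of the arithmetic. The four weights come from the answer above. The dictionary keys and the assumption that each dimension is itself scored on a 0-100 scale are illustrative choices, not a description of the actual scoring engine.

```python
# Weights as stated in the FAQ answer; they sum to 1.0.
WEIGHTS = {
    "conversation_quality": 0.40,
    "tool_usage_accuracy": 0.20,
    "output_quality": 0.20,
    "diagnostic_accuracy": 0.20,
}


def readiness_score(dimension_scores: dict) -> float:
    """Weighted average of the four dimension scores.

    Assumes each dimension score is on a 0-100 scale (an assumption for
    this sketch), so the result is also on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
```

For example, dimension scores of 80, 70, 90, and 60 (in the order listed) give 0.40 × 80 + 0.20 × 70 + 0.20 × 90 + 0.20 × 60 = 76. Because Conversation Quality carries twice the weight of any other dimension, a 10-point change there moves the overall score by 4 points, versus 2 points for the others.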

What are training assets and how do I use them?

Training assets are structured files generated from your evaluation results. They include updated skill files with corrected guardrails, routing rules that prevent the failures we found, I/O schemas defining expected inputs and outputs, and example conversation pairs showing the correct response for every failed scenario. Download as a ZIP, add the files to your agent's configuration directory, and re-evaluate to verify the fixes worked.
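The "download, add, re-evaluate" step can be scripted. The sketch below unpacks a downloaded ZIP into a configuration directory using Python's standard `zipfile` module. The function name and any paths you pass in are hypothetical, since the actual layout of the archive and of your agent's directory will vary.

```python
import zipfile
from pathlib import Path
from typing import List


def install_training_assets(zip_path: str, config_dir: str) -> List[str]:
    """Extract a training-assets ZIP into the agent's configuration directory.

    Returns the sorted list of archive member names so they can be reviewed.
    Existing files with the same names are overwritten, so back up or
    version-control the directory first.
    """
    target = Path(config_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

Reviewing the returned file list before re-evaluating is a reasonable habit, because the extracted skill files and routing rules replace whatever was previously in that directory.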

Can I test my agent without giving you API access?

Yes. Upload your agent's skill files or system prompt directly. We'll simulate the agent locally using our evaluation engine and run scenarios against the uploaded configuration. This is actually the most common way our customers use Agent Scrimmage — especially Claude Code and Custom GPT builders who have skill files but no separate API endpoint.

Do you store my agent's responses?

Simulation transcripts (the conversation between the simulated user and your agent) are stored for your review in your account dashboard. You can view, export, and delete them at any time. We do not use your agent's responses for training our own models. Skill files and configuration data are encrypted and isolated per account.