Evaluations
Evaluations use an AI judge to score your agent’s responses across multiple quality criteria. Instead of manually reviewing every conversation, set up automated evaluations that measure relevance, accuracy, helpfulness, and clarity, or define your own custom criteria.
How It Works
- You provide an input (user query) and output (agent response).
- A judge model evaluates the output against scoring criteria.
- The judge returns scores for each criterion plus an overall score.
- Results are stored and aggregated for trend analysis.
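The steps above can be sketched in a few lines. This is only an illustration: the `evaluate` function, the `stub_judge` stand-in, the 1 to 5 scale, and the use of a plain mean for the overall score are all assumptions, not the platform's actual implementation.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Hypothetical judge signature: (user input, agent output, criterion) -> score.
Judge = Callable[[str, str, str], float]


def evaluate(user_input: str, agent_output: str,
             criteria: Sequence[str], judge: Judge) -> Dict:
    """Score one input/output pair on each criterion and add an overall score."""
    scores = {name: judge(user_input, agent_output, name) for name in criteria}
    # Assumption: the overall score is the plain mean of the criterion scores.
    return {"scores": scores, "overall": mean(list(scores.values()))}


def stub_judge(user_input: str, agent_output: str, criterion: str) -> float:
    # Toy rule for illustration only; a real judge is a language model.
    # Assumed scale: 1 (worst) to 5 (best).
    return 5.0 if agent_output.strip() else 1.0


result = evaluate(
    "How do I reset my password?",
    "Open Settings, then choose Reset password.",
    ["relevance", "accuracy", "helpfulness", "clarity"],
    stub_judge,
)
```

Swapping `stub_judge` for a function that calls a real judge model would leave the rest of the flow unchanged.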
Default Criteria
| Criterion | Description |
|---|---|
| Relevance | How relevant is the response to the user’s query? |
| Accuracy | Is the information provided factually correct? |
| Helpfulness | How helpful and actionable is the response? |
| Clarity | Is the response clear and well-structured? |
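One way to picture criteria is as name/description pairs, with custom criteria layered on top of the defaults in the table above. The `build_criteria` helper and the `tone` criterion are hypothetical, and whether custom criteria extend or replace the defaults on the platform is an assumption here.

```python
# Default criteria, taken from the table above.
DEFAULT_CRITERIA = {
    "relevance": "How relevant is the response to the user's query?",
    "accuracy": "Is the information provided factually correct?",
    "helpfulness": "How helpful and actionable is the response?",
    "clarity": "Is the response clear and well-structured?",
}


def build_criteria(custom=None):
    """Return a copy of the defaults, extended or overridden by custom criteria."""
    criteria = dict(DEFAULT_CRITERIA)  # copy so the defaults stay untouched
    criteria.update(custom or {})
    return criteria


# Hypothetical custom criterion for a support agent.
criteria = build_criteria({"tone": "Is the response polite and on-brand?"})
```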
Running Evaluations
From the Dashboard
- Go to your agent’s Evaluations tab.
- Click Run Evaluation.
- Enter or select a conversation to evaluate.
- Choose the judge model (default: GPT-4o-mini).
- Optionally customize scoring criteria.
- Click Evaluate.
Single Evaluation via API
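As a rough sketch, a single evaluation request might carry the input, the output, the judge model, and optional criteria. The field names (`input`, `output`, `judge_model`, `criteria`) and the `build_evaluation_request` helper are assumptions for illustration; check the API reference for the real schema and endpoint.

```python
import json


def build_evaluation_request(user_input, agent_output,
                             judge_model="openai/gpt-4o-mini", criteria=None):
    """Assemble a JSON body for one evaluation (hypothetical field names)."""
    if not user_input or not agent_output:
        raise ValueError("both input and output are required")
    body = {
        "input": user_input,
        "output": agent_output,
        "judge_model": judge_model,  # default judge per the dashboard steps
    }
    if criteria:
        body["criteria"] = criteria  # omit to fall back to default criteria
    return json.dumps(body)


payload = build_evaluation_request(
    "What are your opening hours?",
    "We are open 9am to 5pm, Monday to Friday.",
)
```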
Response
Batch Evaluation
Evaluate multiple input/output pairs at once.
Batch Response
Evaluation Statistics
Get aggregate stats for an agent’s evaluations over time.
Listing Evaluations
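Given a list of stored evaluation records, aggregate statistics such as per-criterion averages can be pictured as follows. The record shape (a `scores` mapping per evaluation) and the `criterion_averages` helper are assumptions, not the platform's actual response format.

```python
from collections import defaultdict


def criterion_averages(evaluations):
    """Average each criterion's score across a list of evaluation records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in evaluations:
        for name, score in record["scores"].items():
            totals[name] += score
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Hypothetical records, as a listing call might return them.
evaluations = [
    {"scores": {"relevance": 4, "clarity": 5}},
    {"scores": {"relevance": 2, "clarity": 4}},
]
averages = criterion_averages(evaluations)
```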
Custom Criteria
Define custom criteria tailored to your use case.
Judge Models
| Model | Speed | Quality | Cost |
|---|---|---|---|
| openai/gpt-4o-mini | Fast | Good | Low |
| openai/gpt-4o | Medium | Excellent | Medium |
| anthropic/claude-sonnet | Medium | Excellent | Medium |
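The trade-offs in the table can be encoded as a small lookup when choosing a judge programmatically. The model identifiers and ratings come from the table; the `JUDGE_MODELS` structure and the `judges_with_quality` helper are hypothetical.

```python
# Ratings copied from the judge model table.
JUDGE_MODELS = {
    "openai/gpt-4o-mini": {"speed": "fast", "quality": "good", "cost": "low"},
    "openai/gpt-4o": {"speed": "medium", "quality": "excellent", "cost": "medium"},
    "anthropic/claude-sonnet": {"speed": "medium", "quality": "excellent", "cost": "medium"},
}


def judges_with_quality(quality):
    """Return judge model identifiers with the given quality rating, sorted by name."""
    return sorted(
        name for name, traits in JUDGE_MODELS.items()
        if traits["quality"] == quality
    )


excellent = judges_with_quality("excellent")
```

A possible pattern is to run routine batches on the low-cost judge and reserve the higher-quality judges for spot checks.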
Billing
Evaluations consume tokens from the judge model. Token usage is tracked per evaluation and billed to your account. Check the evaluation response for input_tokens and output_tokens to understand costs.
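A cost estimate can be derived from those two token counts. The sketch below assumes per-million-token pricing; the prices in the example are placeholders, not actual rates for any judge model.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one evaluation from its token counts."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Placeholder prices for illustration only.
cost = estimate_cost(
    input_tokens=1200,
    output_tokens=300,
    input_price_per_million=0.50,
    output_price_per_million=2.00,
)
```

Summing this value over a batch gives a rough figure for what a weekly evaluation run would cost on a given judge.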
Best Practices
- Evaluate regularly — Run batch evaluations weekly to track quality trends.
- Use custom criteria — Default criteria are a good start, but custom criteria aligned with your business goals are more actionable.
- Compare models — Run the same evaluations with different agent models to find the best fit.
- Act on low scores — If a criterion consistently scores below 3, update the agent’s instructions to address it.
- Combine with learnings — When evaluations reveal issues, capture the corrections as learnings so the agent improves.
- Use batch mode — More efficient and cost-effective than running evaluations one at a time.
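To act on low scores, it helps to separate a one-off dip from a consistent problem. In this sketch, "consistently" is interpreted as scoring below the threshold in each of the last few evaluations; the window size, the record shape, and the `consistently_low` helper are assumptions.

```python
def consistently_low(evaluations, threshold=3, window=3):
    """Return criteria that scored below `threshold` in each of the last `window` evaluations."""
    recent = evaluations[-window:]
    if len(recent) < window:
        return []  # not enough history to call anything consistent
    names = set(recent[0]["scores"])
    return sorted(
        name for name in names
        # A criterion missing from a record is treated as not low.
        if all(rec["scores"].get(name, threshold) < threshold for rec in recent)
    )


# Hypothetical history, oldest first.
history = [
    {"scores": {"accuracy": 2, "clarity": 4}},
    {"scores": {"accuracy": 2.5, "clarity": 2}},
    {"scores": {"accuracy": 1, "clarity": 5}},
]
flagged = consistently_low(history)
```

Here only `accuracy` is flagged, because `clarity` dipped once but recovered.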

