Evaluations
Measure and improve your agent's quality with Agent-as-Judge evaluations — automated scoring across relevance, accuracy, helpfulness, and custom criteria.
Evaluations use an AI judge to score your agent's responses across multiple quality criteria. Instead of manually reviewing every conversation, set up automated evaluations that measure relevance, accuracy, helpfulness, and clarity — or define your own custom criteria.
How It Works
```
Agent Response          Judge Agent           Scores
"Our return         →   (GPT-4o-mini)     →   Relevance:   4/5
 policy is..."                                Accuracy:    5/5
                                              Helpfulness: 4/5
                                              Clarity:     5/5
                                              Overall:     4.5/5
```

- You provide an input (user query) and output (agent response).
- A judge model evaluates the output against scoring criteria.
- The judge returns scores for each criterion plus an overall score.
- Results are stored and aggregated for trend analysis.
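The aggregation step can be sketched as a small helper that averages the per-criterion 1-5 scores into the overall score (illustrative only; the actual computation happens server-side):

```python
def overall_score(scores: dict[str, int]) -> float:
    """Average per-criterion 1-5 scores into an overall score.

    Mirrors the aggregation shown in the API responses below;
    the real computation is done by the evaluation service.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    return round(sum(scores.values()) / len(scores), 2)
```

For example, scores of 4, 5, 4, and 5 across the four default criteria average to the 4.5 overall rating shown in the diagram above.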
Default Criteria
| Criterion | Description |
|---|---|
| Relevance | How relevant is the response to the user's query? |
| Accuracy | Is the information provided factually correct? |
| Helpfulness | How helpful and actionable is the response? |
| Clarity | Is the response clear and well-structured? |
You can override these with custom criteria.
Running Evaluations
From the Dashboard
- Go to your agent's Evaluations tab.
- Click Run Evaluation.
- Enter or select a conversation to evaluate.
- Choose the judge model (default: GPT-4o-mini).
- Optionally customize scoring criteria.
- Click Evaluate.
The results appear with scores for each criterion and an overall rating.
Single Evaluation via API
```bash
curl -X POST https://api.thinnest.ai/evaluations/agent_abc123/run \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_text": "What is your return policy?",
    "output_text": "Our return policy allows you to return any item within 30 days of purchase for a full refund. Items must be in original condition with tags attached. Contact support@example.com to initiate a return.",
    "judge_model": "openai/gpt-4o-mini",
    "criteria": {
      "relevance": "How relevant is the response to the return policy question?",
      "accuracy": "Is the return policy information correct?",
      "helpfulness": "Does the response include actionable next steps?",
      "tone": "Is the tone professional and friendly?"
    }
  }'
```

Response
```json
{
  "id": 15,
  "agent_id": 42,
  "scores": {
    "relevance": 5,
    "accuracy": 5,
    "helpfulness": 5,
    "tone": 4
  },
  "overall_score": 4.75,
  "reasoning": "The response directly addresses the return policy question with specific details (30-day window, condition requirements). It includes an actionable next step (email address). The tone is professional but could be slightly warmer.",
  "judge_model": "openai/gpt-4o-mini",
  "input_tokens": 245,
  "output_tokens": 180,
  "created_at": "2026-03-06T11:00:00Z"
}
```

Batch Evaluation
Evaluate multiple input/output pairs at once:
```bash
curl -X POST https://api.thinnest.ai/evaluations/evaluate/batch \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "agent_abc123",
    "pairs": [
      {
        "input": "What are your hours?",
        "output": "We are open Monday to Friday, 9 AM to 5 PM EST."
      },
      {
        "input": "How do I cancel my subscription?",
        "output": "To cancel, go to Settings > Billing > Cancel Subscription."
      },
      {
        "input": "Do you offer discounts?",
        "output": "Yes! We offer 20% off for annual subscriptions and student discounts."
      }
    ],
    "judge_model": "openai/gpt-4o-mini"
  }'
```

Batch Response
```json
{
  "results": [
    { "input": "What are your hours?", "overall_score": 4.5, "scores": {...} },
    { "input": "How do I cancel my subscription?", "overall_score": 5.0, "scores": {...} },
    { "input": "Do you offer discounts?", "overall_score": 4.0, "scores": {...} }
  ],
  "summary": {
    "average_score": 4.5,
    "total_evaluated": 3,
    "total_tokens": 890
  }
}
```

Evaluation Statistics
Get aggregate stats for an agent's evaluations over time:
```bash
curl "https://api.thinnest.ai/evaluations/agent_abc123/stats" \
  -H "Authorization: Bearer $THINNESTAI_API_KEY"
```

```json
{
  "total_evaluations": 156,
  "average_score": 4.2,
  "score_distribution": {
    "5": 45,
    "4": 72,
    "3": 28,
    "2": 8,
    "1": 3
  },
  "by_criteria": {
    "relevance": 4.4,
    "accuracy": 4.1,
    "helpfulness": 4.0,
    "clarity": 4.3
  },
  "trend": "improving"
}
```

Listing Evaluations
```bash
curl "https://api.thinnest.ai/evaluations/agent_abc123/list?limit=20&offset=0" \
  -H "Authorization: Bearer $THINNESTAI_API_KEY"
```

Custom Criteria
Define custom criteria tailored to your use case:
```json
{
  "criteria": {
    "compliance": "Does the response follow regulatory guidelines?",
    "empathy": "Does the agent show understanding of the customer's situation?",
    "upsell_opportunity": "Does the agent naturally suggest relevant products?",
    "brand_voice": "Does the response match our casual, friendly brand voice?"
  }
}
```

Each criterion is scored on a 1-5 scale by the judge model.
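If you build criteria programmatically, a small client-side check can catch malformed definitions before they reach the judge. This is a hypothetical helper, not part of the API; the question-phrasing check is a convention taken from the examples above, not an API requirement:

```python
def build_criteria_payload(criteria: dict[str, str]) -> dict:
    """Validate custom criteria and wrap them for an evaluation request.

    Hypothetical client-side helper: the API itself only needs the
    "criteria" object shown above.
    """
    for name, prompt in criteria.items():
        if not name.strip():
            raise ValueError("criterion names must be non-empty")
        # Convention from the documented examples: phrase each criterion
        # as a question the judge can answer on a 1-5 scale.
        if not prompt.rstrip().endswith("?"):
            raise ValueError(f"criterion '{name}' should be phrased as a question")
    return {"criteria": criteria}
```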
Judge Models
| Model | Speed | Quality | Cost |
|---|---|---|---|
| openai/gpt-4o-mini | Fast | Good | Low |
| openai/gpt-4o | Medium | Excellent | Medium |
| anthropic/claude-sonnet | Medium | Excellent | Medium |
The default judge model is GPT-4o-mini, which provides good quality at low cost. For critical evaluations, use GPT-4o or Claude Sonnet.
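That recommendation can be encoded as a tiny lookup when you select the judge programmatically (a sketch; the tiers mirror the table above, and the "critical" flag is our own naming, not an API parameter):

```python
def pick_judge_model(critical: bool = False) -> str:
    """Return a judge model id: cheap default, stronger model for critical runs."""
    return "openai/gpt-4o" if critical else "openai/gpt-4o-mini"
```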
Billing
Evaluations consume tokens from the judge model. Token usage is tracked per evaluation and billed to your account. Check the evaluation response for input_tokens and output_tokens to understand costs.
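As a rough illustration, you can estimate judge-model spend from those token counts. The per-million-token rates below are placeholders, not thinnest.ai pricing; substitute the actual rates from your account:

```python
# Placeholder rates in USD per million tokens -- substitute your actual pricing.
RATES = {
    "openai/gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the judge-model cost of one evaluation in USD."""
    rate = RATES[model]
    cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
    return round(cost, 6)
```

With the single-evaluation response above (245 input tokens, 180 output tokens), this works out to a fraction of a cent per evaluation at these rates.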
Best Practices
- Evaluate regularly — Run batch evaluations weekly to track quality trends.
- Use custom criteria — Default criteria are a good start, but custom criteria aligned with your business goals are more actionable.
- Compare models — Run the same evaluations with different agent models to find the best fit.
- Act on low scores — If a criterion consistently scores below 3, update the agent's instructions to address it.
- Combine with learnings — When evaluations reveal issues, capture the corrections as learnings so the agent improves.
- Use batch mode — More efficient and cost-effective than running evaluations one at a time.
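The "evaluate regularly" and "use batch mode" practices combine naturally into a small script that posts the week's conversations as one batch request. This is a sketch using only the documented batch endpoint; how you collect the input/output pairs from your own conversation store is left out:

```python
import json
import urllib.request

BATCH_URL = "https://api.thinnest.ai/evaluations/evaluate/batch"

def build_batch_payload(agent_id: str, pairs: list[dict],
                        judge_model: str = "openai/gpt-4o-mini") -> dict:
    """Assemble the batch request body from {"input", "output"} pairs."""
    return {"agent_id": agent_id, "pairs": pairs, "judge_model": judge_model}

def run_weekly_batch(api_key: str, agent_id: str, pairs: list[dict]) -> dict:
    """POST one batch evaluation covering the week's conversations."""
    req = urllib.request.Request(
        BATCH_URL,
        data=json.dumps(build_batch_payload(agent_id, pairs)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Schedule it with cron or your job runner of choice, then track the summary's average_score week over week.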