Evaluations

Evaluations use an AI judge to score your agent’s responses across multiple quality criteria. Instead of manually reviewing every conversation, you can set up automated evaluations that measure relevance, accuracy, helpfulness, and clarity, or define your own custom criteria.

How It Works

Agent Response    →    Judge Agent      →    Scores
"Our return            (GPT-4o-mini)         Relevance: 4/5
 policy is..."                               Accuracy: 5/5
                                             Helpfulness: 4/5
                                             Clarity: 5/5
                                             Overall: 4.5/5

1. You provide an input (user query) and output (agent response).
2. A judge model evaluates the output against scoring criteria.
3. The judge returns scores for each criterion plus an overall score.
4. Results are stored and aggregated for trend analysis.
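
The scoring step above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the `overall_score` helper is hypothetical, and it assumes the overall score is the unweighted mean of the per-criterion scores. That assumption matches the sample numbers on this page, but the API does not state it.

```python
from statistics import mean


def overall_score(scores):
    """Average per-criterion scores (1-5) into a single overall score.

    Assumption: the overall score is an unweighted mean. The samples in
    this document are consistent with that, but the judge may differ.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"score for {name!r} is outside the 1-5 range: {value}")
    return round(mean(list(scores.values())), 2)


example = {"relevance": 4, "accuracy": 5, "helpfulness": 4, "clarity": 5}
print(overall_score(example))  # 4.5
```

If your own averages do not match the `overall_score` the API returns, treat the API value as authoritative.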

Default Criteria

Criterion	Description
Relevance	How relevant is the response to the user’s query?
Accuracy	Is the information provided factually correct?
Helpfulness	How helpful and actionable is the response?
Clarity	Is the response clear and well-structured?

You can override these with custom criteria.

Running Evaluations

From the Dashboard

1. Go to your agent’s Evaluations tab.
2. Click Run Evaluation.
3. Enter or select a conversation to evaluate.
4. Choose the judge model (default: GPT-4o-mini).
5. Optionally customize scoring criteria.
6. Click Evaluate.

The results appear with scores for each criterion and an overall rating.

Single Evaluation via API

curl -X POST https://api.thinnest.ai/evaluations/agent_abc123/run \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_text": "What is your return policy?",
    "output_text": "Our return policy allows you to return any item within 30 days of purchase for a full refund. Items must be in original condition with tags attached. Contact support@example.com to initiate a return.",
    "judge_model": "openai/gpt-4o-mini",
    "criteria": {
      "relevance": "How relevant is the response to the return policy question?",
      "accuracy": "Is the return policy information correct?",
      "helpfulness": "Does the response include actionable next steps?",
      "tone": "Is the tone professional and friendly?"
    }
  }'
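
If you call the endpoint from Python rather than curl, the same request can be assembled with the standard library. The sketch below only builds the request object and does not send it. The helper name `build_evaluation_request` is hypothetical, and the behaviour of omitting `criteria` (falling back to the default criteria) is an assumption based on the batch example, not a documented guarantee.

```python
import json
import os
import urllib.request

API_BASE = "https://api.thinnest.ai"  # base URL taken from the curl example


def build_evaluation_request(agent_id, input_text, output_text,
                             judge_model="openai/gpt-4o-mini",
                             criteria=None, api_key=None):
    """Build (but do not send) the POST request for a single evaluation."""
    payload = {
        "input_text": input_text,
        "output_text": output_text,
        "judge_model": judge_model,
    }
    if criteria:
        # Assumption: leaving this field out makes the judge use the defaults.
        payload["criteria"] = criteria
    key = api_key or os.environ.get("THINNESTAI_API_KEY", "")
    return urllib.request.Request(
        url=f"{API_BASE}/evaluations/{agent_id}/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_evaluation_request(
    "agent_abc123",
    "What is your return policy?",
    "Our return policy allows returns within 30 days of purchase.",
    criteria={"tone": "Is the tone professional and friendly?"},
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen` and decode the JSON body of the response.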

Response

{
  "id": 15,
  "agent_id": 42,
  "scores": {
    "relevance": 5,
    "accuracy": 5,
    "helpfulness": 5,
    "tone": 4
  },
  "overall_score": 4.75,
  "reasoning": "The response directly addresses the return policy question with specific details (30-day window, condition requirements). It includes an actionable next step (email address). The tone is professional but could be slightly warmer.",
  "judge_model": "openai/gpt-4o-mini",
  "input_tokens": 245,
  "output_tokens": 180,
  "created_at": "2026-03-06T11:00:00Z"
}

Batch Evaluation

Evaluate multiple input/output pairs at once:

curl -X POST https://api.thinnest.ai/evaluations/evaluate/batch \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "agent_abc123",
    "pairs": [
      {
        "input": "What are your hours?",
        "output": "We are open Monday to Friday, 9 AM to 5 PM EST."
      },
      {
        "input": "How do I cancel my subscription?",
        "output": "To cancel, go to Settings > Billing > Cancel Subscription."
      },
      {
        "input": "Do you offer discounts?",
        "output": "Yes! We offer 20% off for annual subscriptions and student discounts."
      }
    ],
    "judge_model": "openai/gpt-4o-mini"
  }'

Batch Response

{
  "results": [
    { "input": "What are your hours?", "overall_score": 4.5, "scores": {...} },
    { "input": "How do I cancel my subscription?", "overall_score": 5.0, "scores": {...} },
    { "input": "Do you offer discounts?", "overall_score": 4.0, "scores": {...} }
  ],
  "summary": {
    "average_score": 4.5,
    "total_evaluated": 3,
    "total_tokens": 890
  }
}
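
Once you have a batch response, a small helper can pick out the pairs that deserve a manual look. The sketch below is illustrative: `summarize_batch` and its `threshold` parameter are hypothetical names, and it relies only on the `input` and `overall_score` fields shown in the sample response.

```python
def summarize_batch(results, threshold=4.0):
    """Return the average overall score and the inputs scoring below threshold."""
    if not results:
        return {"average_score": None, "total_evaluated": 0, "needs_review": []}
    scores = [r["overall_score"] for r in results]
    return {
        "average_score": round(sum(scores) / len(scores), 2),
        "total_evaluated": len(results),
        # Inputs whose overall score falls strictly below the threshold.
        "needs_review": [r["input"] for r in results
                         if r["overall_score"] < threshold],
    }


results = [
    {"input": "What are your hours?", "overall_score": 4.5},
    {"input": "How do I cancel my subscription?", "overall_score": 5.0},
    {"input": "Do you offer discounts?", "overall_score": 4.0},
]
print(summarize_batch(results, threshold=4.5))
```

With the sample scores, the average comes out to 4.5, the same value as `summary.average_score`, and only the discounts question falls below the 4.5 threshold.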

Evaluation Statistics

Get aggregate stats for an agent’s evaluations over time:

curl "https://api.thinnest.ai/evaluations/agent_abc123/stats" \
  -H "Authorization: Bearer $THINNESTAI_API_KEY"

{
  "total_evaluations": 156,
  "average_score": 4.2,
  "score_distribution": {
    "5": 45,
    "4": 72,
    "3": 28,
    "2": 8,
    "1": 3
  },
  "by_criteria": {
    "relevance": 4.4,
    "accuracy": 4.1,
    "helpfulness": 4.0,
    "clarity": 4.3
  },
  "trend": "improving"
}
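
The stats payload is easy to mine for a quick health check. The following sketch assumes the response shape shown above. The helper names `weakest_criterion` and `low_score_share` are hypothetical, as is the choice of treating scores of 2 or lower as "low".

```python
def weakest_criterion(by_criteria):
    """Return the (name, average) pair with the lowest average score."""
    return min(by_criteria.items(), key=lambda item: item[1])


def low_score_share(score_distribution, cutoff=2):
    """Fraction of evaluations whose score bucket is at or below cutoff."""
    total = sum(score_distribution.values())
    low = sum(count for score, count in score_distribution.items()
              if int(score) <= cutoff)  # bucket keys are strings in the JSON
    return low / total if total else 0.0


stats = {
    "score_distribution": {"5": 45, "4": 72, "3": 28, "2": 8, "1": 3},
    "by_criteria": {"relevance": 4.4, "accuracy": 4.1,
                    "helpfulness": 4.0, "clarity": 4.3},
}
print(weakest_criterion(stats["by_criteria"]))
print(round(low_score_share(stats["score_distribution"]), 3))
```

For the sample data, helpfulness is the weakest criterion and 11 of the 156 evaluations land in the two lowest buckets.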

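Listing Evaluations

Retrieve an agent’s stored evaluations, using the limit and offset query parameters to page through them:

curl "https://api.thinnest.ai/evaluations/agent_abc123/list?limit=20&offset=0" \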
  -H "Authorization: Bearer $THINNESTAI_API_KEY"
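
To walk through every page, you can generate the list URLs up front. This is a sketch under assumptions: `list_page_urls` is a hypothetical helper, and it assumes `offset` is a zero-based count of records to skip, which is the usual convention but is not confirmed on this page. The total can come from `total_evaluations` in the stats response.

```python
from urllib.parse import urlencode

API_BASE = "https://api.thinnest.ai"  # base URL taken from the curl example


def list_page_urls(agent_id, total, limit=20):
    """Yield one list URL per page needed to cover `total` evaluations."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for offset in range(0, total, limit):
        query = urlencode({"limit": limit, "offset": offset})
        yield f"{API_BASE}/evaluations/{agent_id}/list?{query}"


urls = list(list_page_urls("agent_abc123", total=156))
print(len(urls))   # 8 pages of up to 20 evaluations each
print(urls[-1])
```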

Custom Criteria

Define custom criteria tailored to your use case:

{
  "criteria": {
    "compliance": "Does the response follow regulatory guidelines?",
    "empathy": "Does the agent show understanding of the customer's situation?",
    "upsell_opportunity": "Does the agent naturally suggest relevant products?",
    "brand_voice": "Does the response match our casual, friendly brand voice?"
  }
}

Each criterion is scored on a 1-5 scale by the judge model.
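
In the single-evaluation sample, the response contains only the criteria that were sent, which suggests that custom criteria replace the defaults rather than extend them. If you want to keep some defaults alongside your own, you can merge them client-side before sending. The sketch below is hypothetical: `merge_criteria` is not part of the API, and the snake_case naming check is an assumption drawn from the examples on this page.

```python
import re

# Default criteria as described in the Default Criteria table.
DEFAULT_CRITERIA = {
    "relevance": "How relevant is the response to the user's query?",
    "accuracy": "Is the information provided factually correct?",
    "helpfulness": "How helpful and actionable is the response?",
    "clarity": "Is the response clear and well-structured?",
}

_NAME = re.compile(r"^[a-z][a-z0-9_]*$")  # assumed naming convention


def merge_criteria(custom, drop=()):
    """Combine the defaults (minus `drop`) with validated custom criteria."""
    merged = {k: v for k, v in DEFAULT_CRITERIA.items() if k not in drop}
    for name, question in custom.items():
        if not _NAME.match(name):
            raise ValueError(f"criterion name should be snake_case: {name!r}")
        if not isinstance(question, str) or not question.strip():
            raise ValueError(f"criterion {name!r} needs a non-empty question")
        merged[name] = question.strip()
    return merged


criteria = merge_criteria(
    {"empathy": "Does the agent show understanding of the customer's situation?"},
    drop=("clarity",),
)
print(list(criteria))
```

Phrasing each criterion as a specific question, as in the examples above, gives the judge a clearer target than a single word.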

Judge Models

Model	Speed	Quality	Cost
`openai/gpt-4o-mini`	Fast	Good	Low
`openai/gpt-4o`	Medium	Excellent	Medium
`anthropic/claude-sonnet`	Medium	Excellent	Medium

The default judge model is GPT-4o-mini, which provides good quality at low cost. For critical evaluations, use GPT-4o or Claude Sonnet.

Billing

Evaluations consume tokens from the judge model. Token usage is tracked per evaluation and billed to your account. Check the evaluation response for input_tokens and output_tokens to understand costs.
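
A rough cost estimate can be derived from those two fields. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and the per-million-token prices are placeholders, not actual rates. Substitute the rates that apply to your account and judge model.

```python
def estimate_cost(evaluations, input_price_per_million, output_price_per_million):
    """Sum token usage across evaluations and apply per-million-token prices."""
    input_tokens = sum(e["input_tokens"] for e in evaluations)
    output_tokens = sum(e["output_tokens"] for e in evaluations)
    cost = (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "estimated_cost": cost,
    }


# Token counts from the sample response; prices are placeholders.
usage = estimate_cost(
    [{"input_tokens": 245, "output_tokens": 180}],
    input_price_per_million=1.0,
    output_price_per_million=2.0,
)
print(usage)
```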

Best Practices

Evaluate regularly — Run batch evaluations weekly to track quality trends.
Use custom criteria — Default criteria are a good start, but custom criteria aligned with your business goals are more actionable.
Compare models — Run the same evaluations with different agent models to find the best fit.
Act on low scores — If a criterion consistently scores below 3, update the agent’s instructions to address it.
Combine with learnings — When evaluations reveal issues, capture the corrections as learnings so the agent improves.
Use batch mode — More efficient and cost-effective than running evaluations one at a time.
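
The "act on low scores" practice can be automated with a short script. This sketch interprets "consistently scores below 3" as an average below 3 across a set of evaluations, which is one reasonable reading rather than a platform rule. `criteria_needing_attention` is a hypothetical helper that expects each evaluation to carry a `scores` object like the one in the API response.

```python
from collections import defaultdict


def criteria_needing_attention(evaluations, threshold=3.0):
    """Return criteria whose average score across evaluations is below threshold."""
    collected = defaultdict(list)
    for evaluation in evaluations:
        for name, score in evaluation["scores"].items():
            collected[name].append(score)
    averages = {name: sum(values) / len(values)
                for name, values in collected.items()}
    return sorted(name for name, avg in averages.items() if avg < threshold)


history = [
    {"scores": {"relevance": 4, "helpfulness": 2}},
    {"scores": {"relevance": 5, "helpfulness": 3}},
    {"scores": {"relevance": 4, "helpfulness": 2}},
]
print(criteria_needing_attention(history))  # ['helpfulness']
```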
