Evaluations

Measure and improve your agent's quality with Agent-as-Judge evaluations — automated scoring across relevance, accuracy, helpfulness, and custom criteria.

Evaluations use an AI judge to score your agent's responses across multiple quality criteria. Instead of manually reviewing every conversation, set up automated evaluations that measure relevance, accuracy, helpfulness, and clarity — or define your own custom criteria.

How It Works

Agent Response    →    Judge Agent      →    Scores
                       (GPT-4o-mini)
"Our return                                  Relevance:   4/5
 policy is..."                               Accuracy:    5/5
                                             Helpfulness: 4/5
                                             Clarity:     5/5
                                             Overall:     4.5/5
  1. You provide an input (user query) and output (agent response).
  2. A judge model evaluates the output against scoring criteria.
  3. The judge returns scores for each criterion plus an overall score.
  4. Results are stored and aggregated for trend analysis.
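The scoring flow above can be sketched in a few lines. This is an illustrative client-side sketch, assuming the overall score is the plain mean of the per-criterion scores (which matches the example responses on this page):

```python
from statistics import mean

def aggregate_scores(scores: dict[str, int]) -> float:
    """Combine per-criterion scores (1-5) into an overall score.

    Assumes the overall score is the unweighted mean of the criteria.
    """
    return round(mean(scores.values()), 2)

overall = aggregate_scores(
    {"relevance": 4, "accuracy": 5, "helpfulness": 4, "clarity": 5}
)
# overall == 4.5, matching the diagram above
```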

Default Criteria

Criterion     Description
Relevance     How relevant is the response to the user's query?
Accuracy      Is the information provided factually correct?
Helpfulness   How helpful and actionable is the response?
Clarity       Is the response clear and well-structured?

You can override these with custom criteria.

Running Evaluations

From the Dashboard

  1. Go to your agent's Evaluations tab.
  2. Click Run Evaluation.
  3. Enter or select a conversation to evaluate.
  4. Choose the judge model (default: GPT-4o-mini).
  5. Optionally customize scoring criteria.
  6. Click Evaluate.

The results appear with scores for each criterion and an overall rating.

Single Evaluation via API

curl -X POST https://api.thinnest.ai/evaluations/agent_abc123/run \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_text": "What is your return policy?",
    "output_text": "Our return policy allows you to return any item within 30 days of purchase for a full refund. Items must be in original condition with tags attached. Contact support@example.com to initiate a return.",
    "judge_model": "openai/gpt-4o-mini",
    "criteria": {
      "relevance": "How relevant is the response to the return policy question?",
      "accuracy": "Is the return policy information correct?",
      "helpfulness": "Does the response include actionable next steps?",
      "tone": "Is the tone professional and friendly?"
    }
  }'

Response

{
  "id": 15,
  "agent_id": 42,
  "scores": {
    "relevance": 5,
    "accuracy": 5,
    "helpfulness": 5,
    "tone": 4
  },
  "overall_score": 4.75,
  "reasoning": "The response directly addresses the return policy question with specific details (30-day window, condition requirements). It includes an actionable next step (email address). The tone is professional but could be slightly warmer.",
  "judge_model": "openai/gpt-4o-mini",
  "input_tokens": 245,
  "output_tokens": 180,
  "created_at": "2026-03-06T11:00:00Z"
}

Batch Evaluation

Evaluate multiple input/output pairs at once:

curl -X POST https://api.thinnest.ai/evaluations/evaluate/batch \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "agent_abc123",
    "pairs": [
      {
        "input": "What are your hours?",
        "output": "We are open Monday to Friday, 9 AM to 5 PM EST."
      },
      {
        "input": "How do I cancel my subscription?",
        "output": "To cancel, go to Settings > Billing > Cancel Subscription."
      },
      {
        "input": "Do you offer discounts?",
        "output": "Yes! We offer 20% off for annual subscriptions and student discounts."
      }
    ],
    "judge_model": "openai/gpt-4o-mini"
  }'

Batch Response

{
  "results": [
    { "input": "What are your hours?", "overall_score": 4.5, "scores": {...} },
    { "input": "How do I cancel my subscription?", "overall_score": 5.0, "scores": {...} },
    { "input": "Do you offer discounts?", "overall_score": 4.0, "scores": {...} }
  ],
  "summary": {
    "average_score": 4.5,
    "total_evaluated": 3,
    "total_tokens": 890
  }
}
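The batch summary can be reproduced client-side from the per-pair results. This sketch assumes `average_score` is the mean of the `overall_score` values; `total_tokens` is omitted because it is only known server-side:

```python
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Aggregate batch results into the summary shape shown above."""
    return {
        "average_score": round(mean(r["overall_score"] for r in results), 2),
        "total_evaluated": len(results),
    }

summarize([
    {"overall_score": 4.5},
    {"overall_score": 5.0},
    {"overall_score": 4.0},
])
# → {"average_score": 4.5, "total_evaluated": 3}
```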

Evaluation Statistics

Get aggregate stats for an agent's evaluations over time:

curl "https://api.thinnest.ai/evaluations/agent_abc123/stats" \
  -H "Authorization: Bearer $THINNESTAI_API_KEY"
{
  "total_evaluations": 156,
  "average_score": 4.2,
  "score_distribution": {
    "5": 45,
    "4": 72,
    "3": 28,
    "2": 8,
    "1": 3
  },
  "by_criteria": {
    "relevance": 4.4,
    "accuracy": 4.1,
    "helpfulness": 4.0,
    "clarity": 4.3
  },
  "trend": "improving"
}
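How the `trend` field is derived is not documented here. One plausible client-side approximation, shown purely for illustration, compares the average of the most recent evaluations against the window before it:

```python
from statistics import mean

def score_trend(scores: list[float], window: int = 20, threshold: float = 0.1) -> str:
    """Classify a quality trend from a chronological list of overall scores.

    Compares the latest `window` scores against the preceding window.
    Illustrative only; the API computes its own trend server-side.
    """
    if len(scores) < 2 * window:
        return "insufficient_data"
    recent = scores[-window:]
    previous = scores[-2 * window:-window]
    delta = mean(recent) - mean(previous)
    if delta > threshold:
        return "improving"
    if delta < -threshold:
        return "declining"
    return "stable"
```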

Listing Evaluations

curl "https://api.thinnest.ai/evaluations/agent_abc123/list?limit=20&offset=0" \
  -H "Authorization: Bearer $THINNESTAI_API_KEY"
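The list endpoint is paginated with `limit` and `offset`, so collecting every evaluation means paging until a short page comes back. In this sketch, `fetch_page` is a hypothetical helper wrapping the HTTP call above:

```python
def list_all_evaluations(fetch_page, limit: int = 20) -> list[dict]:
    """Collect all evaluations by paging through limit/offset.

    `fetch_page(limit, offset)` is assumed to return a list of evaluation
    records, shorter than `limit` (or empty) once past the last record.
    """
    results: list[dict] = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        results.extend(page)
        if len(page) < limit:
            return results
        offset += limit
```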

Custom Criteria

Define custom criteria tailored to your use case:

{
  "criteria": {
    "compliance": "Does the response follow regulatory guidelines?",
    "empathy": "Does the agent show understanding of the customer's situation?",
    "upsell_opportunity": "Does the agent naturally suggest relevant products?",
    "brand_voice": "Does the response match our casual, friendly brand voice?"
  }
}

Each criterion is scored on a 1-5 scale by the judge model.

Judge Models

Model                     Speed     Quality     Cost
openai/gpt-4o-mini        Fast      Good        Low
openai/gpt-4o             Medium    Excellent   Medium
anthropic/claude-sonnet   Medium    Excellent   Medium

The default judge model is GPT-4o-mini, which provides good quality at low cost. For critical evaluations, use GPT-4o or Claude Sonnet.

Billing

Evaluations consume tokens from the judge model. Token usage is tracked per evaluation and billed to your account. Check the evaluation response for input_tokens and output_tokens to understand costs.
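A rough cost estimate per evaluation follows directly from those token counts. The per-token rates below are hypothetical placeholders, not published pricing; substitute your judge model's actual rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Estimate the judge-model cost of one evaluation in dollars.

    `input_rate` / `output_rate` are per-token prices for your judge model.
    """
    return input_tokens * input_rate + output_tokens * output_rate

# For the single-evaluation response above (245 input, 180 output tokens),
# with hypothetical rates of $0.15 / $0.60 per million tokens:
cost = estimate_cost(245, 180, 0.15e-6, 0.60e-6)
```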

Best Practices

  • Evaluate regularly — Run batch evaluations weekly to track quality trends.
  • Use custom criteria — Default criteria are a good start, but custom criteria aligned with your business goals are more actionable.
  • Compare models — Run the same evaluations with different agent models to find the best fit.
  • Act on low scores — If a criterion consistently scores below 3, update the agent's instructions to address it.
  • Combine with learnings — When evaluations reveal issues, capture the corrections as learnings so the agent improves.
  • Use batch mode — More efficient and cost-effective than running evaluations one at a time.
