Use this tab to measure routing in practice: the same prompt runs on your chosen baseline model and on a second model in parallel, with LLM-as-judge scoring and an eval history of your last 20 runs per user in the org.

What this tab is for

  • See whether a cheaper routed model stays “good enough” for your prompts.
  • See cost and latency savings when routing applies.
  • Keep a short history of runs (with delete) for demos or regression checks.
The Evaluations tab has two modes selectable at the top:
Mode | Description
--- | ---
Smart Routing | Cloudidr automatically picks the comparison model based on your routing strategy. Requires LLM Optimization to be on.
Manual Model Comparison | You choose both the baseline and comparison models from any provider. Useful for evaluating specific model pairs independently of routing.

LLM Optimization status — required for Smart Routing

A status card shows whether LLM Optimization is on and which strategy applies (e.g. Intra Provider vs Flexible).
Important: If LLM Optimization is off, Run Eval is disabled in Smart Routing mode because there is no routed path to compare. Turn optimization on under LLM Optimizer Settings, or switch to Manual Model Comparison mode.

Configuration

  • Baseline model — same model picker as Try a Model.
  • Comparison model — in Manual mode, you pick this yourself from any provider. In Smart Routing mode, Cloudidr selects it automatically.
  • Judge model — pick who scores the two answers:
    • Cloudidr (free): uses Gemma 3 27B, Qwen 3.5 27B, or a similar model; no API key needed.
    • Your provider key: top-tier options per provider, for example:
      • OpenAI: GPT-5.5 Pro / GPT-5.5 / GPT-5.4
      • Anthropic: Claude Opus 4.6 / Claude Sonnet 4.6
      • Google: Gemini 3.1 Pro / Gemini 2.5 Flash
Judge scores are subjective. The UI notes that accuracy can be unreliable for very recent events because models have knowledge cutoffs.

Run Eval

Runs both models in parallel, then runs the judge. Results appear side by side; savings percentage and verdict show below.
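The flow described above (two model calls in parallel, then a judge call) can be sketched roughly as follows. This is a minimal illustration, not Cloudidr's implementation: `call_model`, `judge`, the model names, the fake costs, and the savings formula (relative cost reduction against the baseline) are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ModelResult:
    model: str
    text: str
    cost_usd: float


# Hypothetical per-call costs, used only to make the sketch runnable.
FAKE_COSTS = {"baseline-model": 0.010, "comparison-model": 0.004}


async def call_model(model: str, prompt: str) -> ModelResult:
    """Hypothetical stand-in for a real provider call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return ModelResult(model, f"[{model}] answer to: {prompt}", FAKE_COSTS.get(model, 0.010))


async def judge(prompt: str, baseline: ModelResult, comparison: ModelResult) -> dict:
    """Hypothetical judge call; returns fixed scores for the sketch."""
    await asyncio.sleep(0)
    return {"baseline_score": 8, "comparison_score": 8}


def savings_pct(baseline_cost: float, comparison_cost: float) -> float:
    """Assumed definition: relative cost reduction versus the baseline."""
    if baseline_cost <= 0:
        return 0.0
    return (baseline_cost - comparison_cost) / baseline_cost * 100


async def run_eval(prompt: str, baseline_model: str, comparison_model: str) -> dict:
    # Both model calls run concurrently; the judge runs only after both finish.
    baseline, comparison = await asyncio.gather(
        call_model(baseline_model, prompt),
        call_model(comparison_model, prompt),
    )
    scores = await judge(prompt, baseline, comparison)
    return {
        "baseline": baseline,
        "comparison": comparison,
        "scores": scores,
        "savings_pct": savings_pct(baseline.cost_usd, comparison.cost_usd),
    }


if __name__ == "__main__":
    print(asyncio.run(run_eval("What is 2+2?", "baseline-model", "comparison-model")))
```

Running the judge after `asyncio.gather` keeps the two answers independent of each other while still letting total wall time track the slower of the two model calls.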

When routing does not apply — Smart Routing mode only

The UI explains two cases where no routing substitute is used:
  1. Recency protection — the prompt matched recency signals; Cloudidr keeps the baseline model. You can turn Recency protection off under Optimizer Settings if you accept routing for those prompts.
  2. Complex / no substitute — the prompt is classified as too complex for routing, or no cheaper mapped model exists. Flexible routing does not override the “complex” classification; simplifying the prompt is the practical path.
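As a rough mental model, the two cases above can be written as a small decision function. The function name, its inputs, and the order of the checks are assumptions for illustration; the source only states which conditions prevent routing, not how they are evaluated internally.

```python
from enum import Enum
from typing import Optional


class RoutingOutcome(Enum):
    ROUTED = "routed"
    RECENCY_PROTECTED = "recency protection"
    TOO_COMPLEX = "too complex to route"
    NO_SUBSTITUTE = "no cheaper mapped model"


def decide_routing(
    matches_recency: bool,
    recency_protection_on: bool,
    is_complex: bool,
    cheaper_model: Optional[str],
) -> RoutingOutcome:
    """Hypothetical sketch of when a routed substitute is used."""
    # Case 1: recency signals keep the baseline, unless protection is turned off.
    if recency_protection_on and matches_recency:
        return RoutingOutcome.RECENCY_PROTECTED
    # Case 2a: a "complex" classification blocks routing. No strategy flag is
    # consulted here, mirroring that Flexible routing does not override it.
    if is_complex:
        return RoutingOutcome.TOO_COMPLEX
    # Case 2b: nothing cheaper is mapped for the baseline model.
    if cheaper_model is None:
        return RoutingOutcome.NO_SUBSTITUTE
    return RoutingOutcome.ROUTED
```

For example, `decide_routing(True, False, False, "cheaper-model")` returns `RoutingOutcome.ROUTED`, reflecting that turning Recency protection off lets such prompts be routed.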

Verdict and scores

  • Verdict — whether the comparison answer is Better, Equivalent, or Worse (derived from the score delta), or No routing / Too complex to route in Smart Routing mode.
  • Score — overall 1–10 score with a criteria breakdown (Accuracy, Completeness, Clarity, Practical usefulness) when the judge returns structured axes.
  • If the judge fails (API error, parse error), an explicit error message is shown instead of silent neutral scores.
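One way to picture a verdict derived from the score delta, with judge failures surfaced as explicit errors, is the sketch below. The JSON field names, the `JudgeError` type, and the 0.5-point tolerance are assumptions for illustration; the actual thresholds and judge output format are not documented here.

```python
import json


class JudgeError(Exception):
    """Raised when the judge output cannot be used (hypothetical type)."""


def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON; fail loudly instead of returning neutral scores."""
    try:
        data = json.loads(raw)
        return {
            "baseline": float(data["baseline_score"]),
            "comparison": float(data["comparison_score"]),
        }
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        raise JudgeError(f"Judge output could not be parsed: {exc}") from exc


def verdict_from_scores(baseline: float, comparison: float, tolerance: float = 0.5) -> str:
    """Map the score delta to a verdict; the tolerance value is an assumption."""
    delta = comparison - baseline
    if delta > tolerance:
        return "Better"
    if delta < -tolerance:
        return "Worse"
    return "Equivalent"


if __name__ == "__main__":
    scores = parse_judge_output('{"baseline_score": 8, "comparison_score": 7.8}')
    print(verdict_from_scores(scores["baseline"], scores["comparison"]))
```

Raising on a parse failure, rather than defaulting to equal scores, avoids a failed judge run being read as an "Equivalent" verdict.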

Eval History

Table of recent runs showing: time, prompt snippet, models used, savings percentage, verdict, and score. Expand any row for the full responses, judge reasoning, and criteria breakdown.
  • Delete removes a single run (with confirmation).
  • Bulk delete — select multiple rows with the checkboxes and delete them all at once.
  • At most 20 runs per organization are kept — the oldest run is trimmed automatically when a new one is inserted.
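The retention behavior above (a fixed cap, oldest run trimmed on insert, single and bulk delete) resembles a bounded queue. The sketch below models one retention bucket in memory; the `EvalHistory` class, the `id` field, and the in-memory storage are assumptions for illustration and say nothing about how Cloudidr actually stores runs.

```python
from collections import deque
from typing import Iterable, List

MAX_RUNS = 20  # cap stated in the text above


class EvalHistory:
    """Hypothetical in-memory model of a capped eval history."""

    def __init__(self, max_runs: int = MAX_RUNS) -> None:
        # A deque with maxlen drops the oldest item when a new one is appended.
        self._runs: deque = deque(maxlen=max_runs)

    def add(self, run: dict) -> None:
        self._runs.append(run)

    def delete(self, run_id: int) -> None:
        self.bulk_delete([run_id])

    def bulk_delete(self, run_ids: Iterable[int]) -> None:
        ids = set(run_ids)
        kept = [r for r in self._runs if r["id"] not in ids]
        self._runs = deque(kept, maxlen=self._runs.maxlen)

    def runs(self) -> List[dict]:
        """Return runs newest first, as a history table would show them."""
        return list(reversed(self._runs))


if __name__ == "__main__":
    history = EvalHistory()
    for i in range(25):
        history.add({"id": i, "prompt": f"prompt {i}"})
    print(len(history.runs()), history.runs()[-1]["id"])
```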