This article describes the concepts of Evaluation Sets and Evaluation Runs, which are tools used to test and validate your AI’s domain performance.

Evaluation Sets

In the Evaluation Sets tab, you can create specific sets of prompts that simulate user interactions and provide expected SQL outputs alongside these prompts. This allows for a direct comparison to see how accurately the AI converts natural language into database queries. To create an Evaluation Set:
  1. Open a Domain and click on the Evaluation tab.
  2. Navigate to the Evaluation Sets Sub-tab.
  3. Click Add Evaluation.
Image showing the Evaluations tab
  4. Fill out the Add Evaluation Form, providing a set Name and an array of prompts in JSON format, optionally including the expected SQL.
    The array you provide defines how many conversations will be created. Each top-level element in the array represents a conversation:
    ✔ If it is a string or an object → you get a single-message conversation
    ✔ If it is an array → you get a multi-message conversation
    For example, this input:
    ["prompt1", ["prompt2.1", "prompt2.2"], "prompt3"]
    creates 3 conversations: the first and third with one message each, and the second with two messages.
    You can structure the arrays in any way, mixing single- and multi-message conversations as required (a short sketch of this mapping follows these steps).
  5. Click Save. The new Evaluation Set will appear in the list.
Image showing the Add Evaluations form
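If it helps to see the conversation-mapping rule in code, here is a minimal sketch in plain Python (not product code; the helper name is purely illustrative) that applies the rules above to a prompts array:

# Illustrative only: mirrors the documented mapping from a prompts array
# to conversations; it is not part of the product.
def to_conversations(prompts):
    """Each top-level element becomes one conversation: a string or object
    yields a single-message conversation, a nested list a multi-message one."""
    conversations = []
    for element in prompts:
        if isinstance(element, list):
            conversations.append(element)    # multi-message conversation
        else:
            conversations.append([element])  # single-message conversation
    return conversations

print(to_conversations(["prompt1", ["prompt2.1", "prompt2.2"], "prompt3"]))
# [['prompt1'], ['prompt2.1', 'prompt2.2'], ['prompt3']]  -> 3 conversations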

Evaluation Runs

Evaluation Runs are where the AI processes your defined Evaluation Sets. After running these evaluations, you can review the results to identify areas for improvement. To run an evaluation, go to the Evaluation Sets tab, select an evaluation, and click Run. You will see the results in the Evaluation Runs tab.
Image showing the Run Evaluation option

Evaluation Run Indicators

Evaluation Run Indicators provide a concise overview of the run’s progress and outcome. These indicators offer immediate feedback on the evaluation’s status and score, detailing how well the AI’s generated responses matched the expected results. They are:
  • Status: Signals its progress or completion.
    • Running: The evaluation run is currently in progress.
    • Completed: The evaluation run has finished successfully.
  • Score: Reflects the result of the completed evaluation. It tells you how many conversations passed based on the predefined evaluation criteria (i.e., the Prompt and Expected SQL result added in the Evaluation Set modal).
The scoring criterion is how closely the generated answer matches the expected SQL result defined previously (see the sketch below).
Image showing the Evaluation Run Indicators
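As a rough illustration only (this is not the product’s implementation), the Score can be read as the number of passing conversations out of the total evaluated:

# Illustrative only: the Score is the count of conversations whose generated
# answer matched the expected result, out of all conversations in the run.
def score(passed_flags):
    """passed_flags: one boolean per conversation (True = passed)."""
    return f"{sum(passed_flags)}/{len(passed_flags)} conversations passed"

print(score([True, False, True]))  # 2/3 conversations passed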

Evaluation Run Report

When you click the View Report option, you will see comprehensive details about how an evaluation run performed. Here’s a breakdown of the information you’ll find:
  • View Domain: A link that allows you to navigate to the specific domain that was evaluated.
  • Soft Match: This score indicates the overall performance of the evaluation. It shows how many of the evaluation criteria were successfully met out of the total. It is called Soft Match because results can be considered a match even if they are not exactly the same as the expected (provided) SQL (see the sketch after this list).
  • Individual Session Details: The report organizes the evaluation results by individual sessions or queries (e.g., Session 1, Session 2).
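As a conceptual sketch only (the exact matching logic is not defined here), a soft match can be thought of as a comparison that tolerates cosmetic differences between the generated and expected SQL; the normalization below is a hypothetical example, not the actual criterion:

import re

# Hypothetical sketch: two SQL strings may count as a match even when they are
# not character-for-character identical. The real matching logic may differ.
def normalize_sql(sql):
    """Trim, drop a trailing semicolon, collapse whitespace, and lowercase
    so cosmetic differences do not break the comparison."""
    return re.sub(r"\s+", " ", sql.strip().rstrip(";")).lower()

def soft_match(expected_sql, generated_sql):
    return normalize_sql(expected_sql) == normalize_sql(generated_sql)

print(soft_match("SELECT SUM(amount) FROM Opportunity",
                 "select sum(amount)  from opportunity;"))  # True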
Gif showing the View Report page
For each session, the report is broken down into several components, allowing you to thoroughly analyze the AI’s performance.
  • Session Title: This states the query or task that was evaluated for that particular session (e.g., “Calculate total revenue for closed won opportunities using ACV.”).
  • Evaluation Details: This expandable section provides specific insights into the session’s outcome. The core components are detailed in the table below, and a short sketch after it shows how the manual and automated scores relate.
Component | Description
Prompt | The specific input prompt that was used for the session.
Manual Score | An option for you to manually score the evaluation (✅ or ❌), which overrides the automated score.
Automated Score | The score automatically assigned by the system. This corresponds to the Soft Match.
Ground Truth | The expected or correct outcome, typically the ideal SQL query (SELECT SUM(amount) FROM Opportunity).
Generated Result | The output produced by the system, including the generated SQL query and the final result (e.g., "$137.55M").
Gif showing the Session Details
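To make the relationship between the Manual Score and the Automated Score concrete, here is an illustrative sketch (an assumed record shape, not the product’s data model) in which a manual score, when present, overrides the automated one:

from dataclasses import dataclass
from typing import Optional

# Illustrative only: an assumed record shape for one session's evaluation details.
@dataclass
class SessionEvaluation:
    prompt: str                           # the input prompt for the session
    ground_truth_sql: str                 # expected SQL
    generated_sql: str                    # SQL produced by the system
    generated_result: str                 # final result, e.g. "$137.55M"
    automated_score: bool                 # assigned by the system (Soft Match)
    manual_score: Optional[bool] = None   # set by a reviewer, if any

    @property
    def effective_score(self) -> bool:
        # A manual score, when provided, overrides the automated score.
        return self.automated_score if self.manual_score is None else self.manual_score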
Providing incorrect or inaccurate SQL will lead to a syntax or semantic error, and the system will display a message in red detailing the nature of the error, often including the location within the query where the problem occurred, to aid in correction.
Example showing an incorrect SQL in red
  • Conversation: You can expand this section to review the entire conversational exchange related to that specific session. This includes:
    • AI Workstream: Offers a look into the AI’s process, showing the tool it selected, the examples it referenced, and the step-by-step plan it followed to generate the response.
    • Reviewed Status: Confirms whether the response has been reviewed and summarizes the outcome (e.g., “The user received a complete response… so no further action or information is needed.”).
Gif showing the Conversation in Detail
Navigating the report helps you understand the system’s processing, verify generated results, and review the underlying conversational logic.

Next Steps