Test Suites


Build fixed test cases, run agents against them, and evaluate outputs using test suites in the Visual Builder

A test suite is a named collection of items: each item supplies the messages sent to an agent (the input) and optionally the expected output you want to compare against. In the Visual Builder, test suites appear under Test Suites in the project sidebar.

Test suite runs execute those items against one or more agents, create conversations from each item, and can attach evaluators to score the results.

When you start a run with evaluators selected, the platform also creates a batch evaluation job scoped to that run’s conversations.

Where to find test suites

  1. Open your project in the Visual Builder.
  2. In the project sidebar, choose Test Suites.
Note

You need Edit permission on the project to create or change test suites, items, and run configurations, and to start runs. See Access control.

Create a test suite

From the Test Suites list, create a new suite and give it a name. The suite is empty until you add items.

Test suite items

Each item has:

  • Input — JSON object with a messages array. Each message has a role (user, assistant, or system) and content in the same shape as chat messages elsewhere in the product (for example text strings, or parts for richer content). This is what the agent sees when the item is run.
  • Expected output (optional) — JSON array of messages with the same role/content shape. Use it to record the reference reply you care about; evaluators or your own tooling can compare model output to this.
Screenshot: Create Test Suite Item dialog showing input messages and optional expected output
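
Put together, a single item with a multi-turn input and an expected reply could be written as JSON like this. This is a sketch of the shapes described above; the input and expectedOutput field names match the CSV headers used for bulk upload below.

{
  "input": {
    "messages": [
      { "role": "system", "content": "Be terse." },
      { "role": "user", "content": "Ping" }
    ]
  },
  "expectedOutput": [
    { "role": "assistant", "content": "Pong" }
  ]
}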

Bulk upload from CSV

From the Items tab on a test suite, choose Upload CSV to create many items at once. The file must be UTF-8 and include a header row.

Recognized headers (case-insensitive):

  • input (required)
  • expectedOutput (optional)

Each cell can contain either:

  • Plain text — a single-turn input becomes { messages: [{ role: 'user', content: '<text>' }] }, and a plain-text expected output becomes [{ role: 'assistant', content: '<text>' }].
  • JSON — paste the full input/expectedOutput shape directly when you need multi-turn conversations, system prompts, or structured parts content.
For example:

input,expectedOutput
"What is 2+2?","4"
"{""messages"":[{""role"":""system"",""content"":""Be terse.""},{""role"":""user"",""content"":""Ping""}]}","[{""role"":""assistant"",""content"":""Pong""}]"

Rows with a missing input or invalid JSON are skipped and reported before upload so you can fix them and retry. Successfully parsed rows are sent to POST .../dataset-items/{datasetId}/items/bulk in a single request.
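
If you would rather script the upload than build a CSV, a minimal TypeScript sketch against the bulk endpoint might look like the following. BASE_URL stands in for the elided API root; the bearer-token auth and the items field name in the payload body are assumptions, so check the Evaluations API reference for the real contract.

// Minimal sketch: create many test suite items in one request.
// BASE_URL and API_KEY are placeholders; the payload field name ("items")
// and auth scheme are assumptions -- see the Evaluations API reference.
const BASE_URL = process.env.BASE_URL!;
const API_KEY = process.env.API_KEY!;

type Message = { role: "user" | "assistant" | "system"; content: string };
type Item = { input: { messages: Message[] }; expectedOutput?: Message[] };

// Wrap plain text in the single-turn shapes described above.
function item(input: string, expected?: string): Item {
  return {
    input: { messages: [{ role: "user", content: input }] },
    ...(expected ? { expectedOutput: [{ role: "assistant", content: expected }] } : {}),
  };
}

async function bulkCreate(datasetId: string, items: Item[]) {
  const res = await fetch(`${BASE_URL}/dataset-items/${datasetId}/items/bulk`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ items }), // field name assumed
  });
  if (!res.ok) throw new Error(`Bulk create failed: ${res.status}`);
  return res.json();
}

// Usage: two items, one with an expected output.
await bulkCreate("your-dataset-id", [item("What is 2+2?", "4"), item("Ping")]);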

Agents

You can link agents to a test suite to scope which agents are associated with it, for example to narrow the choices when picking who runs the items. Run configurations still declare which agents actually execute a given run.

Run configurations and runs

A run configuration ties a test suite to:

  • One or more agents that will each process every item (each item × agent produces a run invocation).
  • Optional evaluators to run on the resulting conversations.

Create a run configuration from the test suite detail page (Runs tab). When you start a run, the platform creates a test suite run and processes items. You need at least one item and at least one agent on the configuration before a run can start.

Screenshot: Create Test Suite Run dialog showing name, description, agent selection, and optional evaluators

Open a run to see per-item invocations, conversation links, and evaluation output when evaluators are configured.
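
You can also start runs programmatically: the Programmatic access table below lists a trigger endpoint on run configurations. A minimal sketch, with BASE_URL standing in for the elided API root and a bearer token assumed for auth:

// Minimal sketch: start a run from an existing run configuration.
// BASE_URL and API_KEY are placeholders; the response shape is assumed.
const BASE_URL = process.env.BASE_URL!;
const API_KEY = process.env.API_KEY!;

async function triggerRun(runConfigId: string) {
  const res = await fetch(`${BASE_URL}/dataset-run-configs/${runConfigId}/run`, {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  if (!res.ok) throw new Error(`Failed to start run: ${res.status}`);
  return res.json();
}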

Rerun a past run

Each row on the Runs tab has a Rerun action, and the run detail page exposes a Rerun button in the header. Triggering a rerun creates a new row in the runs table using:

  • The current items in the test suite (any items you've added since the source run are included; deleted items are skipped).
  • The same run configuration (agents, name) as the source run.
  • The same evaluators that were attached to the source run, unless you pass overrides.

The rerun endpoint is POST .../dataset-runs/{runId}/rerun. The response includes the new datasetRunId so you can navigate straight to the in-progress run. Runs that were not created from a run configuration can't be rerun — the button is disabled for those rows.
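
Reruns can be scripted the same way. Only the datasetRunId response field is documented above; BASE_URL, the auth scheme, and any evaluator-override payload shape are assumptions in this sketch.

// Minimal sketch: rerun a past run and return the new run's id.
// BASE_URL and API_KEY are placeholders for the elided API root and credentials.
const BASE_URL = process.env.BASE_URL!;
const API_KEY = process.env.API_KEY!;

async function rerun(runId: string): Promise<string> {
  const res = await fetch(`${BASE_URL}/dataset-runs/${runId}/rerun`, {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  if (!res.ok) throw new Error(`Rerun failed: ${res.status}`);
  const { datasetRunId } = (await res.json()) as { datasetRunId: string };
  return datasetRunId; // documented response field
}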

Programmatic access

  • Evaluations API reference — CRUD test suites and items, agent links, run configs, trigger runs (POST .../dataset-run-configs/{id}/run), list runs and results.
  • TypeScript SDK: Evaluations — EvaluationClient helpers (listDatasets, createDataset, createDatasetItem, createDatasetItems, etc.).

The list endpoint accepts an optional agentId query parameter to restrict results to suites linked to that agent.
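
As a sketch of what SDK usage could look like, here are the EvaluationClient helper names from the table above in context. The import path, constructor options, and exact parameter shapes are assumptions; check the SDK reference.

// Sketch: EvaluationClient helpers named in the table above.
// The import path and constructor options are assumptions.
import { EvaluationClient } from "your-sdk-package"; // path assumed

const client = new EvaluationClient({ apiKey: process.env.API_KEY! }); // options assumed

// Create a suite and add an item with an expected output.
const suite = await client.createDataset({ name: "smoke-tests" });
await client.createDatasetItems(suite.id, [
  {
    input: { messages: [{ role: "user", content: "What is 2+2?" }] },
    expectedOutput: [{ role: "assistant", content: "4" }],
  },
]);

// Restrict the listing to suites linked to one agent (agentId filter, as noted above).
const linked = await client.listDatasets({ agentId: "agent-123" }); // param shape assumed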
