This feature is in public preview.
Marketplace runs an automatic evaluation on every publish. Evaluations let you see whether a new version improved or regressed quality before opening it up to all end users.

What an evaluation does

When a deployment is published, Marketplace:
  1. Generates a set of test questions from the connected sources.
  2. Asks the new version to answer each test question.
  3. Scores each response on faithfulness and relevance using an LLM judge.
  4. Records the aggregate result in the version history.
Evaluations run as the final step of publishing. The version becomes active even if the evaluation flags regressions; treat the score as a signal, not a gate.
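
In rough terms, the flow is a loop over generated test questions, with an LLM judge scoring each answer and the scores averaged into the recorded result. The sketch below is illustrative only: `evaluate_version`, `ask`, and `judge` are assumed names, not part of the Marketplace API, and the callables would be supplied by whatever harness runs the evaluation.

```python
# Hypothetical sketch of the publish-time evaluation loop. None of these
# names come from Marketplace; they only mirror the four steps above.
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class QuestionResult:
    question: str
    answer: str
    faithfulness: float  # 0-1: is the answer grounded in the cited sources?
    relevance: float     # 0-1: does the answer address the question?

def evaluate_version(
    questions: list[str],               # step 1: questions generated from the sources
    ask: Callable[[str], str],          # step 2: queries the newly published version
    judge: Callable[[str, str], dict],  # step 3: LLM judge returning both scores
) -> list[QuestionResult]:
    """Answer each generated test question and score the response."""
    results = []
    for question in questions:
        answer = ask(question)
        scores = judge(question, answer)
        results.append(QuestionResult(question, answer,
                                      scores["faithfulness"],
                                      scores["relevance"]))
    return results

def aggregate(results: list[QuestionResult]) -> dict[str, float]:
    """Step 4: the aggregate numbers recorded in the version history."""
    return {
        "faithfulness": mean(r.faithfulness for r in results),
        "relevance": mean(r.relevance for r in results),
    }
```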

Metrics

Metric       | What it measures
------------ | ----------------
Faithfulness | Whether the answer is grounded in the cited sources.
Relevance    | Whether the answer addresses the question.
You can drill into per-question results to see which test cases regressed and what the application returned.
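
An LLM judge typically operationalizes these definitions as a grading prompt. The template below is only an illustration of how the two metrics could be phrased for a judge model; it is not the prompt Marketplace uses.

```python
# Illustrative grading prompt for the two metrics; not Marketplace's actual judge.
JUDGE_PROMPT = """\
You are grading an answer from a Q&A application.

Question:
{question}

Cited sources:
{sources}

Answer:
{answer}

Return JSON with two numbers between 0 and 1:
  "faithfulness": 1.0 only if every claim in the answer is supported by the cited sources.
  "relevance":    1.0 only if the answer directly addresses the question.
"""

def build_judge_prompt(question: str, sources: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, sources=sources, answer=answer)
```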

Test question generation

Test questions are generated automatically from the connected content; they are not curated by hand. To get more representative tests, keep the connected sources focused on the domain the application serves.
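
As a rough illustration, automatic question generation amounts to prompting a model with excerpts of the connected content and asking for questions those excerpts can answer. The template below is an assumption about how such a step could look, not the generator Marketplace runs.

```python
# Illustrative only; the real generation step is internal to Marketplace.
GENERATION_PROMPT = """\
The excerpt below comes from content connected to a Q&A application.
Write {n} questions that a real end user might ask and that this excerpt
can answer. Return one question per line.

Excerpt:
{excerpt}
"""

def build_generation_prompt(excerpt: str, n: int = 3) -> str:
    return GENERATION_PROMPT.format(n=n, excerpt=excerpt)
```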

Comparing versions

The deployment dashboard shows evaluation results per version. Use the comparison view to see:
  • Aggregate score deltas between versions.
  • Per-question pass and fail changes.
  • Sources cited per response.
If a published version regresses, roll it back from the version history. See Manage versions and rollback.
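
To make the comparison concrete, the sketch below computes aggregate deltas and per-question pass/fail changes between two evaluation runs. It reuses the hypothetical QuestionResult from the earlier sketch, and the 0.5 pass threshold is an assumption, not a documented Marketplace cutoff.

```python
# Hypothetical comparison of two evaluation runs; builds on the
# QuestionResult sketch above. The pass threshold is an assumption.
from statistics import mean

PASS = 0.5

def passed(r: "QuestionResult") -> bool:
    return r.faithfulness >= PASS and r.relevance >= PASS

def compare(old: list["QuestionResult"], new: list["QuestionResult"]) -> dict:
    """Aggregate score deltas plus per-question pass/fail changes."""
    previous = {r.question: r for r in old}
    regressed, improved = [], []
    for r in new:
        prev = previous.get(r.question)
        if prev is None:
            continue  # question not present in the earlier run
        if passed(prev) and not passed(r):
            regressed.append(r.question)
        elif not passed(prev) and passed(r):
            improved.append(r.question)
    return {
        "faithfulness_delta": mean(r.faithfulness for r in new) - mean(r.faithfulness for r in old),
        "relevance_delta": mean(r.relevance for r in new) - mean(r.relevance for r in old),
        "regressed": regressed,
        "improved": improved,
    }
```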

End-user feedback

Evaluations measure the application against generated test cases. End-user feedback measures it against real questions. Use both together. See Analytics and event logs.