You’ve built an agent. It seems to work. Someone on your team said “looks good” in a Slack thread. Are you ready to ship it? Not quite. The hard truth about AI agents is that seeming to work and actually working reliably are very different things. This post is about how you close that gap using the built-in evaluation framework in Solace Agent Mesh.
We’ll cover what evaluations are, why they matter, how the framework is structured, and walk through real examples from simple to complex. By the end you’ll be able to write your own test cases and run them against your agents locally.
This post assumes you already have a working Solace Agent Mesh environment. If you’re starting from scratch, check out the getting-started guide first.
What Are Evaluations?
An evaluation (or eval) is a structured, repeatable test of an AI system’s behavior against defined expectations. Think of it as the AI equivalent of an integration test: instead of asserting that a function returns true, you’re asserting that an agent did the right thing given a realistic input.
At its core, an eval answers: given this input, does the system behave the way I expect?
That might mean:
- Did the agent respond with a greeting when I said hello?
- Did it call the right tool to process a file?
- Did it delegate to the right peer agent to complete a task?
- Was the final response accurate and complete?
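To make that concrete before we get to the real framework, here's a minimal, framework-agnostic sketch of what an eval boils down to. The record shape and scoring helper are hypothetical illustrations, not the Solace Agent Mesh API; the framework handles all of this for you via sam eval.

```python
# Illustrative only: an "eval" as a structured, repeatable check of agent behavior.
# The EvalCase shape and scoring helper are hypothetical, not Solace Agent Mesh APIs.
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    query: str
    expected_tools: list[str] = field(default_factory=list)
    criterion: str = ""

def score_tool_match(expected: list[str], actual: list[str]) -> float:
    """1.0 if every expected tool was actually called, else 0.0 (no expectations always passes)."""
    if not expected:
        return 1.0
    return 1.0 if all(tool in actual for tool in expected) else 0.0

# Pretend the agent answered a greeting without calling any tools.
case = EvalCase(query="Hello, world!", expected_tools=[])
print(score_tool_match(case.expected_tools, actual=[]))  # 1.0: behaved as expected
```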
Why Evaluations Matter
AI agents fail differently from traditional software. There’s no stack trace when an agent gives a mediocre answer. No alarm fires when a model update quietly changes how your agent reasons about a task.
Here’s what you’re up against without evals:
- Silent degradation: model updates, tool changes, and config drift can all shift agent behavior without breaking anything in an obvious way
- Immeasurable variance: LLMs are probabilistic by nature, so the same prompt will produce different outputs across runs, and without evals you have limited ability to tell acceptable variance from dangerous variance (different tool calls, wrong decisions, dropped steps)
- Multi-hop complexity: enterprise agents invoke tools, delegate to peer agents, and process artifacts; every hop is a potential failure point
- No shared baseline: without a repeatable test, “it works” means something different to every person on your team
Evals are how you move from vibes to evidence.
Workflow Evaluations
Solace Agent Mesh Workflows can be evaluated in a similar manner. Keep an eye out for a deeper dive on workflows and workflow evaluations with Solace Agent Mesh.
How Evaluations Work in Solace Agent Mesh
Solace Agent Mesh’s evaluation framework is built into the CLI and runs real requests through real agents on a real broker (not mocked simulations). That means the results actually reflect how your system behaves in production.
There are three building blocks:
- Test Case: A single JSON file describing one scenario. It includes the prompt, any file attachments (artifacts), which agent to target, and the criteria for a successful outcome.
- Test Suite: A JSON file that groups one or more test cases into a single run. It also defines the environment: which agents to start, which LLM models to use, broker connection details, and how many times to run each test.
- Evaluation Settings: A configuration block inside the test suite that specifies how to score results. There are three scoring methods, and you can use any combination of them:
| Method | How it works | Best for |
| --- | --- | --- |
| Tool Match | Checks whether the agent called the tools listed in expected_tools | Verifying correct tool usage |
| Response Match | ROUGE score comparing the actual response to expected_response | Factual responses, extraction tasks |
| LLM Evaluator | A separate LLM judges the full interaction against your criterion | Complex, holistic quality assessment |
On ROUGE
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures word overlap between two pieces of text. It’s a common NLP metric, good for catching when agents produce the right words, but it won’t catch a correct-but-differently-worded answer. That’s where LLM Evaluator picks up the slack.
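For intuition about how sensitive word overlap is to phrasing, here's a rough, stdlib-only approximation of a ROUGE-1-style F1 score. It is not the implementation the framework uses, and it won't reproduce the exact numbers you'll see later in this post; it just shows why a paraphrase gets penalized.

```python
# A simplified ROUGE-1-style score (unigram overlap F1), for intuition only.
# The framework uses a proper ROUGE implementation; this is an approximation.
from collections import Counter

def rouge1_f1(expected: str, actual: str) -> float:
    ref = Counter(expected.lower().split())
    hyp = Counter(actual.lower().split())
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(hyp.values())
    return 2 * precision * recall / (precision + recall)

expected = "Hello! How can I help you today?"
print(rouge1_f1(expected, "Hello! How can I help you today?"))    # 1.0: exact match
print(rouge1_f1(expected, "Hello! How can I assist you today?"))  # < 1.0: "assist" in place of "help"
print(rouge1_f1(expected, "Hi there, happy to lend a hand."))     # 0.0, even though it's a fine answer
```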
Walking Through Test Cases
Agent Mesh Vibe Coding
If you build with an AI assistant, you can add the Solace Agent Mesh Context server to give it the context it needs for building Solace Agent Mesh evaluations.
Let’s look at three test cases of increasing complexity, all from the Solace Agent Mesh repo at tests/evaluation/test_cases/.
Example 1: Hello World
The simplest possible test. Does the agent respond to a greeting?
{
"test_case_id": "hello_world",
"category": "Content Generation",
"description": "A simple test case to check the basic functionality of the system.",
"query": "Hello, world!",
"target_agent": "OrchestratorAgent",
"wait_time": 30,
"evaluation": {
"expected_tools": [],
"expected_response": "Hello! How can I help you today?",
"criterion": "Evaluate if the agent provides a standard greeting."
}
}
A few things worth noting here:
- target_agent must match the agent's display name as configured in your Solace Agent Mesh setup
- expected_tools is empty, meaning this test doesn't expect any tool calls, just a conversational response
- wait_time is 30 seconds, which is generous for a simple greeting but fine as a baseline
- criterion is the plain-language instruction passed to the LLM Evaluator if you have it enabled
This kind of test is your canary. If it fails, something is fundamentally broken with your agent setup, and you want to know that before anything else.
Example 2: PDF to Markdown (Orchestration)
This test verifies that the orchestrator correctly delegates to a peer agent when asked to process a file.
{
"test_case_id": "convert_pdf_to_md",
"category": "Orchestration",
"description": "A test case to convert a PDF file to markdown.",
"target_agent": "OrchestratorAgent",
"query": "Please convert the attached PDF file to markdown using the Markitdown Agent.",
"artifacts": [
{
"type": "file",
"path": "artifacts/sample.pdf"
}
],
"wait_time": 120,
"evaluation": {
"expected_tools": ["peer_MarkitdownAgent"],
"expected_response": "I have converted the PDF file to markdown and attached it.",
"criterion": "Evaluate if the agent successfully uses the MarkitdownAgent to convert the PDF file to a markdown file and confirms task completion."
}
}
The jump in complexity here is meaningful:
- artifacts: a real PDF file is attached, just like a user would upload one in the UI. Paths are relative to the test suite config file's location
- wait_time is now 120 seconds, reflecting that multi-agent orchestration takes longer
- expected_tools includes peer_MarkitdownAgent, which is how the orchestrator delegates to a peer agent. If it never calls that peer, the tool match score will be 0
- The criterion is more specific: it asks the LLM Evaluator to check for both the delegation and a confirmation of task completion
This test validates an entire agent-to-agent delegation path, not just a single response.
Example 3: CSV Filtering (Tool Usage)
This one tests whether the agent correctly uses a built-in tool to process structured data from an artifact.
{
"test_case_id": "filter_csv_employees_by_age_and_country",
"category": "Tool Usage",
"description": "A test case to filter employees from a CSV file based on age and country.",
"target_agent": "OrchestratorAgent",
"query": "From the attached CSV, please list the names of all people who are older or equal to 30 and live in the USA.",
"artifacts": [
{
"type": "file",
"path": "artifacts/sample.csv"
}
],
"wait_time": 120,
"evaluation": {
"expected_tools": ["extract_content_from_artifact"],
"expected_response": "The person who is 30 or older and lives in the USA is John Doe.",
"criterion": "Evaluate if the agent correctly filters the CSV data."
}
}
This test has a deterministic expected response. There’s one correct answer (John Doe), which makes both the Response Match and LLM Evaluator scores very meaningful here. It also checks that the agent uses extract_content_from_artifact rather than trying to invent a workaround.
Running a Test Suite
Individual test cases don’t run on their own. You group them into a Test Suite and run the whole thing with sam eval.
Here’s a local test suite that covers all three examples above (plus more), tested across multiple LLM models:
{
"agents": [
"examples/agents/a2a_agents_example.yaml",
"examples/agents/a2a_multimodal_example.yaml",
"examples/agents/orchestrator_example.yaml"
],
"broker": {
"SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
"SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
"SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
"SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
},
"llm_models": [
{
"name": "azure-gpt-4o",
"env": {
"LLM_SERVICE_PLANNING_MODEL_NAME": "openai/azure-gpt-4o",
"LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
"LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
}
},
{
"name": "gemini-3-pro-preview",
"env": {
"LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-3-pro-preview",
"LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
"LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
}
}
],
"results_dir_name": "my-eval-run",
"runs": 3,
"workers": 4,
"test_cases": [
"tests/evaluation/test_cases/hello_world.test.json",
"tests/evaluation/test_cases/convert_pdf_to_md.test.json",
"tests/evaluation/test_cases/filter_csv_employees_by_age_and_country.test.json"
],
"evaluation_settings": {
"tool_match": { "enabled": true },
"response_match": { "enabled": true },
"llm_evaluator": {
"enabled": true,
"env": {
"LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-3-pro-preview",
"LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
"LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
}
}
}
}
A few things worth calling out:
- agents lists the YAML configs for any agents the framework needs to start locally. This is what makes it a local eval run vs. a remote one
- runs: 3 runs each test case 3 times. Agent behavior has variance; a single run isn't enough to draw conclusions
- workers: 4 runs tests in parallel to keep things fast
- The llm_evaluator uses a different model than the agents under test. This is intentional: your judge shouldn't be the same model as the one being judged
To run it:
sam eval tests/evaluation/my_suite.json

# or with verbose output to see full message traces
sam eval tests/evaluation/my_suite.json --verbose
The framework starts your agents, submits each test case via the Solace broker, collects responses, scores everything, and writes results to the results/ directory. You get a per-test JSON summary and an HTML report with visual charts.
Local vs. Remote
The suite above runs agents locally, which is great for development and model comparisons. If you want to evaluate against an already-running Solace Agent Mesh deployment (staging, production), swap agents and llm_models for a remote block pointing at your instance’s REST gateway. The test cases themselves stay exactly the same.
Model Comparison Evals
This is where the Solace Agent Mesh eval framework earns its keep. Most eval tools test one model. Solace Agent Mesh runs your entire test suite against every model you specify in a single command, then renders a side-by-side comparison in the HTML report.
This matters because model choice is rarely a one-time decision. Models get updated, costs change, and new models ship that might outperform what you’re currently using for specific tasks.
To compare Claude Sonnet, Claude Opus, and Gemini, add them all to llm_models:
"llm_models": [
{
"name": "claude-sonnet-4-6",
"env": {
"LLM_SERVICE_PLANNING_MODEL_NAME": "anthropic/claude-sonnet-4-6",
"LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
"LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
}
},
{
"name": "claude-opus-4-6",
"env": {
"LLM_SERVICE_PLANNING_MODEL_NAME": "anthropic/claude-opus-4-6",
"LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
"LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
}
},
{
"name": "gemini-2.5-pro",
"env": {
"LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-2.5-pro",
"LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
"LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
}
}
]
The framework runs the full suite once per model. The HTML report shows each model’s scores broken down by test category, so you can see, for example, that Opus edges out Sonnet on holistic quality scores for complex orchestration tasks, while Sonnet and Gemini are comparable on straightforward tool usage tests.
Some practical guidance for getting clean comparisons:
- Use at least 3 runs per model. One run is not a data point; it’s a coin flip
- Keep your LLM Evaluator model separate from the models under test. Pick a stable, strong model you’re not comparing (e.g. GPT-4o or Gemini 2.5 Pro) as the judge
- Organize test cases by category: the HTML report groups scores by category, so you'll get more actionable charts if your categories are meaningful
- Watch for variance, not just average scores. A model that scores 0.9 on average but has high variance between runs is less reliable than one that consistently scores 0.8 (see the sketch after this list)
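If you want to quantify that spread yourself, a small sketch like the one below works. The per-run numbers are illustrative (the CSV-filtering scores mirror the example later in this post), and it doesn't depend on the framework's output format.

```python
# Flag test cases whose scores swing a lot between runs.
# The per-run scores below are illustrative; substitute your own from the results directory.
from statistics import mean, pstdev

llm_eval_runs = {
    "hello_world": [1.0, 1.0, 1.0],
    "convert_pdf_to_md": [1.0, 0.6, 1.0],
    "filter_csv_employees_by_age_and_country": [0.0, 0.6, 0.3],
}

for test, scores in llm_eval_runs.items():
    avg, spread = mean(scores), pstdev(scores)
    flag = "  <-- high variance, investigate" if spread > 0.2 else ""
    print(f"{test}: avg={avg:.2f} stdev={spread:.2f}{flag}")
```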
Reading Your Results
Running sam eval produces two things: a summary printed to the terminal when the run finishes, and a results/ directory with everything stored in detail.
What you get in the terminal
As soon as the run completes you’ll see a table like this, averaged across all runs per test case:
Model                      | Test Case                                | Tool Match | Response Match | LLM Eval
---------------------------|------------------------------------------|------------|----------------|---------
bedrock-claude-4-5-sonnet  | filter_csv_employees_by_age_and_country  | 0.00       | 0.03           | 0.00
bedrock-claude-4-5-sonnet  | convert_pdf_to_md                        | 0.00       | 0.04           | 0.00
bedrock-claude-4-5-sonnet  | hello_world                              | 1.00       | 0.00           | 0.00
gpt-4-1                    | convert_pdf_to_md                        | 1.00       | 0.25           | 1.00
gpt-4-1                    | hello_world                              | 1.00       | 0.53           | 0.67
gpt-4-1                    | filter_csv_employees_by_age_and_country  | 0.00       | 0.34           | 0.30
This gives you a quick read on where things stand. For anything that looks wrong, the full detail is in the results/ directory.
What’s in the results directory
results/sam-local-eval-test/
├── report.html                   # Visual HTML report
├── stats.json                    # Aggregated scores for all models and test cases
├── gpt-4-1/
│   ├── results.json              # Per-run scores and LLM judge reasoning
│   ├── task_mappings.json
│   ├── hello_world/
│   │   ├── run_1/
│   │   ├── run_2/
│   │   └── run_3/
│   └── convert_pdf_to_md/
│       └── ...
└── bedrock-claude-4-5-sonnet/
    └── ...
The HTML report is where you'll spend most of your time. It shows benchmark run info, which models were tested, and LLM evaluation scores broken down by task category.
The per-model results.json files go deeper: every run gets its own scores plus the full LLM judge reasoning as plain text. That reasoning is genuinely useful when a score surprises you.
Interpreting what the scores are telling you
The three scores serve different purposes and should be read together, not in isolation.
- Tool Match is binary and unforgiving. Either the agent called the expected tool or it didn't. A score of 0.00 means the tool was never used across any run. For bedrock-claude-4-5-sonnet, both convert_pdf_to_md and filter_csv_employees_by_age_and_country scored 0.00 on tool match, which tells us immediately that the model failed to delegate to peer_MarkitdownAgent and failed to call extract_content_from_artifact. Something is broken at the instruction-following level for this model in this agent configuration.
- Response Match (ROUGE) is a weaker signal that can mislead. Notice that gpt-4-1 on hello_world scores 0.53 for response match but 0.67 for LLM eval. The agent responded with “Hello! How can I assist you today?” rather than “Hello! How can I help you today?”; the word “assist” vs. “help” tanks the ROUGE score even though the response is perfectly correct. This is exactly why ROUGE alone is not enough: it measures word overlap, not meaning. Always cross-reference with LLM eval.
- LLM Evaluator gives you the most actionable signal, and the reasoning explains why. For gpt-4-1 on filter_csv_employees_by_age_and_country, the three runs scored 0.0, 0.6, and 0.3 respectively (average 0.30). The LLM judge's reasoning on the 0.6 run explains it well: “The agent ultimately provides the correct answer (John Doe). However, the response is confusing and initially incorrect. The agent’s first sentence is factually wrong… it then contradicts itself…” That's not a pass. The agent eventually got to the right answer but only after stating the wrong one. The LLM judge caught that nuance. ROUGE didn't.
Low Response Match doesn’t always mean failure
A 0.25 response match on convert_pdf_to_md alongside a 1.00 LLM eval score is normal. The agent returned a confirmation message worded differently from the expected_response, but the LLM judge correctly recognized that it successfully completed the task. Set your expected_response to represent the spirit of the answer, and let LLM eval handle the quality judgment.
Using results to build confidence over time
A single eval run gives you a snapshot. Running evals consistently over time is what gives you confidence.
- After agent changes: any time you modify agent instructions, add a tool, or change how an agent delegates, re-run the suite before deploying. A drop in tool match scores is a reliable early warning that your instruction changes broke something in the agent’s reasoning. A drop in LLM eval scores points to response quality degradation.
- After model updates: models change under you. If your LLM provider pushes an update to a model you're using in production, run your eval suite against it before cutting over. The bedrock-claude-4-5-sonnet results here are a good example: the model greets correctly (hello_world tool match 1.00) but completely fails at tool delegation and data processing. Without evals, you wouldn't know that until a user complained.
- Tracking variance: stats.json stores score distributions (min, Q1, median, Q3, max) for every test case, not just averages. If a test case scores 1.0, 1.0, 0.0 across three runs, the average is 0.67 but the 0.0 run is a real problem. Watch the distribution, not just the mean. A model that fails 1 in 3 times is not production-ready for that task.
- Building a baseline: run your suite when your agents are working well, and save those results as a baseline. Future runs can then be compared against it (a sketch of that comparison follows this list). Scores drifting downward on tasks that used to pass consistently is a signal worth investigating, even if the average is still technically above your threshold.
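Here is one way that baseline comparison could look. This is a minimal sketch: it does not parse the framework's stats.json (its exact schema isn't shown in this post), and the score dicts and tolerance value are illustrative.

```python
# A minimal baseline-comparison sketch. The dicts below are illustrative
# {test_case: average LLM eval score} maps; adapt the loading step to however
# you extract scores from your own results directory.
baseline = {
    "hello_world": 0.90,
    "convert_pdf_to_md": 0.95,
    "filter_csv_employees_by_age_and_country": 0.85,
}
current = {
    "hello_world": 0.90,
    "convert_pdf_to_md": 0.70,   # dropped since the baseline run
    "filter_csv_employees_by_age_and_country": 0.85,
}

TOLERANCE = 0.10  # allow a little run-to-run noise before calling it a regression

for test, score in current.items():
    if test in baseline and baseline[test] - score > TOLERANCE:
        print(f"REGRESSION {test}: {baseline[test]:.2f} -> {score:.2f}")
```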
What “good enough” looks like
For tool-dependent tasks, aim for tool match 1.00 consistently. For LLM eval, a score above 0.8 averaged across 3 runs is a reasonable production bar for most tasks, though you should set your own thresholds based on how critical each task is. For response match, treat anything below 0.5 as a prompt to check the LLM eval score before drawing conclusions.
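If you want to turn those rules of thumb into an automated gate, for example in a CI job that runs your suite, a minimal check might look like this. The thresholds are the suggestions above, not values enforced by the framework, and the function is a hypothetical helper, not part of the sam CLI.

```python
# A simple pass/fail gate using the rough thresholds discussed above.
# Scores are averages across runs for a single test case.
def passes_gate(tool_match: float, llm_eval: float, tool_dependent: bool = True) -> bool:
    # Low response match on its own isn't treated as a failure; defer to LLM eval instead.
    if tool_dependent and tool_match < 1.0:
        return False           # expected tools must be called consistently
    return llm_eval >= 0.8     # suggested production bar for holistic quality

print(passes_gate(tool_match=1.0, llm_eval=1.0))                        # True: task completed
print(passes_gate(tool_match=0.0, llm_eval=0.9))                        # False: expected tool never called
print(passes_gate(tool_match=1.0, llm_eval=0.6, tool_dependent=False))  # False: quality below the bar
```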
What’s Next
Once you have a working suite, the natural next step is running it on a cadence: after every agent config change, after a model version update, or as part of your CI pipeline.
The Solace Agent Mesh evaluations documentation covers all the configuration options in detail, including remote evaluation mode, artifact types, and full field references.
The example test cases shown in this post (hello_world, convert_pdf_to_md, filter_csv_employees_by_age_and_country, and the rest) are all available in the Solace Agent Mesh GitHub repo. Clone it, run the examples, and start adapting them for your own agents.
Start with hello_world – it’ll tell you more than you expect!

Jamieson is a software engineer with 8 years of experience spanning embedded software security, technical project management, and solution architecture. He specializes in event-driven architecture (EDA), helping organizations transform legacy systems and implement real-time solutions. Passionate about bridging the gap between complex technical challenges and practical business outcomes through modern event-driven approaches.