AgentEval 0.10.0-beta

This is a prerelease version of AgentEval.

There is a newer prerelease version of this package available.
See the version list below for details.

dotnet add package AgentEval --version 0.10.0-beta

NuGet\Install-Package AgentEval -Version 0.10.0-beta

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="AgentEval" Version="0.10.0-beta" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

Directory.Packages.props:

<PackageVersion Include="AgentEval" Version="0.10.0-beta" />

Project file:

<PackageReference Include="AgentEval" />

For projects that support Central Package Management (CPM), copy the PackageVersion node into the solution Directory.Packages.props file to version the package, and the version-less PackageReference node into the project file.

paket add AgentEval --version 0.10.0-beta

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: AgentEval, 0.10.0-beta"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package AgentEval@0.10.0-beta

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=AgentEval&version=0.10.0-beta&prerelease

Install as a Cake Tool:

#tool nuget:?package=AgentEval&version=0.10.0-beta&prerelease

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

AgentEval

The .NET Evaluation Toolkit for AI Agents

Built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS and DeepEval do for Python, AgentEval does for .NET.

Features

🎯 Tool Tracking — Monitor tool/function calls with timing, arguments, and ordering
✅ Fluent Assertions — Expressive assertions with rich failure messages, "because" reasons, and assertion scopes
📊 Performance Metrics — TTFT, latency, tokens, cost estimation for 8+ models
🔬 RAG Metrics — Faithfulness, relevance, context precision/recall, answer correctness
🛡️ Red Team Security — 9 attack types, 192 probes, OWASP LLM Top 10 coverage
⚖️ Responsible AI — Toxicity, bias, and misinformation detection metrics
📈 Stochastic Evaluation — Statistical model comparison with multi-run analysis
🔄 Trace Record & Replay — Deterministic CI testing without LLM calls
🎯 Calibrated Evaluator — Multi-model consensus-driven scoring
🔌 Extensible — Adapter pattern for any agent framework

Quick Start

using AgentEval;
using AgentEval.MAF;
using AgentEval.Assertions;

// Create evaluation harness
var harness = new MAFEvaluationHarness(evaluatorClient);

// Run evaluation with tool tracking
var result = await harness.RunEvaluationAsync(agent, new TestCase
{
    Name = "Feature Planning Test",
    Input = "Plan a user authentication feature",
    EvaluationCriteria = ["Should include security considerations"]
});

// Assert tool usage with "because" reasons
result.ToolUsage!
    .Should()
    .HaveCalledTool("SecurityTool", because: "auth features require security review")
        .BeforeTool("FeatureTool")
        .WithoutError()
    .And()
    .HaveNoErrors();

// Assert performance
result.Performance!
    .Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(10))
    .HaveEstimatedCostUnder(0.10m);

Red Team Security Scanning

var result = await AttackPipeline.Create()
    .WithAllAttacks()
    .ScanAsync(agent);

result.Should().HaveOverallScoreAbove(85);
await result.ExportAsync("security-report.sarif", ExportFormat.Sarif);

Trace Record & Replay

Capture agent executions for deterministic replay — no LLM calls needed in CI:

// Record
await using var recorder = new TraceRecordingAgent(realAgent, "weather_test");
var response = await recorder.InvokeAsync("What's the weather?");
await TraceSerializer.SaveToFileAsync(recorder.Trace, "trace.json");

// Replay (deterministic, free)
var trace = await TraceSerializer.LoadFromFileAsync("trace.json");
var replayer = new TraceReplayingAgent(trace);
var replayed = await replayer.InvokeAsync("What's the weather?");
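
The idea behind record and replay can be sketched without the library. The following is a minimal illustration, not AgentEval's implementation: `ISimpleAgent`, `RecordingAgent`, `ReplayingAgent`, and `FakeWeatherAgent` are hypothetical stand-ins, and the trace is assumed to be a plain prompt-to-response dictionary serialized as JSON.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical stand-ins for illustration only; these are not AgentEval types.
public interface ISimpleAgent
{
    Task<string> InvokeAsync(string prompt);
}

// Stub "real" agent standing in for an LLM-backed one.
public sealed class FakeWeatherAgent : ISimpleAgent
{
    public Task<string> InvokeAsync(string prompt) => Task.FromResult("Sunny");
}

// Wraps an agent and remembers every prompt/response pair it sees.
public sealed class RecordingAgent : ISimpleAgent
{
    private readonly ISimpleAgent _inner;
    public Dictionary<string, string> Trace { get; } = new();

    public RecordingAgent(ISimpleAgent inner) => _inner = inner;

    public async Task<string> InvokeAsync(string prompt)
    {
        var response = await _inner.InvokeAsync(prompt);
        Trace[prompt] = response;
        return response;
    }
}

// Answers only from a recorded trace, so no model call is made.
public sealed class ReplayingAgent : ISimpleAgent
{
    private readonly IReadOnlyDictionary<string, string> _trace;

    public ReplayingAgent(IReadOnlyDictionary<string, string> trace) => _trace = trace;

    public Task<string> InvokeAsync(string prompt)
    {
        if (_trace.TryGetValue(prompt, out var response))
            return Task.FromResult(response);
        throw new KeyNotFoundException($"No recorded response for: {prompt}");
    }
}

public static class ReplayDemo
{
    public static async Task Main()
    {
        // Record once against the (stubbed) real agent.
        var recorder = new RecordingAgent(new FakeWeatherAgent());
        await recorder.InvokeAsync("What's the weather?");

        // Round-trip through JSON, as a trace file on disk would.
        string json = JsonSerializer.Serialize(recorder.Trace);
        var trace = JsonSerializer.Deserialize<Dictionary<string, string>>(json)!;

        // Replay returns the recorded answer without touching the agent.
        var replayer = new ReplayingAgent(trace);
        Console.WriteLine(await replayer.InvokeAsync("What's the weather?"));
    }
}
```

In this sketch an unrecorded prompt throws instead of falling back to a live call, which keeps a CI run from silently becoming non-deterministic. Whether the library makes the same choice is not stated here.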

Model Comparison

var result = await comparer.CompareModelsAsync(
    factories: [gpt4oFactory, gpt4oMiniFactory],
    testCases: testSuite,
    options: new ComparisonOptions(RunsPerModel: 5));

Console.WriteLine(result.ToMarkdown());
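
Why run each model several times: a single run of a non-deterministic agent says little about its typical behavior, while the mean and spread across runs do. The sketch below is a generic illustration of that multi-run summary, not the statistics AgentEval computes; `RunStats` is a hypothetical helper and the scores are made-up placeholder values.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of AgentEval.
public static class RunStats
{
    // Mean of the per-run scores.
    public static double Mean(double[] scores) => scores.Average();

    // Sample standard deviation (n - 1 denominator); 0 for fewer than two runs.
    public static double StdDev(double[] scores)
    {
        if (scores.Length < 2) return 0.0;
        double mean = scores.Average();
        double sumSq = scores.Sum(s => (s - mean) * (s - mean));
        return Math.Sqrt(sumSq / (scores.Length - 1));
    }
}

public static class ComparisonDemo
{
    public static void Main()
    {
        // Made-up scores from five runs of two hypothetical models.
        double[] modelA = { 0.90, 0.85, 0.95, 0.90, 0.90 };
        double[] modelB = { 0.70, 0.95, 0.60, 1.00, 0.75 };

        foreach (var (name, scores) in new[] { ("A", modelA), ("B", modelB) })
        {
            Console.WriteLine(
                $"{name}: mean={RunStats.Mean(scores):F2} stddev={RunStats.StdDev(scores):F2}");
        }
    }
}
```

A model with a slightly lower mean but a much smaller spread may still be the better choice for CI gating, which is the kind of trade-off a multi-run comparison makes visible.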

Quality Assurance

Comprehensive evaluation suite targeting net8.0, net9.0, and net10.0
All evaluations passing ✅

Installation

dotnet add package AgentEval --prerelease

Single package, modular internals — AgentEval ships as one NuGet package containing 6 focused assemblies, including:

AgentEval.Abstractions — Public contracts and interfaces
AgentEval.Core — Metrics, assertions, comparison, tracing
AgentEval.DataLoaders — Data loading and export (JSON, YAML, CSV, JSONL)
AgentEval.MAF — Microsoft Agent Framework integration
AgentEval.RedTeam — Security testing (multiple attack types and probes)

Service Registration

// Register all services at once (recommended):
services.AddAgentEvalAll();

// Or register selectively:
services.AddAgentEval();              // Core services only
services.AddAgentEvalDataLoaders();   // DataLoaders + Exporters
services.AddAgentEvalRedTeam();       // Red Team security testing

License

MIT License — See LICENSE for details.

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net10.0
- JsonSchema.Net (>= 7.3.4)
- Microsoft.Agents.AI (>= 1.3.0)
- Microsoft.Agents.AI.Workflows (>= 1.3.0)
- Microsoft.Extensions.AI (>= 10.5.0)
- Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
- Microsoft.Extensions.DependencyInjection (>= 10.0.3)
- OpenTelemetry.Api (>= 1.15.3)
- PdfSharp-MigraDoc (>= 6.2.4)
- QuestPDF (>= 2026.2.4)
- System.Numerics.Tensors (>= 10.0.6)
- YamlDotNet (>= 16.3.0)
net8.0
- JsonSchema.Net (>= 7.3.4)
- Microsoft.Agents.AI (>= 1.3.0)
- Microsoft.Agents.AI.Workflows (>= 1.3.0)
- Microsoft.Extensions.AI (>= 10.5.0)
- Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
- Microsoft.Extensions.DependencyInjection (>= 10.0.3)
- OpenTelemetry.Api (>= 1.15.3)
- PdfSharp-MigraDoc (>= 6.2.4)
- QuestPDF (>= 2026.2.4)
- System.Numerics.Tensors (>= 10.0.6)
- YamlDotNet (>= 16.3.0)
net9.0
- JsonSchema.Net (>= 7.3.4)
- Microsoft.Agents.AI (>= 1.3.0)
- Microsoft.Agents.AI.Workflows (>= 1.3.0)
- Microsoft.Extensions.AI (>= 10.5.0)
- Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
- Microsoft.Extensions.DependencyInjection (>= 10.0.3)
- OpenTelemetry.Api (>= 1.15.3)
- PdfSharp-MigraDoc (>= 6.2.4)
- QuestPDF (>= 2026.2.4)
- System.Numerics.Tensors (>= 10.0.6)
- YamlDotNet (>= 16.3.0)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
0.10.1-beta	0	5/18/2026
0.10.0-beta	39	5/17/2026
0.9.0-beta	41	5/17/2026
0.8.1-beta	495	4/29/2026
0.8.0-beta	63	4/28/2026
0.6.0-beta	1,051	3/5/2026
0.5.4-beta	99	3/3/2026
0.5.3-beta	124	3/1/2026
0.5.2-beta	95	2/28/2026
0.5.1-beta	86	2/28/2026
0.4.0-beta	102	2/22/2026
0.3.0-beta	143	1/25/2026
0.2.1-beta	83	1/24/2026
0.2.0-beta	80	1/18/2026
0.1.1-alpha	92	1/3/2026
0.1.0-alpha	84	1/3/2026

v0.10.0-beta: Unified Benchmarks Architecture (ADR-017).

Eight benchmark families, namely Agentic, GDPR, EU AI Act, OWASP LLM Top 10, MITRE ATLAS, LongMemEval (ICLR 2025), Performance, and Memory, now share a single discovery surface (`AgentEval.Benchmarks` namespace) and register canonically with `BenchmarkFamilyRegistry`.

New:
- `OwaspBenchmark`, `MitreBenchmark`, `LongMemEvalBenchmark` façades
- `bench --list` CLI command
- `bench perf {latency,throughput,cost}` CLI subcommand
- per-family `bench {family} --help` enumeration

GDPR / EU AI Act benchmarks promoted from `samples/` to `src/AgentEval.Compliance.{Gdpr,EuAiAct}` first-class product assemblies. `PerformanceBenchmark` relocated to `AgentEval.Evals.Performance` with a Convention-2 `EvaluateAsync(EvalInput) → EvalResult` adapter so perf results flow through the standard `.agenteval/` workspace.

BREAKING:
- internal compliance namespaces renamed `AgentEval.GdprBenchmark.*` → `AgentEval.Compliance.Gdpr.*` (same for EuAiAct)
- `LongMemEvalBenchmark.Full()` now throws when `LONGMEMEVAL_DATASET_PATH` is unset rather than silently degrading to the embedded subset
- `PerformanceBenchmark` moves to its own assembly inside the umbrella

Public preset-factory entry points at `AgentEval.Benchmarks.{Family}Benchmark` are unchanged. See CHANGELOG.md for the full Added×6 / Changed×7 / Breaking×3 entry list and ADR-017 for the four durable conventions (namespace, EvaluateAsync adapter, BenchmarkFamilyRegistry, Opus gate-reviews).