AgentEval 0.10.0-beta
A newer prerelease version of this package is available. See the version list below for details.
- .NET CLI: dotnet add package AgentEval --version 0.10.0-beta
- Package Manager: NuGet\Install-Package AgentEval -Version 0.10.0-beta
- PackageReference: <PackageReference Include="AgentEval" Version="0.10.0-beta" />
- Central Package Management: <PackageVersion Include="AgentEval" Version="0.10.0-beta" /> in Directory.Packages.props, plus <PackageReference Include="AgentEval" /> in the project file
- Paket CLI: paket add AgentEval --version 0.10.0-beta
- Script & Interactive: #r "nuget: AgentEval, 0.10.0-beta"
- File-based apps: #:package AgentEval@0.10.0-beta
- Cake addin: #addin nuget:?package=AgentEval&version=0.10.0-beta&prerelease
- Cake tool: #tool nuget:?package=AgentEval&version=0.10.0-beta&prerelease
AgentEval
The .NET Evaluation Toolkit for AI Agents
Built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS and DeepEval do for Python, AgentEval does for .NET.
Features
- Tool Tracking: Monitor tool/function calls with timing, arguments, and ordering
- Fluent Assertions: Expressive assertions with rich failure messages, `because` reasons, and assertion scopes
- Performance Metrics: TTFT, latency, tokens, cost estimation for 8+ models
- RAG Metrics: Faithfulness, relevance, context precision/recall, answer correctness
- Red Team Security: 9 attack types, 192 probes, OWASP LLM Top 10 coverage
- Responsible AI: Toxicity, bias, and misinformation detection metrics
- Stochastic Evaluation: Statistical model comparison with multi-run analysis
- Trace Record & Replay: Deterministic CI testing without LLM calls
- Calibrated Evaluator: Multi-model consensus-driven scoring
- Extensible: Adapter pattern for any agent framework
Quick Start
using AgentEval;
using AgentEval.MAF;
using AgentEval.Assertions;

// Create evaluation harness
var harness = new MAFEvaluationHarness(evaluatorClient);

// Run evaluation with tool tracking
var result = await harness.RunEvaluationAsync(agent, new TestCase
{
    Name = "Feature Planning Test",
    Input = "Plan a user authentication feature",
    EvaluationCriteria = ["Should include security considerations"]
});

// Assert tool usage with "because" reasons
result.ToolUsage!
    .Should()
    .HaveCalledTool("SecurityTool", because: "auth features require security review")
        .BeforeTool("FeatureTool")
        .WithoutError()
    .And()
    .HaveNoErrors();

// Assert performance
result.Performance!
    .Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(10))
    .HaveEstimatedCostUnder(0.10m);
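To make the ordering assertion in the Quick Start concrete, here is a minimal, self-contained sketch of one plausible "called before" rule: both tools must appear in the call log, and the first call to the earlier tool must precede the first call to the later one. `ToolOrder` and `CalledBefore` are hypothetical names for illustration; they are not AgentEval types, and AgentEval's actual rule may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of AgentEval.
public static class ToolOrder
{
    // True when both tools were called and the first call to `earlier`
    // comes before the first call to `later` (an assumed rule).
    public static bool CalledBefore(List<string> callLog, string earlier, string later)
    {
        int first = callLog.IndexOf(earlier);
        int second = callLog.IndexOf(later);
        return first >= 0 && second >= 0 && first < second;
    }
}

public static class Program
{
    public static void Main()
    {
        // Tool names taken from the Quick Start; the log itself is made up.
        var log = new List<string> { "SecurityTool", "FeatureTool" };

        Console.WriteLine(ToolOrder.CalledBefore(log, "SecurityTool", "FeatureTool")); // True
        Console.WriteLine(ToolOrder.CalledBefore(log, "FeatureTool", "SecurityTool")); // False
    }
}
```

A missing tool makes the check fail instead of passing vacuously, which is usually what a test author wants from an ordering assertion.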
Red Team Security Scanning
var result = await AttackPipeline.Create()
    .WithAllAttacks()
    .ScanAsync(agent);

result.Should().HaveOverallScoreAbove(85);
await result.ExportAsync("security-report.sarif", ExportFormat.Sarif);
Trace Record & Replay
Capture agent executions for deterministic replay, so no LLM calls are needed in CI:
// Record
await using var recorder = new TraceRecordingAgent(realAgent, "weather_test");
var response = await recorder.InvokeAsync("What's the weather?");
await TraceSerializer.SaveToFileAsync(recorder.Trace, "trace.json");
// Replay (deterministic, free)
var trace = await TraceSerializer.LoadFromFileAsync("trace.json");
var replayer = new TraceReplayingAgent(trace);
var replayed = await replayer.InvokeAsync("What's the weather?");
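The record/replay idea can be sketched independently of the library. The example below is an illustration of the pattern only, under stated assumptions: `ISimpleAgent`, `RecordingAgent`, `ReplayingAgent`, `TraceEntry`, and `FakeLiveAgent` are hypothetical types, not AgentEval or MAF APIs, and the real trace format is not shown in this document. A wrapper records each prompt/response pair, the trace round-trips through JSON, and a replayer serves the recorded responses without touching the live agent.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical minimal agent contract; not an AgentEval or MAF type.
public interface ISimpleAgent
{
    Task<string> InvokeAsync(string prompt);
}

// One recorded prompt/response pair.
public sealed record TraceEntry(string Prompt, string Response);

// Wraps a real agent and records every exchange in call order.
public sealed class RecordingAgent : ISimpleAgent
{
    private readonly ISimpleAgent _inner;
    public List<TraceEntry> Trace { get; } = new();

    public RecordingAgent(ISimpleAgent inner) => _inner = inner;

    public async Task<string> InvokeAsync(string prompt)
    {
        var response = await _inner.InvokeAsync(prompt);
        Trace.Add(new TraceEntry(prompt, response));
        return response;
    }
}

// Serves recorded responses in order and fails loudly on a prompt mismatch.
public sealed class ReplayingAgent : ISimpleAgent
{
    private readonly Queue<TraceEntry> _entries;

    public ReplayingAgent(IEnumerable<TraceEntry> trace) =>
        _entries = new Queue<TraceEntry>(trace);

    public Task<string> InvokeAsync(string prompt)
    {
        if (_entries.Count == 0)
            throw new InvalidOperationException("Trace exhausted: no recorded response left.");

        var entry = _entries.Dequeue();
        if (entry.Prompt != prompt)
            throw new InvalidOperationException(
                $"Prompt mismatch: expected '{entry.Prompt}', got '{prompt}'.");

        return Task.FromResult(entry.Response);
    }
}

// Stand-in for a live, non-deterministic agent.
public sealed class FakeLiveAgent : ISimpleAgent
{
    public Task<string> InvokeAsync(string prompt) =>
        Task.FromResult($"Live answer generated at {DateTime.UtcNow:O}");
}

public static class Program
{
    public static async Task Main()
    {
        var recorder = new RecordingAgent(new FakeLiveAgent());
        var live = await recorder.InvokeAsync("What's the weather?");

        // Round-trip through JSON, as a trace file on disk would.
        var json = JsonSerializer.Serialize(recorder.Trace);
        var loaded = JsonSerializer.Deserialize<List<TraceEntry>>(json)!;

        var replayer = new ReplayingAgent(loaded);
        var replayed = await replayer.InvokeAsync("What's the weather?");

        Console.WriteLine(live == replayed); // True: replay returns the recorded text
    }
}
```

Throwing on a prompt mismatch is a design choice in this sketch: it surfaces test drift immediately instead of returning a response that was recorded for a different input.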
Model Comparison
var result = await comparer.CompareModelsAsync(
    factories: [gpt4oFactory, gpt4oMiniFactory],
    testCases: testSuite,
    options: new ComparisonOptions(RunsPerModel: 5));

Console.WriteLine(result.ToMarkdown());
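Running each model several times matters because LLM output varies from run to run, so a single score can mislead. As a rough sketch of the kind of summary a multi-run comparison might compute, the example below reports the mean and sample standard deviation of per-run scores. `RunStats` is a hypothetical helper, the model names are placeholders, and the scores are made-up illustrative numbers; none of this reflects AgentEval's actual statistics or output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of AgentEval.
public static class RunStats
{
    // Mean and sample standard deviation (n - 1 denominator) of per-run scores.
    public static (double Mean, double StdDev) Summarize(IReadOnlyList<double> scores)
    {
        if (scores.Count == 0)
            throw new ArgumentException("At least one run is required.", nameof(scores));

        double mean = scores.Average();
        if (scores.Count == 1)
            return (mean, 0.0);

        double sumSquares = scores.Sum(s => (s - mean) * (s - mean));
        return (mean, Math.Sqrt(sumSquares / (scores.Count - 1)));
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up pass rates for five runs per model; illustrative only.
        var runs = new Dictionary<string, double[]>
        {
            ["model-a"] = new[] { 0.80, 0.90, 0.85, 0.80, 0.90 },
            ["model-b"] = new[] { 0.70, 0.95, 0.60, 0.90, 0.75 },
        };

        foreach (var (name, scores) in runs)
        {
            var (mean, stdDev) = RunStats.Summarize(scores);
            Console.WriteLine($"{name}: mean={mean:F3}, stddev={stdDev:F3}");
        }
    }
}
```

A model with a slightly lower mean but a much smaller spread may be the safer choice in production, which is the kind of trade-off that only shows up with more than one run.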
Quality Assurance
- Comprehensive evaluation suite targeting net8.0, net9.0, and net10.0
- All evaluations passing
Installation
dotnet add package AgentEval --prerelease
Single package, modular internals. AgentEval ships as one NuGet package containing 6 focused assemblies, including:
- AgentEval.Abstractions: Public contracts and interfaces
- AgentEval.Core: Metrics, assertions, comparison, tracing
- AgentEval.DataLoaders: Data loading and export (JSON, YAML, CSV, JSONL)
- AgentEval.MAF: Microsoft Agent Framework integration
- AgentEval.RedTeam: Security testing (multiple attack types and probes)
Service Registration
// Register all services at once (recommended):
services.AddAgentEvalAll();
// Or register selectively:
services.AddAgentEval(); // Core services only
services.AddAgentEvalDataLoaders(); // DataLoaders + Exporters
services.AddAgentEvalRedTeam(); // Red Team security testing
Documentation
- Getting Started
- Fluent Assertions
- Metrics Reference
- Red Team Security
- Trace Record & Replay
- Stochastic Evaluation
- Architecture
License
MIT License. See LICENSE for details.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - JsonSchema.Net (>= 7.3.4)
  - Microsoft.Agents.AI (>= 1.3.0)
  - Microsoft.Agents.AI.Workflows (>= 1.3.0)
  - Microsoft.Extensions.AI (>= 10.5.0)
  - Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
  - Microsoft.Extensions.DependencyInjection (>= 10.0.3)
  - OpenTelemetry.Api (>= 1.15.3)
  - PdfSharp-MigraDoc (>= 6.2.4)
  - QuestPDF (>= 2026.2.4)
  - System.Numerics.Tensors (>= 10.0.6)
  - YamlDotNet (>= 16.3.0)
- net8.0
  - JsonSchema.Net (>= 7.3.4)
  - Microsoft.Agents.AI (>= 1.3.0)
  - Microsoft.Agents.AI.Workflows (>= 1.3.0)
  - Microsoft.Extensions.AI (>= 10.5.0)
  - Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
  - Microsoft.Extensions.DependencyInjection (>= 10.0.3)
  - OpenTelemetry.Api (>= 1.15.3)
  - PdfSharp-MigraDoc (>= 6.2.4)
  - QuestPDF (>= 2026.2.4)
  - System.Numerics.Tensors (>= 10.0.6)
  - YamlDotNet (>= 16.3.0)
- net9.0
  - JsonSchema.Net (>= 7.3.4)
  - Microsoft.Agents.AI (>= 1.3.0)
  - Microsoft.Agents.AI.Workflows (>= 1.3.0)
  - Microsoft.Extensions.AI (>= 10.5.0)
  - Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
  - Microsoft.Extensions.DependencyInjection (>= 10.0.3)
  - OpenTelemetry.Api (>= 1.15.3)
  - PdfSharp-MigraDoc (>= 6.2.4)
  - QuestPDF (>= 2026.2.4)
  - System.Numerics.Tensors (>= 10.0.6)
  - YamlDotNet (>= 16.3.0)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.10.1-beta | 0 | 5/18/2026 |
| 0.10.0-beta | 39 | 5/17/2026 |
| 0.9.0-beta | 41 | 5/17/2026 |
| 0.8.1-beta | 495 | 4/29/2026 |
| 0.8.0-beta | 63 | 4/28/2026 |
| 0.6.0-beta | 1,051 | 3/5/2026 |
| 0.5.4-beta | 99 | 3/3/2026 |
| 0.5.3-beta | 124 | 3/1/2026 |
| 0.5.2-beta | 95 | 2/28/2026 |
| 0.5.1-beta | 86 | 2/28/2026 |
| 0.4.0-beta | 102 | 2/22/2026 |
| 0.3.0-beta | 143 | 1/25/2026 |
| 0.2.1-beta | 83 | 1/24/2026 |
| 0.2.0-beta | 80 | 1/18/2026 |
| 0.1.1-alpha | 92 | 1/3/2026 |
| 0.1.0-alpha | 84 | 1/3/2026 |
v0.10.0-beta: Unified Benchmarks Architecture (ADR-017).

Eight benchmark families now share a single discovery surface (`AgentEval.Benchmarks` namespace) and register canonically with `BenchmarkFamilyRegistry`: Agentic, GDPR, EU AI Act, OWASP LLM Top 10, MITRE ATLAS, LongMemEval (ICLR 2025), Performance, and Memory.

New:
- `OwaspBenchmark`, `MitreBenchmark`, and `LongMemEvalBenchmark` façades
- `bench --list` CLI command
- `bench perf {latency,throughput,cost}` CLI subcommand
- Per-family `bench {family} --help` enumeration

GDPR / EU AI Act benchmarks are promoted from `samples/` to `src/AgentEval.Compliance.{Gdpr,EuAiAct}` first-class product assemblies. `PerformanceBenchmark` is relocated to `AgentEval.Evals.Performance` with a Convention-2 `EvaluateAsync(EvalInput) → EvalResult` adapter so perf results flow through the standard `.agenteval/` workspace.

BREAKING:
- Internal compliance namespaces renamed `AgentEval.GdprBenchmark.*` → `AgentEval.Compliance.Gdpr.*` (same for EuAiAct).
- `LongMemEvalBenchmark.Full()` now throws when `LONGMEMEVAL_DATASET_PATH` is unset rather than silently degrading to the embedded subset.
- `PerformanceBenchmark` moves to its own assembly inside the umbrella.

Public preset-factory entry points at `AgentEval.Benchmarks.{Family}Benchmark` are unchanged. See CHANGELOG.md for the full Added×6 / Changed×7 / Breaking×3 entry list and ADR-017 for the four durable conventions (namespace, EvaluateAsync adapter, BenchmarkFamilyRegistry, Opus gate-reviews).
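Because `LongMemEvalBenchmark.Full()` now throws when `LONGMEMEVAL_DATASET_PATH` is unset, a test suite that relies on the full dataset may want to check the variable up front and fail with an actionable message. The sketch below shows one way to do that with the standard `Environment.GetEnvironmentVariable` call. `LongMemEvalPreflight` and `RequireDatasetPath` are hypothetical names, not part of AgentEval, and whether the variable points at a file or a directory is not stated in the release notes, so the sketch only checks that it is set.

```csharp
using System;

// Hypothetical preflight helper, not part of AgentEval.
public static class LongMemEvalPreflight
{
    public const string DatasetPathVariable = "LONGMEMEVAL_DATASET_PATH";

    // Returns the configured dataset path, or throws with an actionable message.
    public static string RequireDatasetPath()
    {
        var path = Environment.GetEnvironmentVariable(DatasetPathVariable);
        if (string.IsNullOrWhiteSpace(path))
        {
            throw new InvalidOperationException(
                $"{DatasetPathVariable} is not set. Set it before running the full LongMemEval benchmark.");
        }
        return path;
    }
}

public static class Program
{
    public static void Main()
    {
        try
        {
            var path = LongMemEvalPreflight.RequireDatasetPath();
            Console.WriteLine($"Dataset path configured: {path}");
            // A call to LongMemEvalBenchmark.Full() would go here.
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine($"Skipping full benchmark: {ex.Message}");
        }
    }
}
```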