AgentEval 0.10.1-beta
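- .NET CLI: `dotnet add package AgentEval --version 0.10.1-beta`
- Package Manager Console: `NuGet\Install-Package AgentEval -Version 0.10.1-beta`
- PackageReference (project file): `<PackageReference Include="AgentEval" Version="0.10.1-beta" />`
- Central Package Management: `<PackageVersion Include="AgentEval" Version="0.10.1-beta" />` in Directory.Packages.props, plus `<PackageReference Include="AgentEval" />` in the project file
- Paket CLI: `paket add AgentEval --version 0.10.1-beta`
- Script & Interactive: `#r "nuget: AgentEval, 0.10.1-beta"`
- File-based apps: `#:package AgentEval@0.10.1-beta`
- Cake addin: `#addin nuget:?package=AgentEval&version=0.10.1-beta&prerelease`
- Cake tool: `#tool nuget:?package=AgentEval&version=0.10.1-beta&prerelease`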
AgentEval
The .NET Evaluation Toolkit for AI Agents
Built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS and DeepEval do for Python, AgentEval does for .NET.
Features
- 🎯 Tool Tracking — Monitor tool/function calls with timing, arguments, and ordering
- ✅ Fluent Assertions — Expressive assertions with rich failure messages, `because` reasons, and assertion scopes
- 📊 Performance Metrics — TTFT, latency, tokens, cost estimation for 8+ models
- 🔬 RAG Metrics — Faithfulness, relevance, context precision/recall, answer correctness
- 🛡️ Red Team Security — 9 attack types, 192 probes, OWASP LLM Top 10 coverage
- ⚖️ Responsible AI — Toxicity, bias, and misinformation detection metrics
- 📈 Stochastic Evaluation — Statistical model comparison with multi-run analysis
- 🔄 Trace Record & Replay — Deterministic CI testing without LLM calls
- 🎯 Calibrated Evaluator — Multi-model consensus-driven scoring
- 🔌 Extensible — Adapter pattern for any agent framework
Quick Start
using AgentEval;
using AgentEval.MAF;
using AgentEval.Assertions;
// Create evaluation harness
var harness = new MAFEvaluationHarness(evaluatorClient);
// Run evaluation with tool tracking
var result = await harness.RunEvaluationAsync(agent, new TestCase
{
    Name = "Feature Planning Test",
    Input = "Plan a user authentication feature",
    EvaluationCriteria = ["Should include security considerations"]
});
// Assert tool usage with "because" reasons
result.ToolUsage!
    .Should()
    .HaveCalledTool("SecurityTool", because: "auth features require security review")
    .BeforeTool("FeatureTool")
    .WithoutError()
    .And()
    .HaveNoErrors();
// Assert performance
result.Performance!
    .Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(10))
    .HaveEstimatedCostUnder(0.10m);
Red Team Security Scanning
var result = await AttackPipeline.Create()
    .WithAllAttacks()
    .ScanAsync(agent);
result.Should().HaveOverallScoreAbove(85);
// Await the export so the report is fully written before continuing
await result.ExportAsync("security-report.sarif", ExportFormat.Sarif);
Trace Record & Replay
Capture agent executions for deterministic replay — no LLM calls needed in CI:
// Record
await using var recorder = new TraceRecordingAgent(realAgent, "weather_test");
var response = await recorder.InvokeAsync("What's the weather?");
await TraceSerializer.SaveToFileAsync(recorder.Trace, "trace.json");
// Replay (deterministic, free)
var trace = await TraceSerializer.LoadFromFileAsync("trace.json");
var replayer = new TraceReplayingAgent(trace);
var replayed = await replayer.InvokeAsync("What's the weather?");
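The record/replay idea above can be illustrated without the library. The following is a minimal sketch of the general pattern, not AgentEval's implementation: `ISimpleAgent`, `CannedAgent`, `RecordingAgent`, and `ReplayingAgent` are hypothetical names introduced here, and the trace is assumed to be a plain prompt-to-response dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical minimal agent contract; AgentEval's real interfaces may differ.
public interface ISimpleAgent
{
    Task<string> InvokeAsync(string prompt);
}

// Stand-in for a real, model-backed agent: always returns the same text.
public sealed class CannedAgent : ISimpleAgent
{
    private readonly string _reply;
    public CannedAgent(string reply) => _reply = reply;
    public Task<string> InvokeAsync(string prompt) => Task.FromResult(_reply);
}

// Wraps another agent and stores each prompt/response pair as it happens.
public sealed class RecordingAgent : ISimpleAgent
{
    private readonly ISimpleAgent _inner;
    public Dictionary<string, string> Trace { get; } = new();

    public RecordingAgent(ISimpleAgent inner) => _inner = inner;

    public async Task<string> InvokeAsync(string prompt)
    {
        var response = await _inner.InvokeAsync(prompt);
        Trace[prompt] = response; // last response wins for a repeated prompt
        return response;
    }
}

// Serves responses from a saved trace and never calls the wrapped agent.
public sealed class ReplayingAgent : ISimpleAgent
{
    private readonly IReadOnlyDictionary<string, string> _trace;

    public ReplayingAgent(IReadOnlyDictionary<string, string> trace) => _trace = trace;

    // Throws for prompts that were never recorded, so a CI run fails loudly
    // instead of silently falling back to a live call.
    public Task<string> InvokeAsync(string prompt) =>
        _trace.TryGetValue(prompt, out var response)
            ? Task.FromResult(response)
            : throw new KeyNotFoundException($"No recorded response for prompt: {prompt}");
}
```

In this sketch, a test would wrap the live agent in `RecordingAgent` once, persist `Trace`, and then construct `ReplayingAgent` from the saved data on every later run. Failing on an unrecorded prompt is a design choice made here for determinism; a real tool may handle that case differently.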
Model Comparison
var result = await comparer.CompareModelsAsync(
    factories: [gpt4oFactory, gpt4oMiniFactory],
    testCases: testSuite,
    options: new ComparisonOptions(RunsPerModel: 5));
Console.WriteLine(result.ToMarkdown());
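Running each model several times (`RunsPerModel: 5` above) only helps if the per-run results are aggregated. The sketch below shows one common way to do that, using the mean and sample standard deviation of per-run scores. It is an illustration under stated assumptions, not the statistics AgentEval computes: `RunStatistics` and `Summarize` are hypothetical names, and scores are assumed to be numeric (for example 1.0 for a pass, 0.0 for a fail).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RunStatistics
{
    // Summarises the per-run scores collected for one model.
    public static (double Mean, double StdDev) Summarize(IReadOnlyList<double> scores)
    {
        if (scores.Count == 0)
            throw new ArgumentException("At least one run is required.", nameof(scores));

        double mean = scores.Average();

        // Sample variance (divides by n - 1); defined as zero for a single run.
        double variance = scores.Count > 1
            ? scores.Sum(s => (s - mean) * (s - mean)) / (scores.Count - 1)
            : 0.0;

        return (mean, Math.Sqrt(variance));
    }
}
```

For example, calling `RunStatistics.Summarize(new[] { 1.0, 1.0, 0.0, 1.0, 1.0 })` gives a mean of 0.8. A large standard deviation relative to the gap between two models' means suggests that more runs are needed before declaring a winner.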
Quality Assurance
- Comprehensive evaluation suite targeting net8.0, net9.0, and net10.0
- All evaluations passing ✅
Installation
dotnet add package AgentEval --prerelease
Single package, modular internals: AgentEval ships as one NuGet package containing 6 focused assemblies, including:
- AgentEval.Abstractions: Public contracts and interfaces
- AgentEval.Core: Metrics, assertions, comparison, tracing
- AgentEval.DataLoaders: Data loading and export (JSON, YAML, CSV, JSONL)
- AgentEval.MAF: Microsoft Agent Framework integration
- AgentEval.RedTeam: Security testing (multiple attack types and probes)
Service Registration
// Register all services at once (recommended):
services.AddAgentEvalAll();
// Or register selectively:
services.AddAgentEval(); // Core services only
services.AddAgentEvalDataLoaders(); // DataLoaders + Exporters
services.AddAgentEvalRedTeam(); // Red Team security testing
Documentation
- Getting Started
- Fluent Assertions
- Metrics Reference
- Red Team Security
- Trace Record & Replay
- Stochastic Evaluation
- Architecture
License
MIT License — See LICENSE for details.
| Target framework | Status | Additional computed target frameworks |
|---|---|---|
| net8.0 | Compatible | net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows |
| net9.0 | Compatible | net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows |
| net10.0 | Compatible | net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows |
net10.0
- JsonSchema.Net (>= 7.3.4)
- Microsoft.Agents.AI (>= 1.3.0)
- Microsoft.Agents.AI.Workflows (>= 1.3.0)
- Microsoft.Extensions.AI (>= 10.5.0)
- Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
- Microsoft.Extensions.DependencyInjection (>= 10.0.3)
- OpenTelemetry.Api (>= 1.15.3)
- PdfSharp-MigraDoc (>= 6.2.4)
- QuestPDF (>= 2026.2.4)
- System.Numerics.Tensors (>= 10.0.6)
- YamlDotNet (>= 16.3.0)
net8.0
- JsonSchema.Net (>= 7.3.4)
- Microsoft.Agents.AI (>= 1.3.0)
- Microsoft.Agents.AI.Workflows (>= 1.3.0)
- Microsoft.Extensions.AI (>= 10.5.0)
- Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
- Microsoft.Extensions.DependencyInjection (>= 10.0.3)
- OpenTelemetry.Api (>= 1.15.3)
- PdfSharp-MigraDoc (>= 6.2.4)
- QuestPDF (>= 2026.2.4)
- System.Numerics.Tensors (>= 10.0.6)
- YamlDotNet (>= 16.3.0)
net9.0
- JsonSchema.Net (>= 7.3.4)
- Microsoft.Agents.AI (>= 1.3.0)
- Microsoft.Agents.AI.Workflows (>= 1.3.0)
- Microsoft.Extensions.AI (>= 10.5.0)
- Microsoft.Extensions.AI.Evaluation.Quality (>= 10.5.0)
- Microsoft.Extensions.DependencyInjection (>= 10.0.3)
- OpenTelemetry.Api (>= 1.15.3)
- PdfSharp-MigraDoc (>= 6.2.4)
- QuestPDF (>= 2026.2.4)
- System.Numerics.Tensors (>= 10.0.6)
- YamlDotNet (>= 16.3.0)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories (1)
Showing the top 1 popular GitHub repositories that depend on AgentEval:
| Repository | Stars |
|---|---|
| AgentEvalHQ/AgentEval: AgentEval is the comprehensive .NET toolkit for AI agent evaluation (tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison), built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo and DeepEval do for Python, AgentEval does for .NET. | |
| Version | Downloads | Last Updated |
|---|---|---|
| 0.10.1-beta | 75 | 5/18/2026 |
| 0.10.0-beta | 76 | 5/17/2026 |
| 0.9.0-beta | 50 | 5/17/2026 |
| 0.8.1-beta | 522 | 4/29/2026 |
| 0.8.0-beta | 63 | 4/28/2026 |
| 0.6.0-beta | 1,100 | 3/5/2026 |
| 0.5.4-beta | 100 | 3/3/2026 |
| 0.5.3-beta | 125 | 3/1/2026 |
| 0.5.2-beta | 96 | 2/28/2026 |
| 0.5.1-beta | 87 | 2/28/2026 |
| 0.4.0-beta | 103 | 2/22/2026 |
| 0.3.0-beta | 143 | 1/25/2026 |
| 0.2.1-beta | 83 | 1/24/2026 |
| 0.2.0-beta | 80 | 1/18/2026 |
| 0.1.1-alpha | 92 | 1/3/2026 |
| 0.1.0-alpha | 84 | 1/3/2026 |
v0.10.1-beta: Samples Consolidation + Generic Renderers.
- Adds IEvalResultRenderer in AgentEval.Abstractions with HtmlEvalResultRenderer (Core; self-contained HTML; XSS-safe; honest "NOT TESTED" for skipped leaves) and PdfEvalResultRenderer (new AgentEval.Rendering.Pdf project; QuestPDF-backed; cover + per-leaf + audit-chain appendix; embedded in umbrella via PrivateAssets="all").
- Consolidates the per-family `*.Demo` projects into samples/AgentEval.Samples/Benchmarks/: 10 focused examples in menu group H (Registry Discovery, Performance, Agentic, GDPR, EU AI Act, OWASP, MITRE, LongMemEval, Memory, Report Browser).
- Every running sample uses a real Azure-backed agent (no stubs, no hardcoded responses), and every grading sample (H3 onward) uses a real LLM judge; H2 Performance is metric-only (latency/throughput/cost) and does not create a judge.
- Compliance benchmarks probe the agent per scenario; AGENTEVAL_SAMPLES_PRESET=smoke|standard|audit-grade scales runtime; canonical run via FileSystemOutputStore to .agenteval/ (Mission Control + `agenteval doctor` see it) plus sidecar output/{family}/run-{ts}/ (JSON+HTML+PDF).
- Compliance reporters (GDPR/EuAiAct/OWASP/MITRE) invoked end-to-end with audit-chain anchoring.
- New OfferToOpenReports prompt + H10 Report Browser sample.
- MC bare-`dotnet run -- --workspace path` now honours the flag.
- ADR-017 four conventions still hold.
- BREAKING: LongMemEval is now real-data-only. `LongMemEvalDataLoader.LoadEmbedded(...)` removed (use `LoadResolved(...)`; catch `LongMemEvalDatasetNotFoundException`); `LongMemEvalDataLoader.ResolveDatasetPath(...)` now throws when a non-empty `explicitPath` or `LONGMEMEVAL_DATASET_PATH` env var points at a missing file (instead of silently falling back to a different dataset).
- See CHANGELOG.md "Breaking" subsection for migration steps.
v0.10.0-beta: Unified Benchmarks Architecture (ADR-017).
- Eight benchmark families now share a single discovery surface (`AgentEval.Benchmarks` namespace) and register canonically with `BenchmarkFamilyRegistry`: Agentic, GDPR, EU AI Act, OWASP LLM Top 10, MITRE ATLAS, LongMemEval (ICLR 2025), Performance, Memory.
- New: `OwaspBenchmark`, `MitreBenchmark`, `LongMemEvalBenchmark` façades; `bench --list` CLI command; `bench perf {latency,throughput,cost}` CLI subcommand; per-family `bench {family} --help` enumeration.
- GDPR / EU AI Act benchmarks promoted from `samples/` to `src/AgentEval.Compliance.{Gdpr,EuAiAct}` first-class product assemblies.
- `PerformanceBenchmark` relocated to `AgentEval.Evals.Performance` with a Convention-2 `EvaluateAsync(EvalInput) → EvalResult` adapter so perf results flow through the standard `.agenteval/` workspace.
- BREAKING: internal compliance namespaces renamed `AgentEval.GdprBenchmark.*` → `AgentEval.Compliance.Gdpr.*` (same for EuAiAct); `LongMemEvalBenchmark.Full()` now throws when `LONGMEMEVAL_DATASET_PATH` is unset rather than silently degrading to the embedded subset; `PerformanceBenchmark` moves to its own assembly inside the umbrella.
- Public preset-factory entry points at `AgentEval.Benchmarks.{Family}Benchmark` are unchanged.
- See CHANGELOG.md for the full Added×6 / Changed×7 / Breaking×3 entry list and ADR-017 for the four durable conventions (namespace, EvaluateAsync adapter, BenchmarkFamilyRegistry, Opus gate-reviews).