AiDotNet.Tensors
The fastest pure-managed .NET tensor library. Zero external library dependencies — no System.Numerics.Tensors, no MKL, no oneDNN. Every hot path is a hand-written AVX2/AVX-512 SIMD kernel in SimdKernels.cs / SimdGemm.cs / SimdConvHelper.cs. Beats NumSharp on every measured op and MathNet on all but the smallest matrices, and wins the majority of measured ops against ML.NET and TensorFlow.NET. Against libtorch (TorchSharp's hand-tuned C++ kernels), it wins on Mish (2.3× float, 2.2× double), GELU double (1.6×), Tanh float (1.4×, with Tanh double at parity), TensorMean, TensorMin, MaxPool2D, TensorAdd at 100K, and TensorAdd at 1M versus single-threaded torch — all in pure managed C# with hand-tuned AVX2/FMA SIMD kernels and JIT-compiled machine code.
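For a flavor of what such a kernel looks like, here is a minimal AVX2/FMA elementwise sketch using .NET's hardware intrinsics. This is illustrative only, not the actual SimdKernels.cs code (which adds 4× unrolling, alignment handling, and runtime dispatch), and it requires compiling with unsafe code enabled:
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
public static class FmaSketch
{
    // Computes c[i] = a[i] * b[i] + c[i], eight floats per AVX2/FMA iteration.
    public static unsafe void MultiplyAdd(float* a, float* b, float* c, int n)
    {
        int i = 0;
        if (Fma.IsSupported)
        {
            for (; i <= n - 8; i += 8)
            {
                var va = Avx.LoadVector256(a + i);
                var vb = Avx.LoadVector256(b + i);
                var vc = Avx.LoadVector256(c + i);
                Avx.Store(c + i, Fma.MultiplyAdd(va, vb, vc));
            }
        }
        for (; i < n; i++)               // scalar tail for the remainder
            c[i] = a[i] * b[i] + c[i];
    }
}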
Features
- Zero Allocations: In-place operations with `ArrayPool<T>` and `Span<T>` for hot paths
- Hand-Tuned SIMD: Custom AVX2/FMA kernels with 4× loop unrolling, not just `Vector<T>` wrappers
- JIT-Compiled Kernels: Runtime x86-64 machine code generation for size-specialized operations
- BLIS-Style GEMM: Tiled matrix multiply with an FMA micro-kernel and cache-aware panel packing
- GPU Acceleration: Optional CUDA, HIP/ROCm, and OpenCL support via separate packages
- Multi-Target: Supports .NET 10.0 and .NET Framework 4.7.1
- Generic Math: Works with any numeric type via the `INumericOperations<T>` interface (see the sketch below)
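A sketch of the generic-math pattern behind that last bullet, with hypothetical member names, since the library's actual `INumericOperations<T>` surface may differ:
using System;
// Hypothetical interface for illustration; AiDotNet's real
// INumericOperations<T> members may be named differently.
public interface INumericOps<T>
{
    T Zero { get; }
    T Add(T a, T b);
    T Multiply(T a, T b);
}
public static class GenericKernels
{
    // One dot-product body serves float, double, or any other numeric type.
    public static T Dot<T>(ReadOnlySpan<T> x, ReadOnlySpan<T> y, INumericOps<T> ops)
    {
        T acc = ops.Zero;
        for (int i = 0; i < x.Length; i++)
            acc = ops.Add(acc, ops.Multiply(x[i], y[i]));
        return acc;
    }
}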
Installation
# Core package (CPU SIMD acceleration)
dotnet add package AiDotNet.Tensors
# Optional: OpenBLAS for optimized CPU BLAS operations
dotnet add package AiDotNet.Native.OpenBLAS
# Optional: CLBlast for OpenCL GPU acceleration (AMD/Intel/NVIDIA)
dotnet add package AiDotNet.Native.CLBlast
# Optional: CUDA for NVIDIA GPU acceleration (requires NVIDIA GPU)
dotnet add package AiDotNet.Native.CUDA
Quick Start
using AiDotNet.Tensors.LinearAlgebra;
// Create vectors
var v1 = new Vector<double>(new[] { 1.0, 2.0, 3.0, 4.0 });
var v2 = new Vector<double>(new[] { 5.0, 6.0, 7.0, 8.0 });
// SIMD-accelerated operations
var sum = v1 + v2;
var dot = v1.Dot(v2);
// Create matrices
var m1 = new Matrix<double>(3, 3);
var m2 = Matrix<double>.Identity(3);
// Matrix operations
var product = m1 * m2;
var transpose = m1.Transpose();
CPU Benchmarks
All numbers from the latest BenchmarkDotNet run on AMD Ryzen 9 3950X (16 cores, AVX2/FMA, no AVX-512), .NET 10.0. Reproduce with:
dotnet run -c Release --project tests/AiDotNet.Tensors.Benchmarks --framework net10.0 -- --vs-all
The full per-op result set with error bars lives in tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md. The summary below is a hand-curated subset.
vs TorchSharp CPU (libtorch C++ backend)
Latest BDN run, post-#209 perf fixes — captured after removing
System.Numerics.Tensors entirely and routing every hot path through
our in-house SimdKernels. All comparisons are eager-vs-eager —
neither side uses torch.compile or AiDotNet compiled plans, so this
is libtorch's hand-rolled C++ kernels against AiDotNet's pure managed
C# + AVX2 SIMD. See
tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md
for the full per-op table with error bars.
Big wins — AiDotNet beats TorchSharp by 2× or more:
| Operation | Size | AiDotNet | TorchSharp | Speedup |
|---|---|---|---|---|
| Mish | 1M | 377 µs | 884 µs | 2.3× faster |
| Mish (double) | 1M | 1,038 µs | 2,313 µs | 2.2× faster |
Wins — AiDotNet beats TorchSharp:
| Operation | Size | AiDotNet | TorchSharp | Speedup |
|---|---|---|---|---|
| GELU (double) | 1M | 481 µs | 753 µs | 1.6× faster (was 3.6× behind!) |
| Tanh (double) | 1M | 586 µs | 627 µs | 1.07× faster (was 3.3× behind!) |
| Tanh (float) | 1M | 282 µs | 406 µs | 1.4× faster |
| TensorAdd | 100K | 33 µs | 42 µs | 1.3× faster |
| TensorMean | 1M | 189 µs | 243 µs | 1.3× faster |
| TensorAdd | 1M | 350 µs | 468 µs | 1.3× faster (vs single-threaded torch) |
| MaxPool2D | — | 250 µs | 285 µs | 1.1× faster |
| TensorMin | 1M | 205 µs | 215 µs | within noise (slight win) |
| TensorMultiply | 100K | 37 µs | 39 µs | within noise (slight win) |
Closer to parity — libtorch still ahead (ratio = AiDotNet time ÷ TorchSharp time), mostly within ~1.5×; Exp and Log (double) remain the largest gaps:
| Operation | Size | AiDotNet | TorchSharp | Ratio |
|---|---|---|---|---|
| ReLU | 1M | 261 µs | 191 µs | 1.4× |
| Sigmoid | 1M | 326 µs | 223 µs | 1.5× |
| TensorMaxValue | 1M | 195 µs | 189 µs | 1.03× |
| TensorExp | 1M | 296 µs | 306 µs | within noise |
| GELU (float) | 1M | 354 µs | 332 µs | 1.07× |
| TensorSum | 1M | 229 µs | 212 µs | 1.08× |
| TensorAbs | 1M | 362 µs | 221 µs | 1.6× |
| LeakyReLU | 1M | 409 µs | 273 µs | 1.5× |
| Exp (double) | 1M | 753 µs | 284 µs | 2.6× (was 4.3×) |
| Log (double) | 1M | 612 µs | 355 µs | 1.7× (was 16×!) |
Close-parity wins from the #209 fixes — validated against the pre-fix baseline by fresh BDN re-runs and same-process micro-benchmarks:
| Operation | Pre-fix | Post-fix | Improvement |
|---|---|---|---|
| Softmax_Double 512×1024 | 3,766 µs | 185 µs (slightly AHEAD of torch's 206!) | 20× faster |
| GELU_Double 1M | 2,782 µs | 481 µs (now 1.6× ahead of torch!) | 5.8× faster |
| Tanh_Double 1M | 2,067 µs | 586 µs (within noise of torch) | 3.5× faster |
| Log_Double 1M | 5,785 µs | 612 µs | 9.4× faster |
| Exp_Double 1M | 1,634 µs | 753 µs | 2.2× faster |
| LayerNorm 32k×64 | 1,347 µs | 890 µs | 1.5× faster |
| TensorAdd 1M | 480 µs | 350 µs | 1.4× faster |
| AttentionQKT 512×64 | 599 µs | 419 µs (parallel-M pre-transpose) | 1.4× faster |
| AttentionQKT 512×128 | (not measured) | 451 µs | (149 GFLOPS, parallel-M kernel) |
| MatMul 256³ | 510 µs | 196 µs (parallel-M SgemmDirect) | 2.6× faster |
| MatMul 512³ | 1,074 µs | 930 µs | 1.15× faster |
| Conv2D 1×16×64×64→32 | 458 µs (an interim naive 4-oc variant regressed to 764 µs) | 397 µs (Auto policy picks PerChannel) | back to baseline, 13% faster |
Residual tracked gaps — areas where libtorch's Intel MKL-DNN (with AVX-512 inner kernels on Intel hardware) still wins. These need multi-day kernel rewrites (single-pass register-resident LayerNorm, a fused QKᵀ attention kernel, BLIS-style 6×16 micro-kernel prefetch tuning; see the LayerNorm sketch after the table) and are left as follow-up work:
| Operation | Size | AiDotNet | TorchSharp | Ratio |
|---|---|---|---|---|
| TensorMatMul (float) | 256 | 196 µs (parallel-M SgemmDirect) | 109 µs | 1.8× — was 4.7× |
| TensorMatMul (float) | 512 | 930 µs | 534 µs | 1.7× — was 2.0× |
| LayerNorm | 32k×64 | 890 µs | 303 µs | 2.9× |
| BatchNorm | 32×64×32×32 | 2,201 µs | 745 µs | 3.0× |
| Conv2D (float) | 1×16×64×64→32 | ~397 µs (Auto picks PerChannel) | 310 µs | 1.3× — was 2.3× before A/B fix |
| Conv2D (double) | 4×3×32×32 | 438 µs | 115 µs | 3.8× — unchanged this PR |
| AttentionQKT | 512×64 | 419 µs (parallel-M pre-transpose) | 135 µs | 3.1× — was 4.3× |
| AttentionQKT | 512×128 | 451 µs (parallel-M) | — | 149 GFLOPS, was 1,102 µs |
| Softmax (double) | 512×1024 | 185 µs | 206 µs | slight win ✓ closed |
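On the single-pass LayerNorm follow-up item: the idea is to compute mean and variance in one sweep with Welford's online update and keep the accumulators register-resident, rather than making separate mean and variance passes. A scalar sketch of the recurrence, assuming the planned kernel vectorizes the same logic:
using System;
static void LayerNormRow(Span<float> row, float eps = 1e-5f)
{
    // Single sweep: Welford's update accumulates the running mean and M2
    // (sum of squared deviations) together; a SIMD version would hold
    // these accumulators in registers.
    double mean = 0.0, m2 = 0.0;
    for (int i = 0; i < row.Length; i++)
    {
        double delta = row[i] - mean;
        mean += delta / (i + 1);
        m2 += delta * (row[i] - mean);
    }
    // Second sweep applies the normalization.
    double invStd = 1.0 / Math.Sqrt(m2 / row.Length + eps);
    for (int i = 0; i < row.Length; i++)
        row[i] = (float)((row[i] - mean) * invStd);
}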
Zero-external-dependency policy. Every hot path runs through our hand-tuned SimdKernels AVX2/AVX-512 implementations. We deliberately do NOT reference System.Numerics.Tensors, MKL, MKL.NET, or oneDNN, both for supply-chain hygiene and because several TensorPrimitives entry points measured 4–20× slower than our in-house kernels on the Ryzen 9 3950X (details in the historical TensorPrimitives section below). All double- and single-precision paths go through the same hand-tuned SIMD kernels, with no fallback to any external library.
vs ML.NET (Microsoft.ML, eager-vs-eager)
Latest BDN run, validated post-#209-perf. Microsoft's general-purpose ML framework — same Ryzen 9 3950X, same .NET 10.0.7.
| Operation | Size | AiDotNet | ML.NET | Speedup |
|---|---|---|---|---|
| TensorMean | 1M | 80 µs | 180 µs | 2.2× faster |
| TensorSum | 1M | 92 µs | 104 µs | 1.1× faster |
| TensorAdd | 100K | 106 µs | 55 µs | 0.5× (memory-bound — ML.NET stayed allocator-warm) |
| TensorMultiply | 100K | 106 µs | 60 µs | 0.6× (memory-bound) |
| TensorAdd | 1M | 800 µs | 601 µs | 0.75× (memory-bound) |
| TensorMultiply | 1M | 782 µs | 595 µs | 0.76× (memory-bound) |
The 1M-element bulk ops are memory-bandwidth-bound: at ~50 GB/s sustained DRAM bandwidth on Zen 2, a 4 MB read + 4 MB read + 4 MB write = 12 MB of traffic per call → 240 µs theoretical floor before any allocator overhead. Both libraries are within 2× of that floor.
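The 240 µs floor, spelled out as code using the numbers from the paragraph above:
// A 1M-float add streams three 4 MB buffers (read a, read b, write c)
// through DRAM at ~50 GB/s sustained on Zen 2.
const long trafficBytes = 3L * 1_000_000 * sizeof(float);        // 12 MB per call
const double bandwidthBytesPerSec = 50e9;                        // ~50 GB/s
double floorMicros = trafficBytes / bandwidthBytesPerSec * 1e6;  // = 240 µs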
vs TensorFlow.NET CPU (eager-vs-eager)
Latest BDN run, validated post-#209-perf. SciSharp's TensorFlow .NET binding (eager mode, no graph compile). Same hardware. AiDotNet wins every measured op except small Conv2D, 256×256 MatMul, and 1M TensorMultiply (with ReLU within noise).
| Operation | Size | AiDotNet | TensorFlow.NET | Speedup |
|---|---|---|---|---|
| TensorSum | 1M | 77 µs | 259 µs | 3.4× faster |
| TensorMean | 1M | 76 µs | 189 µs | 2.5× faster |
| TensorMultiply | 100K | 119 µs | 202 µs | 1.7× faster |
| Sigmoid | 1M | 1,264 µs | 1,941 µs | 1.5× faster |
| TensorAdd | 100K | 141 µs | 211 µs | 1.5× faster |
| TensorMatMul | 512 | 1,286 µs | 1,554 µs | 1.2× faster |
| TensorAdd | 1M | 1,340 µs | 1,478 µs | 1.1× faster |
| ReLU | 1M | 1,680 µs | 1,606 µs | within noise (high stddev 713 µs) |
| TensorMultiply | 1M | 1,655 µs | 1,347 µs | 0.81× (memory-bound) |
| TensorMatMul | 256 | 432 µs | 398 µs | 0.92× |
| Conv2D | 4×3×32×32 | 719 µs | 428 µs | 0.6× |
The fresh validation run captured full data on bulk Add/Multiply and 256/512 MatMul; the original fcb7fea baseline showed NA for those shapes because SciSharp's TensorFlow.NET crashed on them, and later runtime versions stabilized.
vs MathNet.Numerics (Linear Algebra, double, N=1000)
| Operation | AiDotNet | MathNet | Speedup |
|---|---|---|---|
| Matrix Multiply 1000×1000 | 8.3 ms | 49.2 ms | 6× faster |
| Matrix Add | 1.87 ms | 2.50 ms | 1.3× faster |
| Matrix Subtract | 2.08 ms | 2.47 ms | 1.2× faster |
| Matrix Scalar Multiply | 1.66 ms | 2.14 ms | 1.3× faster |
| Transpose | 2.85 ms | 3.68 ms | 1.3× faster |
| Dot Product | 97 ns | 817 ns | 8.4× faster |
| L2 Norm | 92 ns | 11,552 ns | 125× faster |
vs NumSharp (N=1000)
| Operation | AiDotNet | NumSharp | Speedup |
|---|---|---|---|
| Matrix Multiply 1000×1000 | 8.3 ms | 26.5 s | 3,200× faster |
| Matrix Add | 1.87 ms | 1.98 ms | 1.1× faster |
| Transpose | 2.85 ms | 13.7 ms | 4.8× faster |
| Vector Add | 1.47 µs | 54.5 µs | 37× faster |
vs System.Numerics.Tensors.TensorPrimitives (historical — REMOVED)
We previously referenced System.Numerics.Tensors and benchmarked our
kernels against TensorPrimitives.* directly. As of #209 the dependency
is removed entirely — every elementwise op now runs through our
in-house SimdKernels, both for supply-chain hygiene and because we
measured several TensorPrimitives entry points to regress 4–20× vs our
in-house kernels on Ryzen 9 3950X (notably Tanh(float) ~20× slower,
Sigmoid(double) ~12× slower, Log(double) ~4× slower).
| Operation | AiDotNet | TensorPrimitives (raw) | Speedup |
|---|---|---|---|
| Sigmoid (1M, float) | 284 µs | 7,295 µs | 25× faster |
| TensorAdd (100K, float) | 24 µs | 138 µs | 5.7× faster |
| TensorAdd (1M, float) | 379 µs | 614 µs | 1.6× faster |
| TensorSum (1M, float) | 196 µs | 298 µs | 1.5× faster |
| Dot Product (1K, double, in-place) | 97 ns | 185 ns | 1.9× faster |
| L2 Norm (1K, double, in-place) | 92 ns | 187 ns | 2.0× faster |
Small Matrix Multiply (double)
| Size | AiDotNet | MathNet | NumSharp |
|---|---|---|---|
| 4×4 | 172 ns | 165 ns | 2,198 ns |
| 16×16 | 2.1 µs | 2.9 µs | 107.5 µs |
| 32×32 | 10.5 µs | 36.2 µs | 774.8 µs |
AiDotNet is 1.4× faster at 16×16 and 3.4× faster at 32×32 than MathNet.
SIMD Instruction Support
The library automatically detects and uses the best available SIMD instructions:
| Instruction Set | Vector Width | Minimum Runtime |
|---|---|---|
| AVX-512 | 512-bit (16 floats) | .NET 8+ |
| AVX2 + FMA | 256-bit (8 floats) | .NET 6+ |
| AVX | 256-bit (8 floats) | .NET 6+ |
| SSE4.2 | 128-bit (4 floats) | .NET 6+ |
| ARM NEON | 128-bit (4 floats) | .NET 6+ |
Check Available Acceleration
using AiDotNet.Tensors.Engines;
var caps = PlatformDetector.Capabilities;
// SIMD capabilities
Console.WriteLine($"AVX2: {caps.HasAVX2}");
Console.WriteLine($"AVX-512: {caps.HasAVX512F}");
// GPU support
Console.WriteLine($"CUDA: {caps.HasCudaSupport}");
Console.WriteLine($"OpenCL: {caps.HasOpenCLSupport}");
// Native library availability
Console.WriteLine($"OpenBLAS: {caps.HasOpenBlas}");
Console.WriteLine($"CLBlast: {caps.HasClBlast}");
// Or get a full status summary
Console.WriteLine(NativeLibraryDetector.GetStatusSummary());
Optional Acceleration Packages
AiDotNet.Native.OpenBLAS
Provides optimized CPU BLAS operations using OpenBLAS:
dotnet add package AiDotNet.Native.OpenBLAS
Performance: Accelerated BLAS operations for matrix multiply and decompositions.
AiDotNet.Native.CLBlast
Provides GPU acceleration via OpenCL (works on AMD, Intel, and NVIDIA GPUs):
dotnet add package AiDotNet.Native.CLBlast
Performance: 10×+ faster for large matrix operations on GPU.
AiDotNet.Native.CUDA
Provides GPU acceleration via NVIDIA CUDA (NVIDIA GPUs only):
dotnet add package AiDotNet.Native.CUDA
Performance: 30,000+ GFLOPS for matrix operations on modern NVIDIA GPUs.
Requirements:
- NVIDIA GPU (GeForce, Quadro, or Tesla)
- NVIDIA display driver 525.60+ (includes CUDA driver)
Usage with helpful error messages:
using AiDotNet.Tensors.Engines.DirectGpu.CUDA;
// Recommended: throws beginner-friendly exception if CUDA unavailable
using var cuda = CudaBackend.CreateOrThrow();
// Or check availability first
if (CudaBackend.IsCudaAvailable)
{
    using var backend = new CudaBackend();
    // Use CUDA acceleration
}
If CUDA is not available, you'll get detailed troubleshooting steps explaining exactly what's missing and how to fix it.
Requirements
- .NET 10.0 or .NET Framework 4.7.1+
- Windows x64, Linux x64, or macOS x64/arm64
License
Apache 2.0 - See LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
| .NET Framework | net471 is compatible. net472 was computed. net48 was computed. net481 was computed. |
Dependencies
.NET Framework 4.7.1
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.0)
- System.Text.Json (>= 8.0.5)
- System.Threading.Channels (>= 8.0.0)
net10.0
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.0)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on AiDotNet.Tensors:
| Package | Description |
|---|---|
| AiDotNet | A comprehensive .NET library for machine learning, deep learning, NLP, computer vision, and AI model serving. Licensed under BSL 1.1 — free for non-commercial use, community license available at aidotnet.dev. Model save/load requires a free or paid license after a 10-operation trial. Optional anonymous telemetry (opt-in via AIDOTNET_TELEMETRY=true) collects usage metrics — no PII or model data is collected. |
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.72.0 | 0 | 5/5/2026 |
| 0.71.0 | 42 | 5/4/2026 |
| 0.70.2 | 857 | 5/4/2026 |
| 0.70.1 | 67 | 5/3/2026 |
| 0.70.0 | 934 | 5/3/2026 |
| 0.69.5 | 49 | 5/3/2026 |
| 0.69.4 | 799 | 5/2/2026 |
| 0.69.3 | 166 | 5/2/2026 |
| 0.69.2 | 47 | 5/2/2026 |
| 0.69.1 | 371 | 5/1/2026 |
| 0.69.0 | 182 | 5/1/2026 |
| 0.68.0 | 901 | 4/30/2026 |
| 0.67.1 | 75 | 4/30/2026 |
| 0.67.0 | 76 | 4/30/2026 |
| 0.66.0 | 80 | 4/30/2026 |
| 0.65.0 | 86 | 4/29/2026 |
| 0.64.0 | 87 | 4/29/2026 |
| 0.63.0 | 82 | 4/29/2026 |
| 0.62.0 | 85 | 4/29/2026 |
| 0.61.0 | 90 | 4/29/2026 |