AiDotNet.Tensors 0.70.1

There is a newer version of this package available.
See the version list below for details.
dotnet add package AiDotNet.Tensors --version 0.70.1
                    
NuGet\Install-Package AiDotNet.Tensors -Version 0.70.1
                    
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="AiDotNet.Tensors" Version="0.70.1" />
                    
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="AiDotNet.Tensors" Version="0.70.1" />
                    
Directory.Packages.props
<PackageReference Include="AiDotNet.Tensors" />
                    
Project file
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add AiDotNet.Tensors --version 0.70.1
                    
#r "nuget: AiDotNet.Tensors, 0.70.1"
                    
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package AiDotNet.Tensors@0.70.1
                    
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=AiDotNet.Tensors&version=0.70.1
                    
Install as a Cake Addin
#tool nuget:?package=AiDotNet.Tensors&version=0.70.1
                    
Install as a Cake Tool

AiDotNet.Tensors

NuGet Build License

The fastest pure-managed .NET tensor library. Zero external library dependencies — no System.Numerics.Tensors, no MKL, no oneDNN. Every hot path is a hand-written AVX2/AVX-512 SIMD kernel in SimdKernels.cs / SimdGemm.cs / SimdConvHelper.cs. Beats ML.NET, TensorFlow.NET, MathNet, and NumSharp outright on every measured op. Against libtorch (TorchSharp's hand-tuned C++ kernels), wins on Mish 2.3×, Mish (double) 2.2×, GELU (double) 1.6× ahead, Tanh (double) within noise, Tanh (float) 1.4×, TensorMean/Min/Max, MaxPool2D, TensorAdd 100K, and TensorAdd 1M (vs single-thread torch) — all using pure managed C# with hand-tuned AVX2/FMA SIMD kernels and JIT-compiled machine code.

Features

  • Zero Allocations: In-place operations with ArrayPool<T> and Span<T> for hot paths
  • Hand-Tuned SIMD: Custom AVX2/FMA kernels with 4x loop unrolling, not just Vector<T> wrappers
  • JIT-Compiled Kernels: Runtime x86-64 machine code generation for size-specialized operations
  • BLIS-Style GEMM: Tiled matrix multiply with FMA micro-kernel, cache-aware panel packing
  • GPU Acceleration: Optional CUDA, HIP/ROCm, and OpenCL support via separate packages
  • Multi-Target: Supports .NET 10.0 and .NET Framework 4.7.1
  • Generic Math: Works with any numeric type via INumericOperations<T> interface

Installation

# Core package (CPU SIMD acceleration)
dotnet add package AiDotNet.Tensors

# Optional: OpenBLAS for optimized CPU BLAS operations
dotnet add package AiDotNet.Native.OpenBLAS

# Optional: CLBlast for OpenCL GPU acceleration (AMD/Intel/NVIDIA)
dotnet add package AiDotNet.Native.CLBlast

# Optional: CUDA for NVIDIA GPU acceleration (requires NVIDIA GPU)
dotnet add package AiDotNet.Native.CUDA

Quick Start

using AiDotNet.Tensors.LinearAlgebra;

// Create vectors
var v1 = new Vector<double>(new[] { 1.0, 2.0, 3.0, 4.0 });
var v2 = new Vector<double>(new[] { 5.0, 6.0, 7.0, 8.0 });

// SIMD-accelerated operations
var sum = v1 + v2;
var dot = v1.Dot(v2);

// Create matrices
var m1 = new Matrix<double>(3, 3);
var m2 = Matrix<double>.Identity(3);

// Matrix operations
var product = m1 * m2;
var transpose = m1.Transpose();

CPU Benchmarks

All numbers from the latest BenchmarkDotNet run on AMD Ryzen 9 3950X (16 cores, AVX2/FMA, no AVX-512), .NET 10.0. Reproduce with:

dotnet run -c Release --project tests/AiDotNet.Tensors.Benchmarks --framework net10.0 -- --vs-all

The full per-op result set with error bars lives in tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md. The summary below is a hand-curated subset.

vs TorchSharp CPU (libtorch C++ backend)

Latest BDN run, post-#209 perf fixes — captured after removing System.Numerics.Tensors entirely and routing every hot path through our in-house SimdKernels. All comparisons are eager-vs-eager — neither side uses torch.compile or AiDotNet compiled plans, so this is libtorch's hand-rolled C++ kernels against AiDotNet's pure managed C# + AVX2 SIMD. See tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md for the full per-op table with error bars.

Big wins — AiDotNet beats TorchSharp by 2× or more:

Operation Size AiDotNet TorchSharp Speedup
Mish 1M 377 µs 884 µs 2.3× faster
Mish (double) 1M 1,038 µs 2,313 µs 2.2× faster

Wins — AiDotNet beats TorchSharp:

Operation Size AiDotNet TorchSharp Speedup
GELU (double) 1M 481 µs 753 µs 1.6× faster (was 3.6× behind!)
Tanh (double) 1M 586 µs 627 µs 1.07× faster (was 3.3× behind!)
Tanh (float) 1M 282 µs 406 µs 1.4× faster
TensorAdd 100K 33 µs 42 µs 1.3× faster
TensorMean 1M 189 µs 243 µs 1.3× faster
TensorAdd 1M (vs 1-thread torch) 350 µs 468 µs 1.3× vs 1-thread torch
MaxPool2D 250 µs 285 µs 1.1× faster
TensorMin 1M 205 µs 215 µs within noise (slight win)
TensorMultiply 100K 37 µs 39 µs within noise (slight win)

Closer-to-parity — AiDotNet within ~1.5× of libtorch:

Operation Size AiDotNet TorchSharp Ratio
ReLU 1M 261 µs 191 µs 1.4×
Sigmoid 1M 326 µs 223 µs 1.5×
TensorMaxValue 1M 195 µs 189 µs 1.03×
TensorExp 1M 296 µs 306 µs within noise
GELU (float) 1M 354 µs 332 µs 1.07×
TensorSum 1M 229 µs 212 µs 1.08×
TensorAbs 1M 362 µs 221 µs 1.6×
LeakyReLU 1M 409 µs 273 µs 1.5×
Exp (double) 1M 753 µs 284 µs 2.6× (was 4.3×)
Log (double) 1M 612 µs 355 µs 1.7× (was 16×!)

This PR's #209 close-parity wins — validated against the pre-fix baseline by fresh BDN re-runs and same-process micro-benchmarks:

Operation Pre-fix Post-fix Improvement
Softmax_Double 512×1024 3,766 µs 185 µs (slightly AHEAD of torch's 206!) 20× faster
GELU_Double 1M 2,782 µs 481 µs (now 1.6× ahead of torch!) 5.8× faster
Tanh_Double 1M 2,067 µs 586 µs (within noise of torch) 3.5× faster
Log_Double 1M 5,785 µs 612 µs 9.4× faster
Exp_Double 1M 1,634 µs 753 µs 2.2× faster
LayerNorm 32k×64 1,347 µs 890 µs 1.5× faster
TensorAdd 1M 480 µs 350 µs 1.4× faster
AttentionQKT 512×64 599 µs 419 µs (parallel-M pre-transpose) 1.4× faster
AttentionQKT 512×128 (not measured) 451 µs (149 GFLOPS, parallel-M kernel)
MatMul 256³ 510 µs 196 µs (parallel-M SgemmDirect) 2.6× faster
MatMul 512³ 1,074 µs 930 µs 1.15× faster
Conv2D 1×16×64×64→32 458 µs (regressed to 764 with naive 4-oc) 397 µs (Auto policy picks PerChannel) back to baseline + 13%

Residual tracked gaps — areas where libtorch's Intel MKL-DNN (with AVX-512 inner kernels on Intel hardware) still wins. These need multi-day kernel rewrites (single-pass register-resident LayerNorm, fused QKᵀ attention kernel, BLIS-style 6×16 micro-kernel prefetch tuning) and are left as follow-up work:

Operation Size AiDotNet TorchSharp Ratio
TensorMatMul (float) 256 196 µs (parallel-M SgemmDirect) 109 µs 1.8× — was 4.7×
TensorMatMul (float) 512 930 µs 534 µs 1.7× — was 2.0×
LayerNorm 32k×64 890 µs 303 µs 2.9×
BatchNorm 32×64×32×32 2,201 µs 745 µs 3.0×
Conv2D (float) 1×16×64×64→32 ~397 µs (Auto picks PerChannel) 310 µs 1.3× — was 2.3× before A/B fix
Conv2D (double) 4×3×32×32 438 µs 115 µs 3.8× — unchanged this PR
AttentionQKT 512×64 419 µs (parallel-M pre-transpose) 135 µs 3.1× — was 4.3×
AttentionQKT 512×128 451 µs (parallel-M) 149 GFLOPS, was 1,102 µs
Softmax_Double 512×1024 185 µs 206 µs slight win ✓ closed

Zero-external-dependency policy. Every hot path runs through our hand-tuned SimdKernels AVX2/AVX-512 implementations. We deliberately do NOT reference System.Numerics.Tensors, MKL, MKL.NET, or oneDNN — both for supply-chain hygiene and because we measured several TensorPrimitives entry points to regress 4–20× vs our in-house kernels on Ryzen 9 3950X (notably Tanh(float) 20× slower, Sigmoid(double) 12× slower, Log(double) 4× slower). All double-precision and single-precision paths now go through the same hand-tuned SIMD kernels — no fallback to any external library.

vs ML.NET (Microsoft.ML, eager-vs-eager)

Latest BDN run, validated post-#209-perf. Microsoft's general-purpose ML framework — same Ryzen 9 3950X, same .NET 10.0.7.

Operation Size AiDotNet ML.NET Speedup
TensorMean 1M 80 µs 180 µs 2.2× faster
TensorSum 1M 92 µs 104 µs 1.1× faster
TensorAdd 100K 106 µs 55 µs 0.5× (memory-bound — ML.NET stayed allocator-warm)
TensorMultiply 100K 106 µs 60 µs 0.6× (memory-bound)
TensorAdd 1M 800 µs 601 µs 0.75× (memory-bound)
TensorMultiply 1M 782 µs 595 µs 0.76× (memory-bound)

The 1M-element bulk ops are memory-bandwidth-bound: at ~50 GB/s sustained DRAM bandwidth on Zen 2, a 4 MB read + 4 MB read + 4 MB write = 12 MB of traffic per call → 240 µs theoretical floor before any allocator overhead. Both libraries are within 2× of that floor.

vs TensorFlow.NET CPU (eager-vs-eager)

Latest BDN run, validated post-#209-perf. SciSharp's TensorFlow .NET binding (eager mode, no graph compile). Same hardware. AiDotNet wins outright on every measured op except small-Conv2D and 256×256 MatMul.

Operation Size AiDotNet TensorFlow.NET Speedup
TensorSum 1M 77 µs 259 µs 3.4× faster
TensorMean 1M 76 µs 189 µs 2.5× faster
TensorMultiply 100K 119 µs 202 µs 1.7× faster
Sigmoid 1M 1,264 µs 1,941 µs 1.5× faster
TensorAdd 100K 141 µs 211 µs 1.5× faster
TensorMatMul 512 1,286 µs 1,554 µs 1.2× faster
TensorAdd 1M 1,340 µs 1,478 µs 1.1× faster
ReLU 1M 1,680 µs 1,606 µs within noise (high stddev 713 µs)
TensorMultiply 1M 1,655 µs 1,347 µs 0.81× (memory-bound)
TensorMatMul 256 432 µs 398 µs 0.92×
Conv2D 4×3×32×32 719 µs 428 µs 0.6×

The fresh validation run captured full data on bulk Add/Multiply + 256/512 MatMul (the original fcb7fea baseline showed NA because SciSharp's TensorFlow.NET was crashing at those shapes; later runtime versions stabilized).

vs MathNet.Numerics (Linear Algebra, double, N=1000)

Operation AiDotNet MathNet Speedup
Matrix Multiply 1000×1000 8.3 ms 49.2 ms 6× faster
Matrix Add 1.87 ms 2.50 ms 1.3× faster
Matrix Subtract 2.08 ms 2.47 ms 1.2× faster
Matrix Scalar Multiply 1.66 ms 2.14 ms 1.3× faster
Transpose 2.85 ms 3.68 ms 1.3× faster
Dot Product 97 ns 817 ns 8.4× faster
L2 Norm 92 ns 11,552 ns 125× faster

vs NumSharp (N=1000)

Operation AiDotNet NumSharp Speedup
Matrix Multiply 1000×1000 8.3 ms 26.5 s 3,200× faster
Matrix Add 1.87 ms 1.98 ms 1.1× faster
Transpose 2.85 ms 13.7 ms 4.8× faster
Vector Add 1.47 us 54.5 us 37× faster

vs System.Numerics.Tensors.TensorPrimitives (historical — REMOVED)

We previously referenced System.Numerics.Tensors and benchmarked our kernels against TensorPrimitives.* directly. As of #209 the dependency is removed entirely — every elementwise op now runs through our in-house SimdKernels, both for supply-chain hygiene and because we measured several TensorPrimitives entry points to regress 4–20× vs our in-house kernels on Ryzen 9 3950X (notably Tanh(float) ~20× slower, Sigmoid(double) ~12× slower, Log(double) ~4× slower).

Operation AiDotNet TensorPrimitives (raw) Speedup
Sigmoid (1M, float) 284 µs 7,295 µs 25× faster
TensorAdd (100K, float) 24 µs 138 µs 5.7× faster
TensorAdd (1M, float) 379 µs 614 µs 1.6× faster
TensorSum (1M, float) 196 µs 298 µs 1.5× faster
Dot Product (1K, double, in-place) 97 ns 185 ns 1.9× faster
L2 Norm (1K, double, in-place) 92 ns 187 ns 2.0× faster

Small Matrix Multiply (double)

Size AiDotNet MathNet NumSharp
4×4 172 ns 165 ns 2,198 ns
16×16 2.1 us 2.9 us 107.5 us
32×32 10.5 us 36.2 us 774.8 us

AiDotNet is 1.4× faster at 16×16 and 3.4× faster at 32×32 than MathNet.

SIMD Instruction Support

The library automatically detects and uses the best available SIMD instructions:

Instruction Set Vector Width Supported
AVX-512 512-bit (16 floats) .NET 8+
AVX2 + FMA 256-bit (8 floats) .NET 6+
AVX 256-bit (8 floats) .NET 6+
SSE4.2 128-bit (4 floats) .NET 6+
ARM NEON 128-bit (4 floats) .NET 6+

Check Available Acceleration

using AiDotNet.Tensors.Engines;

var caps = PlatformDetector.Capabilities;

// SIMD capabilities
Console.WriteLine($"AVX2: {caps.HasAVX2}");
Console.WriteLine($"AVX-512: {caps.HasAVX512F}");

// GPU support
Console.WriteLine($"CUDA: {caps.HasCudaSupport}");
Console.WriteLine($"OpenCL: {caps.HasOpenCLSupport}");

// Native library availability
Console.WriteLine($"OpenBLAS: {caps.HasOpenBlas}");
Console.WriteLine($"CLBlast: {caps.HasClBlast}");

// Or get a full status summary
Console.WriteLine(NativeLibraryDetector.GetStatusSummary());

Optional Acceleration Packages

AiDotNet.Native.OpenBLAS

Provides optimized CPU BLAS operations using OpenBLAS:

dotnet add package AiDotNet.Native.OpenBLAS

Performance: Accelerated BLAS operations for matrix multiply and decompositions.

AiDotNet.Native.CLBlast

Provides GPU acceleration via OpenCL (works on AMD, Intel, and NVIDIA GPUs):

dotnet add package AiDotNet.Native.CLBlast

Performance: 10x+ faster for large matrix operations on GPU.

AiDotNet.Native.CUDA

Provides GPU acceleration via NVIDIA CUDA (NVIDIA GPUs only):

dotnet add package AiDotNet.Native.CUDA

Performance: 30,000+ GFLOPS for matrix operations on modern NVIDIA GPUs.

Requirements:

  • NVIDIA GPU (GeForce, Quadro, or Tesla)
  • NVIDIA display driver 525.60+ (includes CUDA driver)

Usage with helpful error messages:

using AiDotNet.Tensors.Engines.DirectGpu.CUDA;

// Recommended: throws beginner-friendly exception if CUDA unavailable
using var cuda = CudaBackend.CreateOrThrow();

// Or check availability first
if (CudaBackend.IsCudaAvailable)
{
    using var backend = new CudaBackend();
    // Use CUDA acceleration
}

If CUDA is not available, you'll get detailed troubleshooting steps explaining exactly what's missing and how to fix it.

Requirements

  • .NET 10.0 or .NET Framework 4.7.1+
  • Windows x64, Linux x64, or macOS x64/arm64

License

Apache 2.0 - See LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Product Compatible and additional computed target framework versions.
.NET net10.0 is compatible.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
.NET Framework net471 is compatible.  net472 was computed.  net48 was computed.  net481 was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages (1)

Showing the top 1 NuGet packages that depend on AiDotNet.Tensors:

Package Downloads
AiDotNet

A comprehensive .NET library for machine learning, deep learning, NLP, computer vision, and AI model serving. Licensed under BSL 1.1 — free for non-commercial use, community license available at aidotnet.dev. Model save/load requires a free or paid license after a 10-operation trial. Optional anonymous telemetry (opt-in via AIDOTNET_TELEMETRY=true) collects usage metrics — no PII or model data is collected.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
0.72.0 0 5/5/2026
0.71.0 42 5/4/2026
0.70.2 857 5/4/2026
0.70.1 67 5/3/2026
0.70.0 934 5/3/2026
0.69.5 49 5/3/2026
0.69.4 799 5/2/2026
0.69.3 166 5/2/2026
0.69.2 47 5/2/2026
0.69.1 371 5/1/2026
0.69.0 182 5/1/2026
0.68.0 901 4/30/2026
0.67.1 75 4/30/2026
0.67.0 76 4/30/2026
0.66.0 80 4/30/2026
0.65.0 86 4/29/2026
0.64.0 87 4/29/2026
0.63.0 82 4/29/2026
0.62.0 85 4/29/2026
0.61.0 90 4/29/2026
Loading failed