AiDotNet.Tensors 0.115.0

.NET 8.0 .NET Framework 4.7.1

dotnet add package AiDotNet.Tensors --version 0.115.0

NuGet\Install-Package AiDotNet.Tensors -Version 0.115.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="AiDotNet.Tensors" Version="0.115.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package:

<PackageVersion Include="AiDotNet.Tensors" Version="0.115.0" />

Then reference the package, without a version, in the project file:

<PackageReference Include="AiDotNet.Tensors" />

paket add AiDotNet.Tensors --version 0.115.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: AiDotNet.Tensors, 0.115.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package AiDotNet.Tensors@0.115.0

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=AiDotNet.Tensors&version=0.115.0

Install as a Cake Tool:

#tool nuget:?package=AiDotNet.Tensors&version=0.115.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

AiDotNet.Tensors

A high-performance .NET tensor library with hand-written AVX2/AVX-512 SIMD kernels in SimdKernels.cs, SimdGemm.cs, and SimdConvHelper.cs. Every hot path runs through our own managed C# kernels; we do NOT call into System.Numerics.Tensors, MKL.NET, or oneDNN through the standard wrappers. AiDotNet.Tensors beats ML.NET, TensorFlow.NET, MathNet, and NumSharp on most measured ops (the benchmark tables below list the exceptions, mainly memory-bound bulk ops and small Conv2D). Against libtorch (TorchSharp's hand-tuned C++ kernels), it wins on Mish (float) 2.3×, Mish (double) 2.2×, GELU (double) 1.6×, Tanh (float) 1.4×, TensorMean/Min, MaxPool2D, TensorAdd 100K, and TensorAdd 1M (vs single-thread torch), with Tanh (double) within noise. All of this uses pure managed C# with hand-tuned AVX2/FMA SIMD kernels and JIT-compiled machine code.

Note on dependencies. The .nupkg ships with the following PackageReferences: Microsoft.Extensions.Logging.Abstractions, System.Text.Json, System.Threading.Channels, K4os.Compression.LZ4 (LZ4 compression for serialized tensor blobs), AiDotNet.Native.OpenBLAS (transitive native OpenBLAS for fallback paths only — our SimdGemm beats it for d=128 transformer hot paths), and MKL via Microsoft.ML.Mkl.Redist (~66 MB on win-x64) + intelmkl.redist.win-x64 (~500 MB on win-x64) for the FP64 kernels that haven't yet been ported to pure-managed AVX2 (Phase 0 remediation work tracks the port). For air-gapped / federal deployments we ship a custom build with MKL/OpenBLAS removed and the entire telemetry namespace compiled out — see aidotnet.dev/enterprise for the Enterprise tier including air-gapped builds.

Performance numbers above assume net8.0+. On net471 the SIMD/intrinsics helpers are excluded (System.Runtime.Intrinsics is not available on .NET Framework); a custom net471 SIMD path that beats System.Numerics.Vector<T> is on the roadmap as Phase 5.

Features

Zero Allocations: In-place operations with ArrayPool<T> and Span<T> for hot paths
Hand-Tuned SIMD: Custom AVX2/FMA kernels with 4x loop unrolling, not just Vector<T> wrappers
JIT-Compiled Kernels: Runtime x86-64 machine code generation for size-specialized operations
BLIS-Style GEMM: Tiled matrix multiply with FMA micro-kernel, cache-aware panel packing
GPU Acceleration: Optional CUDA, HIP/ROCm, OpenCL, Metal, Vulkan, and WebGPU support via separate packages, with CPU-vs-GPU op-parity validated across every backend (#775)
Multi-Target: Supports .NET 8.0, .NET 10.0, and .NET Framework 4.7.1
Generic Math: Works with any numeric type via INumericOperations<T> interface
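
To make the "Hand-Tuned SIMD" bullet concrete, here is a minimal sketch of what an AVX2/FMA kernel with 4x loop unrolling can look like. It is not the library's actual SimdKernels code: the FmaSketch class and Axpy method are hypothetical names, and the only APIs used are the standard System.Runtime.Intrinsics ones. It needs AllowUnsafeBlocks enabled in the project.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration; not part of the AiDotNet.Tensors API.
public static class FmaSketch
{
    // Computes y[i] = a * x[i] + y[i] in place.
    public static unsafe void Axpy(float a, ReadOnlySpan<float> x, Span<float> y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("x and y must have the same length.");

        int n = x.Length;
        int i = 0;

        if (Fma.IsSupported)
        {
            Vector256<float> va = Vector256.Create(a); // broadcast a to all 8 lanes

            fixed (float* px = x)
            fixed (float* py = y)
            {
                // 4x unrolled: four independent 8-float FMAs (32 elements) per iteration,
                // which gives the CPU several FMA chains to overlap.
                for (; i <= n - 32; i += 32)
                {
                    Vector256<float> r0 = Fma.MultiplyAdd(va, Avx.LoadVector256(px + i), Avx.LoadVector256(py + i));
                    Vector256<float> r1 = Fma.MultiplyAdd(va, Avx.LoadVector256(px + i + 8), Avx.LoadVector256(py + i + 8));
                    Vector256<float> r2 = Fma.MultiplyAdd(va, Avx.LoadVector256(px + i + 16), Avx.LoadVector256(py + i + 16));
                    Vector256<float> r3 = Fma.MultiplyAdd(va, Avx.LoadVector256(px + i + 24), Avx.LoadVector256(py + i + 24));
                    Avx.Store(py + i, r0);
                    Avx.Store(py + i + 8, r1);
                    Avx.Store(py + i + 16, r2);
                    Avx.Store(py + i + 24, r3);
                }

                // Remaining full vectors, 8 elements at a time.
                for (; i <= n - 8; i += 8)
                {
                    Avx.Store(py + i, Fma.MultiplyAdd(va, Avx.LoadVector256(px + i), Avx.LoadVector256(py + i)));
                }
            }
        }

        // Scalar tail; also the complete fallback when FMA is unavailable.
        for (; i < n; i++)
        {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

Calling FmaSketch.Axpy(2f, x, y) on two equal-length float arrays updates y in place without allocating. The scalar loop handles any length that is not a multiple of 8.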

Installation

# Core package (CPU SIMD acceleration)
dotnet add package AiDotNet.Tensors

# Optional: OpenBLAS for optimized CPU BLAS operations
dotnet add package AiDotNet.Native.OpenBLAS

# Optional: CLBlast for OpenCL GPU acceleration (AMD/Intel/NVIDIA)
dotnet add package AiDotNet.Native.CLBlast

# Optional: CUDA for NVIDIA GPU acceleration (requires NVIDIA GPU)
dotnet add package AiDotNet.Native.CUDA

Quick Start

using AiDotNet.Tensors.LinearAlgebra;

// Create vectors
var v1 = new Vector<double>(new[] { 1.0, 2.0, 3.0, 4.0 });
var v2 = new Vector<double>(new[] { 5.0, 6.0, 7.0, 8.0 });

// SIMD-accelerated operations
var sum = v1 + v2;
var dot = v1.Dot(v2);

// Create matrices
var m1 = new Matrix<double>(3, 3);
var m2 = Matrix<double>.Identity(3);

// Matrix operations
var product = m1 * m2;
var transpose = m1.Transpose();

CPU Benchmarks

All numbers from the latest BenchmarkDotNet run on AMD Ryzen 9 3950X (16 cores, AVX2/FMA, no AVX-512), .NET 10.0. Reproduce with:

dotnet run -c Release --project tests/AiDotNet.Tensors.Benchmarks --framework net10.0 -- --vs-all

The full per-op result set with error bars lives in tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md. The summary below is a hand-curated subset.

vs TorchSharp CPU (libtorch C++ backend)

Latest BDN run, post-#209 perf fixes — captured after removing System.Numerics.Tensors entirely and routing every hot path through our in-house SimdKernels. All comparisons are eager-vs-eager — neither side uses torch.compile or AiDotNet compiled plans, so this is libtorch's hand-rolled C++ kernels against AiDotNet's pure managed C# + AVX2 SIMD. See tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md for the full per-op table with error bars.

Big wins — AiDotNet beats TorchSharp by 2× or more:

Operation	Size	AiDotNet	TorchSharp	Speedup
Mish	1M	377 µs	884 µs	2.3× faster
Mish (double)	1M	1,038 µs	2,313 µs	2.2× faster

Wins — AiDotNet beats TorchSharp:

Operation	Size	AiDotNet	TorchSharp	Speedup
GELU (double)	1M	481 µs	753 µs	1.6× faster (was 3.6× behind!)
Tanh (double)	1M	586 µs	627 µs	1.07× faster (was 3.3× behind!)
Tanh (float)	1M	282 µs	406 µs	1.4× faster
TensorAdd	100K	33 µs	42 µs	1.3× faster
TensorMean	1M	189 µs	243 µs	1.3× faster
TensorAdd	1M (vs 1-thread torch)	350 µs	468 µs	1.3× vs 1-thread torch
MaxPool2D	—	250 µs	285 µs	1.1× faster
TensorMin	1M	205 µs	215 µs	within noise (slight win)
TensorMultiply	100K	37 µs	39 µs	within noise (slight win)

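Near parity or behind (Ratio is AiDotNet time ÷ TorchSharp time, so values above 1× mean AiDotNet is slower). Most ops are within ~1.5× of libtorch; Exp (double) and Log (double) remain further behind: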

Operation	Size	AiDotNet	TorchSharp	Ratio
ReLU	1M	261 µs	191 µs	1.4×
Sigmoid	1M	326 µs	223 µs	1.5×
TensorMaxValue	1M	195 µs	189 µs	1.03×
TensorExp	1M	296 µs	306 µs	within noise
GELU (float)	1M	354 µs	332 µs	1.07×
TensorSum	1M	229 µs	212 µs	1.08×
TensorAbs	1M	362 µs	221 µs	1.6×
LeakyReLU	1M	409 µs	273 µs	1.5×
Exp (double)	1M	753 µs	284 µs	2.6× (was 4.3×)
Log (double)	1M	612 µs	355 µs	1.7× (was 16×!)

Improvements from the #209 perf fixes, validated against the pre-fix baseline by fresh BDN re-runs and same-process micro-benchmarks:

Operation	Pre-fix	Post-fix	Improvement
Softmax_Double 512×1024	3,766 µs	185 µs (slightly AHEAD of torch's 206!)	20× faster
GELU_Double 1M	2,782 µs	481 µs (now 1.6× ahead of torch!)	5.8× faster
Tanh_Double 1M	2,067 µs	586 µs (within noise of torch)	3.5× faster
Log_Double 1M	5,785 µs	612 µs	9.4× faster
Exp_Double 1M	1,634 µs	753 µs	2.2× faster
LayerNorm 32k×64	1,347 µs	890 µs	1.5× faster
TensorAdd 1M	480 µs	350 µs	1.4× faster
AttentionQKT 512×64	599 µs	419 µs (parallel-M pre-transpose)	1.4× faster
AttentionQKT 512×128	(not measured)	451 µs	(149 GFLOPS, parallel-M kernel)
MatMul 256³	510 µs	196 µs (parallel-M SgemmDirect)	2.6× faster
MatMul 512³	1,074 µs	930 µs	1.15× faster
Conv2D 1×16×64×64→32	458 µs (regressed to 764 with naive 4-oc)	397 µs (Auto policy picks PerChannel)	back to baseline + 13%

Residual tracked gaps — areas where libtorch's Intel MKL-DNN (with AVX-512 inner kernels on Intel hardware) still wins. These need multi-day kernel rewrites (single-pass register-resident LayerNorm, fused QKᵀ attention kernel, BLIS-style 6×16 micro-kernel prefetch tuning) and are left as follow-up work:

Operation	Size	AiDotNet	TorchSharp	Ratio
TensorMatMul (float)	256	196 µs (parallel-M SgemmDirect)	109 µs	1.8× — was 4.7×
TensorMatMul (float)	512	930 µs	534 µs	1.7× — was 2.0×
LayerNorm	32k×64	890 µs	303 µs	2.9×
BatchNorm	32×64×32×32	2,201 µs	745 µs	3.0×
Conv2D (float)	1×16×64×64→32	~397 µs (Auto picks PerChannel)	310 µs	1.3× — was 2.3× before A/B fix
Conv2D (double)	4×3×32×32	438 µs	115 µs	3.8× — unchanged this PR
AttentionQKT	512×64	419 µs (parallel-M pre-transpose)	135 µs	3.1× — was 4.3×
AttentionQKT	512×128	451 µs (parallel-M)	—	149 GFLOPS, was 1,102 µs
Softmax (double)	512×1024	185 µs	206 µs	slight win (gap closed)
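
The 149 GFLOPS figure for AttentionQKT 512×128 can be checked by hand. The sketch below assumes the shape means a 512×128 Q multiplied by a 128×512 Kᵀ, and counts one multiply and one add per inner-product term. Both conventions are assumptions rather than something the benchmark notes state.

```latex
\begin{aligned}
\text{FLOPs} &= 2 \cdot M \cdot N \cdot K = 2 \cdot 512 \cdot 512 \cdot 128 = 67{,}108{,}864 \\
\text{throughput} &= \frac{67{,}108{,}864\ \text{FLOP}}{451 \times 10^{-6}\ \text{s}} \approx 1.49 \times 10^{11}\ \text{FLOP/s} \approx 149\ \text{GFLOPS}
\end{aligned}
```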

Zero-external-dependency policy for hot paths. Every hot path runs through our hand-tuned SimdKernels AVX2/AVX-512 implementations. We deliberately do NOT call System.Numerics.Tensors, MKL.NET, or oneDNN from these paths, both for supply-chain hygiene and because we measured several TensorPrimitives entry points running 4–20× slower than our in-house kernels on Ryzen 9 3950X (notably Tanh(float) 20× slower, Sigmoid(double) 12× slower, Log(double) 4× slower). The double-precision and single-precision elementwise paths benchmarked above go through the same hand-tuned SIMD kernels. The package still lists MKL and OpenBLAS redistributables as dependencies for the cases described in the note on dependencies.

vs ML.NET (Microsoft.ML, eager-vs-eager)

Latest BDN run, validated post-#209-perf. Microsoft's general-purpose ML framework — same Ryzen 9 3950X, same .NET 10.0.7.

Operation	Size	AiDotNet	ML.NET	Speedup
TensorMean	1M	80 µs	180 µs	2.2× faster
TensorSum	1M	92 µs	104 µs	1.1× faster
TensorAdd	100K	106 µs	55 µs	0.5× (memory-bound — ML.NET stayed allocator-warm)
TensorMultiply	100K	106 µs	60 µs	0.6× (memory-bound)
TensorAdd	1M	800 µs	601 µs	0.75× (memory-bound)
TensorMultiply	1M	782 µs	595 µs	0.76× (memory-bound)

The 1M-element bulk ops are memory-bandwidth-bound: at ~50 GB/s sustained DRAM bandwidth on Zen 2, a 4 MB read + 4 MB read + 4 MB write = 12 MB of traffic per call, which gives a 240 µs theoretical floor before any allocator overhead. At 800 µs and 601 µs for TensorAdd 1M, AiDotNet and ML.NET sit at roughly 3.3× and 2.5× that floor.
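
The floor quoted above is a traffic-over-bandwidth estimate. A sketch of the arithmetic, assuming 1M float32 elements (4 bytes each), decimal megabytes, and that every byte is served from DRAM rather than cache:

```latex
\begin{aligned}
\text{traffic} &= (2\ \text{reads} + 1\ \text{write}) \times 10^{6}\ \text{elements} \times 4\ \text{B} = 12 \times 10^{6}\ \text{B} \\
t_{\min} &= \frac{12 \times 10^{6}\ \text{B}}{50 \times 10^{9}\ \text{B/s}} = 2.4 \times 10^{-4}\ \text{s} = 240\ \mu\text{s}
\end{aligned}
```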

vs TensorFlow.NET CPU (eager-vs-eager)

Latest BDN run, validated post-#209-perf. SciSharp's TensorFlow .NET binding (eager mode, no graph compile). Same hardware. AiDotNet wins on most measured ops; the exceptions are small Conv2D, 256×256 MatMul, and 1M TensorMultiply, with ReLU within noise.

Operation	Size	AiDotNet	TensorFlow.NET	Speedup
TensorSum	1M	77 µs	259 µs	3.4× faster
TensorMean	1M	76 µs	189 µs	2.5× faster
TensorMultiply	100K	119 µs	202 µs	1.7× faster
Sigmoid	1M	1,264 µs	1,941 µs	1.5× faster
TensorAdd	100K	141 µs	211 µs	1.5× faster
TensorMatMul	512	1,286 µs	1,554 µs	1.2× faster
TensorAdd	1M	1,340 µs	1,478 µs	1.1× faster
ReLU	1M	1,680 µs	1,606 µs	within noise (high stddev 713 µs)
TensorMultiply	1M	1,655 µs	1,347 µs	0.81× (memory-bound)
TensorMatMul	256	432 µs	398 µs	0.92×
Conv2D	4×3×32×32	719 µs	428 µs	0.6×

The fresh validation run captured full data on bulk Add/Multiply + 256/512 MatMul (the original fcb7fea baseline showed NA because SciSharp's TensorFlow.NET was crashing at those shapes; later runtime versions stabilized).

vs MathNet.Numerics (Linear Algebra, double, N=1000)

Operation	AiDotNet	MathNet	Speedup
Matrix Multiply 1000×1000	8.3 ms	49.2 ms	6× faster
Matrix Add	1.87 ms	2.50 ms	1.3× faster
Matrix Subtract	2.08 ms	2.47 ms	1.2× faster
Matrix Scalar Multiply	1.66 ms	2.14 ms	1.3× faster
Transpose	2.85 ms	3.68 ms	1.3× faster
Dot Product	97 ns	817 ns	8.4× faster
L2 Norm	92 ns	11,552 ns	125× faster

vs NumSharp (N=1000)

Operation	AiDotNet	NumSharp	Speedup
Matrix Multiply 1000×1000	8.3 ms	26.5 s	3,200× faster
Matrix Add	1.87 ms	1.98 ms	1.1× faster
Transpose	2.85 ms	13.7 ms	4.8× faster
Vector Add	1.47 µs	54.5 µs	37× faster

vs System.Numerics.Tensors.TensorPrimitives (historical — REMOVED)

We previously referenced System.Numerics.Tensors and benchmarked our kernels against TensorPrimitives.* directly. As of #209 the dependency is removed entirely: every elementwise op now runs through our in-house SimdKernels, both for supply-chain hygiene and because we measured several TensorPrimitives entry points running 4–20× slower than our in-house kernels on Ryzen 9 3950X (notably Tanh(float) ~20× slower, Sigmoid(double) ~12× slower, Log(double) ~4× slower).

Operation	AiDotNet	TensorPrimitives (raw)	Speedup
Sigmoid (1M, float)	284 µs	7,295 µs	25× faster
TensorAdd (100K, float)	24 µs	138 µs	5.7× faster
TensorAdd (1M, float)	379 µs	614 µs	1.6× faster
TensorSum (1M, float)	196 µs	298 µs	1.5× faster
Dot Product (1K, double, in-place)	97 ns	185 ns	1.9× faster
L2 Norm (1K, double, in-place)	92 ns	187 ns	2.0× faster

Small Matrix Multiply (double)

Size	AiDotNet	MathNet	NumSharp
4×4	172 ns	165 ns	2,198 ns
16×16	2.1 µs	2.9 µs	107.5 µs
32×32	10.5 µs	36.2 µs	774.8 µs

AiDotNet is 1.4× faster than MathNet at 16×16 and 3.4× faster at 32×32; at 4×4 the two are effectively tied (172 ns vs 165 ns).

SIMD Instruction Support

The library automatically detects and uses the best available SIMD instructions:

Instruction Set	Vector Width	Supported
AVX-512	512-bit (16 floats)	.NET 8+
AVX2 + FMA	256-bit (8 floats)	.NET 6+
AVX	256-bit (8 floats)	.NET 6+
SSE4.2	128-bit (4 floats)	.NET 6+
ARM NEON	128-bit (4 floats)	.NET 6+
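
As a rough illustration of how this kind of tiered detection can be written, the sketch below probes the standard .NET intrinsics flags from widest to narrowest. It is not the library's own detection code: SimdTierSketch and BestTier are hypothetical names, and the tier labels simply mirror the table above. Avx512F requires .NET 8 or later.

```csharp
using System;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration; not part of the AiDotNet.Tensors API.
public static class SimdTierSketch
{
    // Returns the widest instruction set the current CPU and runtime expose,
    // checking from widest to narrowest.
    public static string BestTier()
    {
        if (Avx512F.IsSupported) return "AVX-512";
        if (Avx2.IsSupported && Fma.IsSupported) return "AVX2 + FMA";
        if (Avx.IsSupported) return "AVX";
        if (Sse42.IsSupported) return "SSE4.2";
        if (AdvSimd.IsSupported) return "ARM NEON";
        return "Scalar";
    }
}
```

The JIT treats each IsSupported property as a constant, so a branch ladder like this typically adds no overhead on a hot path.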

Check Available Acceleration

using AiDotNet.Tensors.Engines;

var caps = PlatformDetector.Capabilities;

// SIMD capabilities
Console.WriteLine($"AVX2: {caps.HasAVX2}");
Console.WriteLine($"AVX-512: {caps.HasAVX512F}");

// GPU support
Console.WriteLine($"CUDA: {caps.HasCudaSupport}");
Console.WriteLine($"OpenCL: {caps.HasOpenCLSupport}");

// Native library availability
Console.WriteLine($"OpenBLAS: {caps.HasOpenBlas}");
Console.WriteLine($"CLBlast: {caps.HasClBlast}");

// Or get a full status summary
Console.WriteLine(NativeLibraryDetector.GetStatusSummary());

Optional Acceleration Packages

AiDotNet.Native.OpenBLAS

Provides optimized CPU BLAS operations using OpenBLAS:

dotnet add package AiDotNet.Native.OpenBLAS

Performance: Accelerated BLAS operations for matrix multiply and decompositions.

AiDotNet.Native.CLBlast

Provides GPU acceleration via OpenCL (works on AMD, Intel, and NVIDIA GPUs):

dotnet add package AiDotNet.Native.CLBlast

Performance: 10x+ faster for large matrix operations on GPU.

AiDotNet.Native.CUDA

Provides GPU acceleration via NVIDIA CUDA (NVIDIA GPUs only):

dotnet add package AiDotNet.Native.CUDA

Performance: 30,000+ GFLOPS for matrix operations on modern NVIDIA GPUs.

Requirements:

NVIDIA GPU (GeForce, Quadro, or Tesla)
NVIDIA display driver 525.60+ (includes CUDA driver)

Usage with helpful error messages:

using AiDotNet.Tensors.Engines.DirectGpu.CUDA;

// Recommended: throws beginner-friendly exception if CUDA unavailable
using var cuda = CudaBackend.CreateOrThrow();

// Or check availability first
if (CudaBackend.IsCudaAvailable)
{
    using var backend = new CudaBackend();
    // Use CUDA acceleration
}

If CUDA is not available, you'll get detailed troubleshooting steps explaining exactly what's missing and how to fix it.

Requirements

.NET 8.0, .NET 10.0, or .NET Framework 4.7.1+
Windows x64, Linux x64, or macOS x64/arm64

License

Apache 2.0 - See LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.
.NET Framework	net471 is compatible. net472 was computed. net48 was computed. net481 was computed.


.NETFramework 4.7.1
- AiDotNet.Native.OpenBLAS (>= 0.75.6)
- K4os.Compression.LZ4 (>= 1.3.8)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.0)
- Microsoft.ML.Mkl.Redist (>= 5.0.0)
- System.Text.Json (>= 8.0.5)
- System.Threading.Channels (>= 8.0.0)
net10.0
- AiDotNet.Native.OpenBLAS (>= 0.75.6)
- K4os.Compression.LZ4 (>= 1.3.8)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.0)
- Microsoft.ML.Mkl.Redist (>= 5.0.0)
net8.0
- AiDotNet.Native.OpenBLAS (>= 0.75.6)
- K4os.Compression.LZ4 (>= 1.3.8)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.0)
- Microsoft.ML.Mkl.Redist (>= 5.0.0)

NuGet packages (1)

Showing the top 1 NuGet packages that depend on AiDotNet.Tensors:

Package	Downloads
AiDotNet A comprehensive .NET library for machine learning, deep learning, NLP, computer vision, and AI model serving. Licensed under BSL 1.1 — free for non-commercial use, community license available at aidotnet.dev. Model save/load requires a free or paid license after a 10-operation trial. Optional anonymous telemetry (opt-in via AIDOTNET_TELEMETRY=true) collects usage metrics — no PII or model data is collected.	42.4K


GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
0.115.0	0	7/16/2026
0.114.5	644	7/15/2026
0.114.4	48	7/15/2026
0.114.3	1,138	7/13/2026
0.114.2	577	7/13/2026
0.114.1	242	7/13/2026
0.114.0	1,758	7/11/2026
0.113.0	1,059	7/10/2026
0.112.0	637	7/10/2026
0.111.2	1,016	7/9/2026
0.111.1	1,733	7/9/2026
0.111.0	413	7/7/2026
0.110.1	130	7/6/2026
0.110.0	1,437	7/5/2026
0.109.0	106	7/5/2026
0.108.1	112	7/4/2026
0.108.0	98	7/2/2026
0.107.0	102	7/2/2026
0.106.1	6,243	7/1/2026
0.106.0	216	6/30/2026