DotCompute.Backends.CUDA 0.6.2


DotCompute.Backends.CUDA

Production-ready NVIDIA GPU compute backend for .NET 9+ with full CUDA support.

Status: ✅ Production Ready

The CUDA backend provides GPU acceleration through:

  • Complete CUDA Integration: Full NVRTC and CUDA Runtime API support
  • Multi-GPU Support: Device enumeration and P2P transfers
  • Ring Kernel Support: Persistent kernels with P2P, NCCL, and shared memory messaging
  • Compute Capability Detection: Support for CC 5.0+ GPUs
  • Memory Management: Device memory allocation and unified memory
  • Native AOT Compatible: Full compatibility with Native AOT compilation

Key Features

CUDA Runtime Integration

  • Device Management: Multi-GPU enumeration and selection
  • Kernel Compilation: CUDA C/C++ to PTX compilation via NVRTC
  • Execution Engine: Stream management and asynchronous execution
  • Memory Operations: Device memory allocation and host-device transfers

Performance Optimizations

  • P2P Transfers: Direct GPU-to-GPU memory copying via NVLink
  • Unified Memory: Automatic data migration between CPU/GPU
  • Stream Processing: Asynchronous kernel execution
  • Memory Pooling: Device memory reuse to minimize allocation overhead
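
A pool in this style keeps freed device buffers in per-size free lists instead of returning them to the driver, so repeated same-size allocations skip a fresh device allocation entirely. A simplified sketch of the host-side bookkeeping (the class and its API are illustrative, not part of the package):

```csharp
using System;
using System.Collections.Generic;

// Illustrative pooling bookkeeping: freed device pointers are parked in a
// free list keyed by allocation size and reused before allocating anew.
public sealed class DeviceBufferPool
{
    private readonly Dictionary<long, Stack<IntPtr>> _free = new();

    public bool TryRent(long sizeInBytes, out IntPtr ptr)
    {
        if (_free.TryGetValue(sizeInBytes, out var stack) && stack.Count > 0)
        {
            ptr = stack.Pop();   // hit: reuse, no device allocation needed
            return true;
        }
        ptr = IntPtr.Zero;       // miss: caller allocates fresh device memory
        return false;
    }

    public void Return(long sizeInBytes, IntPtr ptr)
    {
        if (!_free.TryGetValue(sizeInBytes, out var stack))
            _free[sizeInBytes] = stack = new Stack<IntPtr>();
        stack.Push(ptr);
    }
}
```

Exact-size bucketing is the simplest policy; real pools often round sizes up to power-of-two buckets to raise hit rates at the cost of some internal fragmentation.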

Hardware Support

  • Tested Hardware: RTX 2000 Ada Generation (CC 8.9)
  • Minimum Requirement: Compute Capability 5.0+ (Maxwell architecture)
  • Driver Support: CUDA Toolkit 12.0+ with compatible drivers
  • Multi-GPU: Full support for multi-GPU systems with NVLink
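
The CC 5.0 floor above is a (major, minor) comparison; a minimal sketch of the gate (the helper is illustrative, not a package API):

```csharp
// Compute capability is reported as a (major, minor) pair, e.g. (8, 9) for
// the RTX 2000 Ada Generation. The backend requires CC 5.0+ (Maxwell).
public static class ComputeCapability
{
    public const int MinimumMajor = 5;
    public const int MinimumMinor = 0;

    // Numeric comparison, not string comparison ("12.0" < "5.0" as strings).
    public static bool MeetsMinimum(int major, int minor) =>
        major > MinimumMajor || (major == MinimumMajor && minor >= MinimumMinor);
}
```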

Installation

dotnet add package DotCompute.Backends.CUDA --version 0.6.2

Usage

Basic Setup

using DotCompute.Backends.CUDA;
using Microsoft.Extensions.Logging;

var logger = LoggerFactory.Create(builder => builder.AddConsole())
    .CreateLogger<CudaAccelerator>();

var accelerator = new CudaAccelerator(logger);

// Check availability before initialization
if (await accelerator.IsAvailableAsync())
{
    await accelerator.InitializeAsync();
    // GPU is ready for compute operations
}

Service Registration

services.AddSingleton<IAccelerator, CudaAccelerator>();
// OR
services.AddCudaBackend(); // Includes automatic GPU detection

Kernel Compilation and Execution

var kernelDef = new KernelDefinition
{
    Name = "VectorAdd",
    Source = @"
        extern ""C"" __global__ 
        void vector_add(float* a, float* b, float* result, int n)
        {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx < n) {
                result[idx] = a[idx] + b[idx];
            }
        }
    ",
    EntryPoint = "vector_add"
};

var compiledKernel = await accelerator.CompileKernelAsync(kernelDef);

// Execute with launch parameters
var launchParams = new KernelLaunchParameters
{
    GridDim = new Dim3((uint)((length + 255) / 256), 1, 1),
    BlockDim = new Dim3(256, 1, 1),
    SharedMemorySize = 0
};

await compiledKernel.ExecuteAsync(parameters, launchParams);
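
The `GridDim` expression above, `(length + 255) / 256`, is integer ceiling division: it launches enough 256-thread blocks to cover every element, and the kernel's `if (idx < n)` guard masks off the excess threads in the final block. A small helper (illustrative, not a package API) generalizes the idiom:

```csharp
public static class LaunchMath
{
    // Smallest number of blocks of 'blockSize' threads that covers 'length'
    // elements: ceiling division in integer arithmetic.
    public static uint GridSize(int length, int blockSize) =>
        (uint)((length + blockSize - 1) / blockSize);
}
```

For 1,000,000 elements at 256 threads per block this gives 3,907 blocks: 3,906 full blocks plus one partial block of 64 active threads.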

Memory Management

// Allocate device memory
var deviceBuffer = await accelerator.AllocateAsync<float>(1_000_000);

// Copy data to GPU
await deviceBuffer.CopyFromAsync(hostData);

// Use in kernel execution
var parameters = new { 
    input = deviceBuffer,
    output = outputBuffer,
    length = 1_000_000
};

await compiledKernel.ExecuteAsync(parameters);

// Copy results back
await deviceBuffer.CopyToAsync(hostResults);

Architecture

Device Management

The CUDA backend automatically handles:

  • Device Detection: Enumerates all CUDA-capable GPUs
  • Capability Checking: Validates compute capability requirements
  • Context Management: Creates and manages CUDA contexts
  • Multi-GPU Coordination: Handles device selection and P2P setup

Compilation Pipeline

  1. Source Validation: Validates CUDA C/C++ kernel source
  2. NVRTC Compilation: Compiles to PTX intermediate representation
  3. Module Loading: Loads PTX into CUDA context
  4. Function Extraction: Extracts kernel function references
  5. Caching: Stores compiled modules for reuse
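
Step 5 implies a cache key that changes whenever recompilation could produce different PTX. A minimal sketch (the key layout is an assumption, not the package's actual scheme) hashes the source together with the entry point and target architecture:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class KernelCacheKey
{
    // Two compilations may share a cached module only if source, entry point,
    // and target architecture (e.g. "compute_89") all match exactly.
    public static string Create(string source, string entryPoint, string targetArch)
    {
        byte[] payload = Encoding.UTF8.GetBytes($"{entryPoint}\0{targetArch}\0{source}");
        return Convert.ToHexString(SHA256.HashData(payload));
    }
}
```

The `\0` separators prevent ambiguous concatenations (e.g. entry point "a" plus arch "bc" colliding with "ab" plus "c").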

Memory System

  • Device Allocation: cuMemAlloc for device memory
  • Host-Pinned Memory: cudaMallocHost for efficient transfers
  • Unified Memory: cudaMallocManaged for automatic migration
  • P2P Transfers: Direct memory copying between GPUs

Performance Benchmarks

Tested on RTX 2000 Ada Generation (8GB VRAM, CC 8.9):

Operation     Data Size    CPU Time   GPU Time   Speedup
Vector Add    10M floats   45ms       2.1ms      21x
Matrix Mult   2048x2048    8.2s       89ms       92x
FFT           1M complex   156ms      8.4ms      18x
Reduction     10M floats   38ms       1.8ms      21x

Memory Bandwidth

  • Host to Device: ~12 GB/s (PCIe 4.0 x16)
  • Device Memory: ~450 GB/s (GDDR6)
  • P2P Transfer: ~25 GB/s (NVLink when available)
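
These figures can be cross-checked against the benchmark table above: the 10M-float vector add moves three arrays (two reads, one write), and its implied effective bandwidth comes out well below the 450 GB/s peak because launch overhead dominates such a short kernel. The arithmetic:

```csharp
using System;

// Vector add on n floats touches 3 arrays: 2 reads + 1 write.
const long n = 10_000_000;
const double bytesMoved = 3.0 * n * sizeof(float);  // 120,000,000 bytes
const double gpuTime = 2.1e-3;                      // 2.1 ms from the table

double effectiveGBps = bytesMoved / gpuTime / 1e9;
Console.WriteLine($"Effective bandwidth: {effectiveGBps:F1} GB/s");  // ≈ 57.1 GB/s
```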

System Requirements

Hardware Requirements

  • NVIDIA GPU: Compute Capability 5.0 or higher
  • Memory: Minimum 2GB VRAM, 8GB+ recommended
  • PCIe: PCIe 3.0 x16 or better for optimal performance

Software Requirements

  • CUDA Toolkit: 12.0 or later
  • Driver Version: 535.54 or later
  • Operating System: Windows 10+, Linux, or WSL2

Supported GPUs

  • RTX Series: RTX 2000+, RTX 3000+, RTX 4000+ (CC 7.5-8.9)
  • GTX Series: GTX 1050+ (CC 6.1+)
  • Quadro/Tesla: Most modern professional GPUs
  • Data Center: A100, H100, V100 series

Configuration

Environment Variables

# CUDA installation path (auto-detected)
export CUDA_PATH="/usr/local/cuda"

# Enable additional logging
export DOTCOMPUTE_CUDA_VERBOSE=1

# Force specific GPU device
export CUDA_VISIBLE_DEVICES=0

Compilation Options

var options = new CompilationOptions
{
    OptimizationLevel = OptimizationLevel.O3, // Maximum optimization
    EnableDebugInfo = false, // Disable for production
    TargetArchitecture = "compute_75", // Target specific CC
    CustomOptions = new[] { "--use_fast_math" }
};

var kernel = await accelerator.CompileKernelAsync(definition, options);

Troubleshooting

Common Issues

GPU Not Detected
  1. Check Driver: Verify NVIDIA drivers are installed and up-to-date
  2. CUDA Toolkit: Ensure CUDA Toolkit is properly installed
  3. System Path: Verify CUDA binaries are in system PATH
  4. Compute Mode: Check GPU is not in prohibited compute mode
Compilation Failures
  1. Kernel Source: Validate CUDA C/C++ syntax
  2. Include Paths: Ensure CUDA headers are accessible
  3. Compute Capability: Match target architecture to GPU capabilities
  4. Memory Limits: Check kernel doesn't exceed resource limits
Performance Issues
  1. Memory Bandwidth: Optimize memory access patterns
  2. Occupancy: Balance threads per block with resource usage
  3. Data Transfer: Minimize host-device memory copies
  4. Kernel Launch: Reduce kernel launch overhead with batching

Debug Tools

// Enable detailed CUDA logging
var logger = LoggerFactory.Create(builder => 
    builder.AddConsole().SetMinimumLevel(LogLevel.Trace));

// Get detailed GPU information
var info = await accelerator.GetDeviceInfoAsync();
Console.WriteLine($"GPU: {info.Name}, Memory: {info.TotalMemory / (1024*1024*1024)}GB");

// Profile kernel execution
var stopwatch = Stopwatch.StartNew();
await kernel.ExecuteAsync(parameters);
await accelerator.SynchronizeAsync();
stopwatch.Stop();
Console.WriteLine($"Kernel time: {stopwatch.ElapsedMilliseconds}ms");

Advanced Features

Multi-GPU Programming

// Enumerate all available GPUs
var devices = await CudaAccelerator.GetAvailableDevicesAsync();

// Create accelerators for each GPU
var accelerators = new List<CudaAccelerator>();
foreach (var device in devices)
{
    var acc = new CudaAccelerator(device, logger);
    await acc.InitializeAsync();
    accelerators.Add(acc);
}

// Enable P2P access between GPUs
await CudaAccelerator.EnablePeerAccessAsync(accelerators[0], accelerators[1]);
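
With P2P enabled, a common pattern is to split one large workload into contiguous per-GPU ranges. A helper for the partitioning step (illustrative, not part of the package API):

```csharp
public static class MultiGpu
{
    // Split 'length' elements over 'deviceCount' GPUs as evenly as possible:
    // the first (length % deviceCount) chunks get one extra element.
    public static (int Offset, int Count)[] Partition(int length, int deviceCount)
    {
        var chunks = new (int Offset, int Count)[deviceCount];
        int baseSize = length / deviceCount;
        int remainder = length % deviceCount;
        int offset = 0;
        for (int i = 0; i < deviceCount; i++)
        {
            int count = baseSize + (i < remainder ? 1 : 0);
            chunks[i] = (offset, count);
            offset += count;
        }
        return chunks;
    }
}
```

Each accelerator then gets a kernel launch over its own (offset, count) slice, with P2P covering any cross-GPU reads.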

Unified Memory

// Allocate unified memory (accessible from CPU and GPU)
var unifiedBuffer = await accelerator.AllocateUnifiedAsync<float>(size);

// CPU can access directly
for (int i = 0; i < size; i++)
    unifiedBuffer[i] = i * 2.0f;

// GPU kernels can access the same memory
await kernel.ExecuteAsync(new { data = unifiedBuffer, size });

// Results automatically available on CPU
Console.WriteLine($"Result: {unifiedBuffer[0]}");

Documentation & Resources

Comprehensive documentation is available for DotCompute:

  • Architecture Documentation
  • Developer Guides
  • Examples
  • API Documentation
  • Support

Contributing

The CUDA backend welcomes contributions in:

  • New GPU architecture support (Ada Lovelace, Hopper)
  • Advanced CUDA features (streams, events, graphs)
  • Performance optimizations and benchmarks
  • Multi-GPU coordination improvements

See CONTRIBUTING.md for development guidelines.

Compatible and computed target framework versions:

  • Compatible: net9.0
  • Computed: net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows

NuGet packages (4)

Showing the top 4 NuGet packages that depend on DotCompute.Backends.CUDA:

DotCompute.Runtime

Runtime services and dependency injection integration for DotCompute. Provides kernel execution orchestration, automatic kernel discovery, service registration, and DI container integration with Microsoft.Extensions.DependencyInjection. Production-ready with comprehensive service lifetime management.

DotCompute.Linq

GPU-accelerated LINQ extensions for DotCompute. Transparent GPU execution for LINQ queries with automatic kernel generation, fusion optimization, and Reactive Extensions support.

Orleans.GpuBridge.Runtime

Runtime implementation for Orleans GPU Bridge - provides kernel catalog, memory management, and provider selection

Orleans.GpuBridge.Backends.DotCompute

DotCompute backend provider for Orleans.GpuBridge.Core - Enables GPU acceleration via CUDA, OpenCL, Metal, and CPU with attribute-based kernel definition.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version     Downloads   Last Updated
0.6.2       128         2/9/2026
0.5.3       299         2/2/2026
0.5.2       644         12/8/2025
0.5.1       620         11/28/2025
0.5.0       224         11/27/2025
0.4.2-rc2   395         11/11/2025
0.4.1-rc2   340         11/6/2025