DotCompute.Backends.CUDA
0.6.2
dotnet add package DotCompute.Backends.CUDA --version 0.6.2
NuGet\Install-Package DotCompute.Backends.CUDA -Version 0.6.2
<PackageReference Include="DotCompute.Backends.CUDA" Version="0.6.2" />
<PackageVersion Include="DotCompute.Backends.CUDA" Version="0.6.2" />
<PackageReference Include="DotCompute.Backends.CUDA" />
paket add DotCompute.Backends.CUDA --version 0.6.2
#r "nuget: DotCompute.Backends.CUDA, 0.6.2"
#:package DotCompute.Backends.CUDA@0.6.2
#addin nuget:?package=DotCompute.Backends.CUDA&version=0.6.2
#tool nuget:?package=DotCompute.Backends.CUDA&version=0.6.2
DotCompute.Backends.CUDA
Production-ready NVIDIA GPU compute backend for .NET 9+ with full CUDA support.
Status: ✅ Production Ready
The CUDA backend provides GPU acceleration through:
- Complete CUDA Integration: Full NVRTC and CUDA Runtime API support
- Multi-GPU Support: Device enumeration and P2P transfers
- Ring Kernel Support: Persistent kernels with P2P, NCCL, and shared memory messaging
- Compute Capability Detection: Support for CC 5.0+ GPUs
- Memory Management: Device memory allocation and unified memory
- Native AOT Compatible: Full compatibility with Native AOT compilation
Key Features
CUDA Runtime Integration
- Device Management: Multi-GPU enumeration and selection
- Kernel Compilation: OpenCL C to PTX compilation via NVRTC
- Execution Engine: Stream management and asynchronous execution
- Memory Operations: Device memory allocation and host-device transfers
Performance Optimizations
- P2P Transfers: Direct GPU-to-GPU memory copying via NVLink
- Unified Memory: Automatic data migration between CPU/GPU
- Stream Processing: Asynchronous kernel execution
- Memory Pooling: Device memory reuse to minimize allocation overhead
Hardware Support
- Tested Hardware: RTX 2000 Ada Generation (CC 8.9)
- Minimum Requirement: Compute Capability 5.0+ (Maxwell architecture)
- Driver Support: CUDA Toolkit 12.0+ with compatible drivers
- Multi-GPU: Full support for multi-GPU systems with NVLink
Installation
dotnet add package DotCompute.Backends.CUDA --version 0.5.3
Usage
Basic Setup
using DotCompute.Backends.CUDA;
using Microsoft.Extensions.Logging;
var logger = LoggerFactory.Create(builder => builder.AddConsole())
.CreateLogger<CudaAccelerator>();
var accelerator = new CudaAccelerator(logger);
// Check availability before initialization
if (await accelerator.IsAvailableAsync())
{
await accelerator.InitializeAsync();
// GPU is ready for compute operations
}
Service Registration
services.AddSingleton<IAccelerator, CudaAccelerator>();
// OR
services.AddCudaBackend(); // Includes automatic GPU detection
Kernel Compilation and Execution
var kernelDef = new KernelDefinition
{
Name = "VectorAdd",
Source = @"
extern ""C"" __global__
void vector_add(float* a, float* b, float* result, int n)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
result[idx] = a[idx] + b[idx];
}
}
",
EntryPoint = "vector_add"
};
var compiledKernel = await accelerator.CompileKernelAsync(kernelDef);
// Execute with launch parameters
var launchParams = new KernelLaunchParameters
{
GridDim = new Dim3((uint)((length + 255) / 256), 1, 1),
BlockDim = new Dim3(256, 1, 1),
SharedMemorySize = 0
};
await compiledKernel.ExecuteAsync(parameters, launchParams);
Memory Management
// Allocate device memory
var deviceBuffer = await accelerator.AllocateAsync<float>(1_000_000);
// Copy data to GPU
await deviceBuffer.CopyFromAsync(hostData);
// Use in kernel execution
var parameters = new {
input = deviceBuffer,
output = outputBuffer,
length = 1_000_000
};
await compiledKernel.ExecuteAsync(parameters);
// Copy results back
await deviceBuffer.CopyToAsync(hostResults);
Architecture
Device Management
The CUDA backend automatically handles:
- Device Detection: Enumerates all CUDA-capable GPUs
- Capability Checking: Validates compute capability requirements
- Context Management: Creates and manages CUDA contexts
- Multi-GPU Coordination: Handles device selection and P2P setup
Compilation Pipeline
- Source Validation: Validates OpenCL C/CUDA C kernel source
- NVRTC Compilation: Compiles to PTX intermediate representation
- Module Loading: Loads PTX into CUDA context
- Function Extraction: Extracts kernel function references
- Caching: Stores compiled modules for reuse
Memory System
- Device Allocation: cuMemAlloc for device memory
- Host-Pinned Memory: cudaMallocHost for efficient transfers
- Unified Memory: cudaMallocManaged for automatic migration
- P2P Transfers: Direct memory copying between GPUs
Performance Benchmarks
Tested on RTX 2000 Ada Generation (8GB VRAM, CC 8.9):
| Operation | Data Size | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| Vector Add | 10M floats | 45ms | 2.1ms | 21x |
| Matrix Mult | 2048x2048 | 8.2s | 89ms | 92x |
| FFT | 1M complex | 156ms | 8.4ms | 18x |
| Reduction | 10M floats | 38ms | 1.8ms | 21x |
Memory Bandwidth
- Host to Device: ~12 GB/s (PCIe 4.0 x16)
- Device Memory: ~450 GB/s (GDDR6)
- P2P Transfer: ~25 GB/s (NVLink when available)
System Requirements
Hardware Requirements
- NVIDIA GPU: Compute Capability 5.0 or higher
- Memory: Minimum 2GB VRAM, 8GB+ recommended
- PCIe: PCIe 3.0 x16 or better for optimal performance
Software Requirements
- CUDA Toolkit: 12.0 or later
- Driver Version: 535.54 or later
- Operating System: Windows 10+, Linux, or WSL2
Supported GPUs
- RTX Series: RTX 2000+, RTX 3000+, RTX 4000+ (CC 7.5-8.9)
- GTX Series: GTX 1050+ (CC 6.1+)
- Quadro/Tesla: Most modern professional GPUs
- Data Center: A100, H100, V100 series
Configuration
Environment Variables
# CUDA installation path (auto-detected)
export CUDA_PATH="/usr/local/cuda"
# Enable additional logging
export DOTCOMPUTE_CUDA_VERBOSE=1
# Force specific GPU device
export CUDA_VISIBLE_DEVICES=0
Compilation Options
var options = new CompilationOptions
{
OptimizationLevel = OptimizationLevel.O3, // Maximum optimization
EnableDebugInfo = false, // Disable for production
TargetArchitecture = "compute_75", // Target specific CC
CustomOptions = new[] { "--use_fast_math" }
};
var kernel = await accelerator.CompileKernelAsync(definition, options);
Troubleshooting
Common Issues
GPU Not Detected
- Check Driver: Verify NVIDIA drivers are installed and up-to-date
- CUDA Toolkit: Ensure CUDA Toolkit is properly installed
- System Path: Verify CUDA binaries are in system PATH
- Compute Mode: Check GPU is not in prohibited compute mode
Compilation Failures
- Kernel Source: Validate OpenCL C/CUDA C syntax
- Include Paths: Ensure CUDA headers are accessible
- Compute Capability: Match target architecture to GPU capabilities
- Memory Limits: Check kernel doesn't exceed resource limits
Performance Issues
- Memory Bandwidth: Optimize memory access patterns
- Occupancy: Balance threads per block with resource usage
- Data Transfer: Minimize host-device memory copies
- Kernel Launch: Reduce kernel launch overhead with batching
Debug Tools
// Enable detailed CUDA logging
var logger = LoggerFactory.Create(builder =>
builder.AddConsole().SetMinimumLevel(LogLevel.Trace));
// Get detailed GPU information
var info = await accelerator.GetDeviceInfoAsync();
Console.WriteLine($"GPU: {info.Name}, Memory: {info.TotalMemory / (1024*1024*1024)}GB");
// Profile kernel execution
var stopwatch = Stopwatch.StartNew();
await kernel.ExecuteAsync(parameters);
await accelerator.SynchronizeAsync();
stopwatch.Stop();
Console.WriteLine($"Kernel time: {stopwatch.ElapsedMilliseconds}ms");
Advanced Features
Multi-GPU Programming
// Enumerate all available GPUs
var devices = await CudaAccelerator.GetAvailableDevicesAsync();
// Create accelerators for each GPU
var accelerators = new List<CudaAccelerator>();
foreach (var device in devices)
{
var acc = new CudaAccelerator(device, logger);
await acc.InitializeAsync();
accelerators.Add(acc);
}
// Enable P2P access between GPUs
await CudaAccelerator.EnablePeerAccessAsync(accelerators[0], accelerators[1]);
Unified Memory
// Allocate unified memory (accessible from CPU and GPU)
var unifiedBuffer = await accelerator.AllocateUnifiedAsync<float>(size);
// CPU can access directly
for (int i = 0; i < size; i++)
unifiedBuffer[i] = i * 2.0f;
// GPU kernels can access the same memory
await kernel.ExecuteAsync(new { data = unifiedBuffer, size });
// Results automatically available on CPU
Console.WriteLine($"Result: {unifiedBuffer[0]}");
Documentation & Resources
Comprehensive documentation is available for DotCompute:
Architecture Documentation
- Backend Integration - CUDA implementation and P2P architecture
- Memory Management - GPU memory pooling and transfers
Developer Guides
- Getting Started - Installation and CUDA setup
- Backend Selection - When to use GPU acceleration
- Multi-GPU Programming - P2P transfers and multi-GPU coordination
- Performance Tuning - GPU optimization techniques
- Troubleshooting - CUDA-specific issues
Examples
- Image Processing - GPU-accelerated filters (15-80x speedup)
- Matrix Operations - GPU matrix multiplication (1000x with tiling)
API Documentation
- API Reference - Complete API documentation
- Diagnostic Rules - DC001-DC012 analyzer reference
Support
- Documentation: Comprehensive Guides
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Contributing
The CUDA backend welcomes contributions in:
- New GPU architecture support (Ada Lovelace, Hopper)
- Advanced CUDA features (streams, events, graphs)
- Performance optimizations and benchmarks
- Multi-GPU coordination improvements
See CONTRIBUTING.md for development guidelines.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
-
net9.0
- DotCompute.Abstractions (>= 0.6.2)
- DotCompute.Core (>= 0.6.2)
- DotCompute.Generators (>= 0.6.2)
- DotCompute.Plugins (>= 0.6.2)
- Microsoft.CodeAnalysis.CSharp (>= 5.0.0)
- Microsoft.Extensions.Configuration.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Configuration.Binder (>= 10.0.2)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Hosting.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.2)
- Microsoft.NET.ILLink.Tasks (>= 9.0.12)
- Polly (>= 8.6.5)
- Polly.Extensions (>= 8.6.5)
- System.Runtime.InteropServices (>= 4.3.0)
NuGet packages (4)
Showing the top 4 NuGet packages that depend on DotCompute.Backends.CUDA:
| Package | Downloads |
|---|---|
|
DotCompute.Runtime
Runtime services and dependency injection integration for DotCompute. Provides kernel execution orchestration, automatic kernel discovery, service registration, and DI container integration with Microsoft.Extensions.DependencyInjection. Production-ready with comprehensive service lifetime management. |
|
|
DotCompute.Linq
GPU-accelerated LINQ extensions for DotCompute. Transparent GPU execution for LINQ queries with automatic kernel generation, fusion optimization, and Reactive Extensions support. |
|
|
Orleans.GpuBridge.Runtime
Runtime implementation for Orleans GPU Bridge - provides kernel catalog, memory management, and provider selection |
|
|
Orleans.GpuBridge.Backends.DotCompute
DotCompute backend provider for Orleans.GpuBridge.Core - Enables GPU acceleration via CUDA, OpenCL, Metal, and CPU with attribute-based kernel definition. |
GitHub repositories
This package is not used by any popular GitHub repositories.