DotCompute.Backends.OpenCL 0.6.2

dotnet add package DotCompute.Backends.OpenCL --version 0.6.2

DotCompute.Backends.OpenCL

Cross-platform OpenCL compute backend for .NET 9+ with GPU and accelerator support.

Status: ⚠️ EXPERIMENTAL

EXPERIMENTAL: This backend is functional for cross-platform GPU acceleration but has not been extensively production-tested across all vendor implementations. It works well for development and testing across NVIDIA, AMD, and Intel GPUs. Production use recommended only after validation on your target hardware.

The OpenCL backend provides cross-platform GPU acceleration:

  • OpenCL Runtime Integration: P/Invoke bindings to OpenCL C API
  • Device Management: Platform and device enumeration
  • Context Management: OpenCL context creation and lifecycle
  • Memory Management: Device memory allocation and transfers
  • Kernel Compilation: Runtime kernel compilation from OpenCL C
  • Ring Kernel Support: Persistent kernels with message passing
  • Plugin Architecture: Integrated with DotCompute plugin system
  • Cross-Vendor Support: NVIDIA, AMD, Intel, ARM Mali, Qualcomm Adreno

Key Components

OpenCL Accelerator

OpenCLAccelerator

Main accelerator implementation providing:

  • Device initialization and management
  • Kernel compilation and execution
  • Memory allocation and synchronization
  • OpenCL context lifecycle management
  • Error handling and diagnostics

Device Management

OpenCLDeviceManager

Manages OpenCL devices:

  • Platform enumeration
  • Device discovery and selection
  • Capability detection
  • Device information queries
  • Multi-device support

OpenCLDeviceInfo

Device information structure:

  • Device name and vendor
  • OpenCL version and driver version
  • Memory sizes (global, local, constant)
  • Compute capabilities (work group size, compute units)
  • Image support and dimensions
  • Device type (GPU, CPU, Accelerator)

OpenCLPlatformInfo

Platform information:

  • Platform name and vendor
  • OpenCL version support
  • Available extensions
  • Device count

Context and Execution

OpenCLContext

OpenCL context wrapper:

  • Context creation from devices
  • Command queue management
  • Resource lifecycle
  • Error handling
  • Synchronization primitives

Memory Management

OpenCLMemoryManager

Unified memory manager for OpenCL:

  • Device memory allocation
  • Host-device memory transfers
  • Buffer management
  • Memory pooling support
  • Synchronous and asynchronous operations

OpenCLMemoryBuffer

Buffer implementation:

  • Device buffer allocation
  • Read/write operations
  • Zero-copy mapping when supported
  • Rectangular buffer support
  • Sub-buffer creation
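
Sub-buffer creation carries a device-specific alignment constraint: per the OpenCL specification, a sub-buffer's origin must be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value, which is reported in bits. A minimal helper for rounding an offset up to that boundary (a hypothetical utility for illustration, not part of the package API):

```csharp
using System;

static class SubBufferAlign
{
    // CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in *bits*; sub-buffer origins
    // must fall on that boundary in *bytes*. Round an offset up to the next
    // valid origin.
    public static long AlignOffset(long offsetBytes, int baseAddrAlignBits)
    {
        long alignBytes = baseAddrAlignBits / 8;
        return ((offsetBytes + alignBytes - 1) / alignBytes) * alignBytes;
    }

    static void Main()
    {
        // e.g. a device reporting 1024-bit alignment => 128-byte boundaries
        Console.WriteLine(AlignOffset(1000, 1024)); // 1024
    }
}
```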

Kernel Management

OpenCLCompiledKernel

Compiled kernel representation:

  • Kernel compilation from OpenCL C source
  • Argument binding
  • Execution with work dimensions
  • Local memory specification
  • Synchronous and asynchronous execution

Ring Kernel Runtime

OpenCLRingKernelRuntime

Persistent kernel runtime for OpenCL:

  • Launch and activation control
  • Message queue management with atomic operations
  • Status monitoring and metrics collection
  • Deactivation and termination support
  • Compatible with all OpenCL 1.2+ devices

OpenCLRingKernelCompiler

Ring kernel compilation for OpenCL:

  • Generates OpenCL C code for persistent kernels
  • Message queue implementation with atomics
  • Control block for kernel lifecycle management
  • Lock-free communication patterns
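
The message queues follow the classic power-of-two ring-buffer pattern: head and tail counters grow monotonically and a bitmask maps them to slots. The sketch below illustrates only the index arithmetic in plain C# — on the device the increments are OpenCL atomics, and this is not the code the compiler actually generates:

```csharp
using System;

// Simplified ring-queue index arithmetic: monotonically increasing head/tail
// counters masked into a power-of-two slot array. Single-threaded for
// illustration; the generated OpenCL C uses atomic increments instead.
class RingQueue<T>
{
    private readonly T[] _slots;
    private readonly int _mask;
    private int _head, _tail;

    public RingQueue(int capacityPowerOfTwo)
    {
        _slots = new T[capacityPowerOfTwo];
        _mask = capacityPowerOfTwo - 1;
    }

    public bool TryEnqueue(T item)
    {
        if (_tail - _head == _slots.Length) return false; // full
        _slots[_tail++ & _mask] = item;
        return true;
    }

    public bool TryDequeue(out T item)
    {
        if (_head == _tail) { item = default!; return false; } // empty
        item = _slots[_head++ & _mask];
        return true;
    }
}

class RingQueueDemo
{
    static void Main()
    {
        var q = new RingQueue<int>(4);
        q.TryEnqueue(1);
        q.TryEnqueue(2);
        q.TryDequeue(out var first);
        Console.WriteLine(first); // 1
    }
}
```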

Factory

OpenCLAcceleratorFactory

Factory for creating OpenCL accelerators:

  • Automatic device selection
  • Configuration-based creation
  • Workload profile matching
  • Performance profile tuning

Native Interop

OpenCLRuntime

P/Invoke bindings to OpenCL C API:

  • Platform and device functions
  • Context and queue functions
  • Memory object functions
  • Kernel functions
  • Event and synchronization functions

OpenCLTypes

Native type definitions:

  • Platform and device IDs
  • Context and queue handles
  • Memory object handles
  • Kernel handles
  • Error codes and status types

OpenCLException

Exception type for OpenCL errors:

  • Error code mapping
  • Human-readable error messages
  • Stack trace preservation
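
Error code mapping follows the numeric codes defined in the OpenCL specification's cl.h header. A few common ones, shown as a hypothetical lookup table (the package's actual mapping may differ in wording):

```csharp
using System;
using System.Collections.Generic;

static class ClErrors
{
    // Numeric error codes from the OpenCL specification (cl.h).
    private static readonly Dictionary<int, string> Names = new()
    {
        [0]   = "CL_SUCCESS",
        [-1]  = "CL_DEVICE_NOT_FOUND",
        [-4]  = "CL_MEM_OBJECT_ALLOCATION_FAILURE",
        [-5]  = "CL_OUT_OF_RESOURCES",
        [-11] = "CL_BUILD_PROGRAM_FAILURE",
        [-52] = "CL_INVALID_KERNEL_ARGS",
        [-54] = "CL_INVALID_WORK_GROUP_SIZE",
    };

    public static string Describe(int code) =>
        Names.TryGetValue(code, out var name) ? name : $"Unknown OpenCL error {code}";

    static void Main()
    {
        Console.WriteLine(Describe(-11)); // CL_BUILD_PROGRAM_FAILURE
    }
}
```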

Installation

dotnet add package DotCompute.Backends.OpenCL --version 0.6.2

Usage

Basic Setup

using DotCompute.Backends.OpenCL;
using Microsoft.Extensions.Logging;

var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = loggerFactory.CreateLogger<OpenCLAccelerator>();

// Create accelerator
var accelerator = new OpenCLAccelerator(logger, loggerFactory);

// Initialize with default device (first GPU or CPU)
await accelerator.InitializeAsync();

Console.WriteLine($"Using: {accelerator.Name}");
Console.WriteLine($"Global Memory: {accelerator.Info.TotalMemory / (1024*1024)} MB");

Service Registration

using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register OpenCL backend
services.AddSingleton<IAccelerator, OpenCLAccelerator>();

// OR use plugin registration
services.AddDotComputeBackend("DotCompute.Backends.OpenCL");

Device Selection

using DotCompute.Backends.OpenCL.DeviceManagement;

var deviceManager = new OpenCLDeviceManager(logger);

// Enumerate all devices
var devices = await deviceManager.EnumerateDevicesAsync();

foreach (var device in devices)
{
    Console.WriteLine($"Device: {device.Name}");
    Console.WriteLine($"  Type: {device.DeviceType}");
    Console.WriteLine($"  Compute Units: {device.MaxComputeUnits}");
    Console.WriteLine($"  Global Memory: {device.GlobalMemorySize / (1024*1024)} MB");
    Console.WriteLine($"  Local Memory: {device.LocalMemorySize / 1024} KB");
}

// Select specific device
var selectedDevice = devices.FirstOrDefault(d => d.DeviceType == DeviceType.GPU);
if (selectedDevice != null)
{
    await accelerator.InitializeAsync(selectedDevice);
}

Kernel Compilation and Execution

using DotCompute.Abstractions.Kernels;

// Define OpenCL kernel
var kernelDef = new KernelDefinition
{
    Name = "VectorAdd",
    Source = @"
        __kernel void vector_add(
            __global const float* a,
            __global const float* b,
            __global float* result,
            const int length)
        {
            int gid = get_global_id(0);
            if (gid < length) {
                result[gid] = a[gid] + b[gid];
            }
        }
    ",
    EntryPoint = "vector_add"
};

// Compile kernel
var compiledKernel = await accelerator.CompileKernelAsync(kernelDef);

// Allocate device memory
var length = 1_000_000;
var bufferA = await accelerator.Memory.AllocateAsync<float>(length);
var bufferB = await accelerator.Memory.AllocateAsync<float>(length);
var bufferResult = await accelerator.Memory.AllocateAsync<float>(length);

// Copy data to device
var dataA = Enumerable.Range(0, length).Select(i => (float)i).ToArray();
var dataB = Enumerable.Range(0, length).Select(i => (float)(i * 2)).ToArray();

await bufferA.CopyFromAsync(dataA);
await bufferB.CopyFromAsync(dataB);

// Set kernel arguments and execute
var launchParams = new KernelLaunchParameters
{
    // Round the global size up to a multiple of the local size; the kernel's
    // bounds check (gid < length) guards the extra work-items.
    GlobalWorkSize = new[] { (uint)((length + 255) / 256 * 256) },
    LocalWorkSize = new[] { 256u }
};

await compiledKernel.ExecuteAsync(new object[]
{
    bufferA,
    bufferB,
    bufferResult,
    length
}, launchParams);

// Read results back
var results = new float[length];
await bufferResult.CopyToAsync(results);

// Cleanup
await bufferA.DisposeAsync();
await bufferB.DisposeAsync();
await bufferResult.DisposeAsync();

Memory Operations

// Allocate buffer
var buffer = await accelerator.Memory.AllocateAsync<float>(10_000);

// Write to device
var hostData = new float[10_000];
await buffer.CopyFromAsync(hostData);

// Read from device
var resultData = new float[10_000];
await buffer.CopyToAsync(resultData);

// Map memory for zero-copy access (if supported)
if (accelerator.DeviceInfo?.SupportsHostMemoryMapping == true)
{
    var mappedPtr = await buffer.MapAsync(MapMode.ReadWrite);
    // Access memory directly...
    await buffer.UnmapAsync(mappedPtr);
}

Using Factory

using DotCompute.Backends.OpenCL.Factory;

var factory = new OpenCLAcceleratorFactory(configuration, logger);

// Create accelerator with performance profile
var accelerator = await factory.CreateAsync(new WorkloadProfile
{
    WorkloadType = WorkloadType.Compute,
    DataSize = DataSize.Large,
    MemoryIntensive = true
});

Architecture

Component Hierarchy

OpenCLAccelerator (IAccelerator)
    ├── OpenCLContext (Context management)
    ├── OpenCLDeviceManager (Device discovery)
    ├── OpenCLMemoryManager (Memory operations)
    └── OpenCLCompiledKernel (Kernel execution)

Native Layer:
    ├── OpenCLRuntime (P/Invoke bindings)
    ├── OpenCLTypes (Native type definitions)
    └── OpenCLException (Error handling)

Initialization Flow

  1. Platform Enumeration: Detect all OpenCL platforms
  2. Device Discovery: Find devices on each platform
  3. Device Selection: Choose appropriate device
  4. Context Creation: Create OpenCL context for device
  5. Queue Creation: Create command queue for execution
  6. Memory Manager: Initialize memory management
  7. Ready: Accelerator ready for kernel execution

Kernel Execution Flow

  1. Kernel Compilation: Compile OpenCL C to device binary
  2. Argument Binding: Bind buffers and scalar arguments
  3. Work Sizing: Calculate global and local work sizes
  4. Enqueue: Enqueue kernel for execution
  5. Synchronize: Wait for completion (if synchronous)
  6. Result Retrieval: Copy results back to host
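
Step 3 (work sizing) typically rounds the global size up to a multiple of the local size, since pre-2.0 OpenCL requires the global work size to be evenly divisible by the local work size; the kernel's bounds check then filters the padding work-items. A minimal helper:

```csharp
using System;

static class WorkSize
{
    // Round the global work size up to the nearest multiple of the local work
    // size so every work-group is fully populated; excess work-items are
    // filtered by the kernel's `if (gid < length)` guard.
    public static uint RoundUpGlobal(uint length, uint localSize)
        => ((length + localSize - 1) / localSize) * localSize;

    static void Main()
    {
        Console.WriteLine(RoundUpGlobal(1_000_000, 256)); // 1000192
    }
}
```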

Supported Platforms

Operating Systems

  • Windows: 10, 11, Server 2019+
  • Linux: Most distributions with OpenCL runtime
  • macOS: 10.13+ (deprecated by Apple, prefer Metal backend)

OpenCL Versions

  • OpenCL 1.2: Minimum supported version
  • OpenCL 2.0: Full feature support
  • OpenCL 2.1/2.2/3.0: Enhanced features when available
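
Version detection generally parses the platform or device version string, which the specification mandates in the form "OpenCL <major>.<minor> <vendor-specific information>". A small parser (a hypothetical helper, not the package's API):

```csharp
using System;

static class ClVersionString
{
    // CL_PLATFORM_VERSION / CL_DEVICE_VERSION are mandated to start with
    // "OpenCL <major>.<minor>", followed by vendor-specific text.
    public static (int Major, int Minor) Parse(string version)
    {
        var numeric = version.Split(' ')[1].Split('.');
        return (int.Parse(numeric[0]), int.Parse(numeric[1]));
    }

    static void Main()
    {
        Console.WriteLine(Parse("OpenCL 3.0 CUDA 12.4.99")); // (3, 0)
    }
}
```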

Device Types

GPU Devices
  • NVIDIA: GeForce, Quadro, Tesla (via NVIDIA OpenCL runtime)
  • AMD: Radeon, FirePro, Instinct (via AMD OpenCL or ROCm)
  • Intel: Iris, Arc Graphics (via Intel OpenCL runtime)
  • ARM Mali: Mobile and embedded GPUs
  • Qualcomm Adreno: Mobile GPUs

CPU Devices
  • Intel: via Intel OpenCL CPU runtime
  • AMD: via AMD OpenCL CPU runtime
  • ARM: via ARM Compute Library

Accelerator Devices
  • FPGA: Intel/Xilinx FPGA with OpenCL support
  • DSP: Specialized signal processing accelerators

System Requirements

Minimum

  • .NET 9.0 or later
  • OpenCL 1.2 compatible device
  • OpenCL runtime installed

Recommended

  • OpenCL 2.0+ compatible device
  • 4GB+ device memory
  • Latest vendor drivers

Installing OpenCL Runtime

Windows
  • NVIDIA: Install CUDA Toolkit or NVIDIA drivers
  • AMD: Install AMD Radeon Software
  • Intel: Install Intel Graphics drivers

Linux
  • NVIDIA: Install CUDA Toolkit or nvidia-opencl-icd
  • AMD: Install ROCm or amdgpu-pro drivers
  • Intel: Install intel-opencl-icd or beignet

# Ubuntu/Debian
sudo apt-get install ocl-icd-opencl-dev nvidia-opencl-icd

# Fedora/RHEL
sudo dnf install ocl-icd-devel pocl

# Verify installation
clinfo

macOS

OpenCL is deprecated on macOS. Use Metal backend for macOS devices.

Configuration

Environment Variables

# Enable OpenCL debugging
export DOTCOMPUTE_OPENCL_DEBUG=1

# Select specific platform
export DOTCOMPUTE_OPENCL_PLATFORM=0

# Select specific device
export DOTCOMPUTE_OPENCL_DEVICE=0

# Force CPU device (for debugging)
export DOTCOMPUTE_OPENCL_FORCE_CPU=1
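
If your own bootstrap code needs to honor the same variables (for example, to log which platform or device index was forced), reading them is straightforward. The variable names above are the only assumption here; how the backend itself consumes them is internal:

```csharp
using System;

static class OpenCLEnv
{
    // Parse an integer environment variable, falling back to a default when
    // the variable is unset or malformed.
    public static int GetIndex(string name, int fallback) =>
        int.TryParse(Environment.GetEnvironmentVariable(name), out var i) ? i : fallback;

    static void Main()
    {
        int platform = GetIndex("DOTCOMPUTE_OPENCL_PLATFORM", 0);
        int device   = GetIndex("DOTCOMPUTE_OPENCL_DEVICE", 0);
        Console.WriteLine($"platform={platform} device={device}");
    }
}
```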

Configuration Options

var options = new OpenCLOptions
{
    PreferredDeviceType = DeviceType.GPU,
    EnableProfiling = true,
    EnableOutOfOrderExecution = false,
    BuildOptions = "-cl-fast-relaxed-math -cl-mad-enable",
    CacheKernels = true
};

Current Limitations

  1. Image Processing: Limited support for image objects
  2. Shared Virtual Memory: SVM support not implemented
  3. Device Partitioning: Sub-device creation not supported
  4. Pipes: OpenCL 2.0 pipes not implemented
  5. P2P Message Passing: Ring kernel P2P strategy not available on OpenCL (use SharedMemory or AtomicQueue)

Troubleshooting

Device Not Found

  1. Verify Runtime: Run clinfo to list OpenCL platforms and devices
  2. Check Drivers: Ensure latest GPU drivers installed
  3. Permissions: On Linux, ensure user in video group
  4. Platform Selection: Try different OpenCL platforms

Compilation Failures

  1. Kernel Syntax: Validate OpenCL C syntax
  2. Build Options: Check build options compatibility
  3. Extensions: Verify required extensions available
  4. Device Capabilities: Check device limits (work group size, etc.)

Performance Issues

  1. Work Group Size: Optimize local work size for device
  2. Memory Access: Ensure coalesced memory access patterns
  3. Transfer Overhead: Minimize host-device transfers
  4. Kernel Complexity: Profile kernel execution time

Debug Tools

// Enable detailed logging
var logger = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Trace));

// Get device capabilities
var info = accelerator.DeviceInfo;
Console.WriteLine($"Max Work Group Size: {info.MaxWorkGroupSize}");
Console.WriteLine($"Max Compute Units: {info.MaxComputeUnits}");
Console.WriteLine($"Extensions: {string.Join(", ", info.Extensions)}");

// Profile kernel execution
var sw = Stopwatch.StartNew();
await kernel.ExecuteAsync(args, launchParams);
await accelerator.SynchronizeAsync();
sw.Stop();
Console.WriteLine($"Kernel time: {sw.ElapsedMilliseconds}ms");
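
For memory-bound kernels like vector add, converting the measured time into effective bandwidth makes results comparable across devices: vector add moves two input buffers plus one output buffer per launch. A small helper for the arithmetic:

```csharp
using System;

static class Bandwidth
{
    // Effective bandwidth in GB/s for a kernel that moves `bytesMoved` bytes
    // (reads + writes) in `elapsedMs` milliseconds.
    public static double GbPerSec(long bytesMoved, double elapsedMs)
        => bytesMoved / (elapsedMs / 1000.0) / 1e9;

    static void Main()
    {
        // Vector add over 1M floats: 2 reads + 1 write = 12 MB moved.
        long bytes = 3L * 1_000_000 * sizeof(float);
        Console.WriteLine($"{GbPerSec(bytes, 0.1):F1} GB/s"); // 120.0 GB/s
    }
}
```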

Advanced Features

Multi-Device Execution

// Create accelerator for each device
var accelerators = new List<OpenCLAccelerator>();
foreach (var device in devices)
{
    var acc = new OpenCLAccelerator(logger, loggerFactory);
    await acc.InitializeAsync(device);
    accelerators.Add(acc);
}

// Distribute work across devices
var tasks = accelerators.Select(async acc =>
{
    var kernel = await acc.CompileKernelAsync(kernelDef);
    await kernel.ExecuteAsync(args, launchParams);
});

await Task.WhenAll(tasks);

Custom Build Options

var options = new CompilationOptions
{
    OptimizationLevel = OptimizationLevel.O3,
    CustomOptions = new[]
    {
        "-cl-mad-enable",           // Mad operations
        "-cl-fast-relaxed-math",    // Fast math
        "-cl-finite-math-only",     // No INF/NaN
        "-cl-unsafe-math-optimizations"
    }
};

var kernel = await accelerator.CompileKernelAsync(definition, options);

Ring Kernels with OpenCL

using DotCompute.Abstractions.RingKernels;

// Define persistent ring kernel for graph processing
[RingKernel(
    KernelId = "graph-process",
    Domain = RingKernelDomain.GraphAnalytics,
    Mode = RingKernelMode.Persistent,
    Capacity = 8192,
    Backends = KernelBackends.OpenCL)]
public static void ProcessGraphVertex(
    IMessageQueue<GraphMessage> incoming,
    IMessageQueue<GraphMessage> outgoing,
    Span<float> values)
{
    int vertexId = Kernel.ThreadId.X;

    // Process messages with OpenCL atomic operations
    while (incoming.TryDequeue(out var msg))
    {
        if (msg.TargetVertex == vertexId)
            values[vertexId] += msg.Value;
    }

    // Send updates to neighbors
    outgoing.Enqueue(new GraphMessage { TargetVertex = ..., Value = ... });
}

// Launch ring kernel on OpenCL device
var runtime = orchestrator.GetRingKernelRuntime();
await runtime.LaunchAsync("graph-process", gridSize: 1024, blockSize: 256);
await runtime.ActivateAsync("graph-process");

// Send messages
await runtime.SendMessageAsync("graph-process", new GraphMessage { ... });

// Monitor performance
var metrics = await runtime.GetMetricsAsync("graph-process");
Console.WriteLine($"Throughput: {metrics.ThroughputMsgsPerSec:F2} msgs/sec");
Console.WriteLine($"GPU Utilization: {metrics.GpuUtilizationPercent:F1}%");

Dependencies

  • DotCompute.Core: Core runtime components
  • DotCompute.Abstractions: Interface definitions
  • DotCompute.Plugins: Plugin system integration
  • System.Runtime.InteropServices: P/Invoke support
  • Polly: Resilience and retry policies
  • Microsoft.Extensions.Logging: Logging infrastructure

Future Enhancements

  1. OpenCL 2.0+ Features: SVM, pipes, device-side enqueue
  2. Image Support: Comprehensive image object operations
  3. SPIR-V: Support for SPIR-V kernels
  4. Sub-Devices: Device partitioning support
  5. Interoperability: OpenGL/DirectX interop
  6. Performance: Further optimization and tuning

Documentation & Resources

Comprehensive documentation is available for DotCompute:

Architecture Documentation

Developer Guides

API Documentation

Support

Contributing

Contributions are welcome, particularly in:

  • Testing on diverse OpenCL implementations
  • Platform-specific optimizations
  • Additional OpenCL feature support
  • Performance benchmarking
  • Documentation improvements

See CONTRIBUTING.md for guidelines.

References

License

MIT License - Copyright (c) 2025 Michael Ivertowski

Compatible and additional computed target framework versions

.NET net9.0 is compatible. Computed frameworks: net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

NuGet packages (2)

Showing the top 2 NuGet packages that depend on DotCompute.Backends.OpenCL:

Package Downloads
DotCompute.Runtime

Runtime services and dependency injection integration for DotCompute. Provides kernel execution orchestration, automatic kernel discovery, service registration, and DI container integration with Microsoft.Extensions.DependencyInjection. Production-ready with comprehensive service lifetime management.

DotCompute.Linq

GPU-accelerated LINQ extensions for DotCompute. Transparent GPU execution for LINQ queries with automatic kernel generation, fusion optimization, and Reactive Extensions support.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
0.6.2 122 2/9/2026
0.5.3 250 2/2/2026
0.5.2 610 12/8/2025
0.5.1 448 11/28/2025
0.5.0 226 11/27/2025
0.4.2-rc2 389 11/11/2025
0.4.1-rc2 333 11/6/2025