DotCompute.Memory
0.6.2
dotnet add package DotCompute.Memory --version 0.6.2
DotCompute.Memory
High-performance unified memory management system for DotCompute with zero-copy operations and cross-device memory transfers.
Status: ✅ Production Ready
The Memory module provides comprehensive memory management capabilities:
- Memory Pooling: 90% allocation reduction through object pooling
- Zero-Copy Operations: Span<T> and Memory<T> throughout
- Cross-Device Transfers: Unified abstraction for CPU/GPU memory
- Thread Safety: Lock-free operations and concurrent access
- Native AOT: Full compatibility with Native AOT compilation
Key Features
Unified Memory Management
- UnifiedMemoryManager: Central memory management authority for DotCompute
- Cross-Backend Support: Works with CPU, CUDA, Metal, OpenCL backends
- Automatic Cleanup: Background defragmentation and resource reclamation
- Statistics and Monitoring: Comprehensive metrics for memory usage
Buffer Abstractions
OptimizedUnifiedBuffer<T>
Performance-optimized unified buffer with:
- Object pooling for frequent allocations (90% reduction target)
- Lazy initialization for expensive operations
- Zero-copy operations using Span<T> and Memory<T>
- Async-first design with optimized synchronization
- Memory prefetching for improved cache performance
- NUMA-aware memory allocation
Buffer States
Buffers track their state across devices:
- HostOnly: Data exists only in system memory
- DeviceOnly: Data exists only in device memory
- Synchronized: Data is consistent across host and device
- HostDirty: Host copy modified, needs sync to device
- DeviceDirty: Device copy modified, needs sync to host
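As an illustration, the transitions above might be observed like this. This is a sketch only: the `State` property and `BufferState` enum names are assumptions inferred from the state list, while `AllocateAsync`, `AsSpan`, and `SynchronizeAsync` appear in the usage examples later in this README.

```csharp
// Hypothetical walk through the buffer state machine described above.
// The State property and BufferState enum names are assumptions.
var buffer = await memoryManager.AllocateAsync<float>(1024);
// Freshly allocated and touched only on the host:
//   buffer.State == BufferState.HostOnly
buffer.AsSpan()[0] = 42.0f;
// Host copy modified, device copy stale:
//   buffer.State == BufferState.HostDirty
await buffer.SynchronizeAsync();
// Host and device copies now agree:
//   buffer.State == BufferState.Synchronized
await kernel.ExecuteAsync(buffer);
// Kernel wrote to device memory, host copy stale:
//   buffer.State == BufferState.DeviceDirty
```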
Memory Pooling
HighPerformanceObjectPool<T>
High-performance object pool optimized for compute workloads:
- Lock-free operations using ConcurrentStack
- Automatic pool size management
- Thread-local storage for hot paths
- Performance metrics and monitoring
- Configurable eviction policies
- NUMA-aware allocation when available
MemoryPool
Specialized memory pool for buffer management:
- Buffer size categorization for optimal reuse
- Automatic cleanup of unused buffers
- Configurable retention policies
- Cross-backend buffer pooling
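A minimal sketch of how such a buffer pool might be used directly. The `RentAsync`/`ReturnAsync` method names are assumptions; in normal use the pooling is transparent, since `UnifiedMemoryManager.AllocateAsync` routes through the pool.

```csharp
// Hypothetical direct use of the buffer pool (method names assumed).
var pool = new MemoryPool(accelerator, logger);

// Rent a buffer; the pool rounds the request up to its size bucket,
// so a later request of similar size can reuse the same allocation.
var buffer = await pool.RentAsync<float>(4096);

// ... use the buffer in kernels or transfers ...

// Returning the buffer makes it eligible for reuse instead of freeing it.
await pool.ReturnAsync(buffer);
```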
Advanced Transfer Engine
AdvancedMemoryTransferEngine
Sophisticated memory transfer system:
- Concurrent Transfers: Parallel host-device data movement
- Transfer Pipelining: Overlap compute and transfer operations
- Adaptive Batching: Automatic batch size optimization
- Transfer Statistics: Detailed performance metrics
- Error Recovery: Automatic retry with exponential backoff
Transfer Options
- Asynchronous Transfers: Non-blocking memory operations
- Pinned Memory: Use page-locked memory for faster transfers
- Streaming Transfers: Support for chunked data movement
- Priority Scheduling: Prioritize critical transfers
Zero-Copy Operations
ZeroCopyOperations
Optimized operations that avoid unnecessary data copies:
- Direct Span<T> manipulation
- Memory-mapped operations
- Pointer-based transformations
- SIMD-accelerated operations when available
UnsafeMemoryOperations
Low-level unsafe memory operations:
- Pointer arithmetic utilities
- Unmanaged memory marshalling
- Platform-specific optimizations
- Bounds-checked unsafe operations
Memory Utilities
MemoryAllocator
Centralized memory allocation with:
- Aligned memory allocation
- Platform-specific allocators
- Huge page support (when available)
- Memory tracking and diagnostics
ArrayPoolWrapper<T>
Wrapper around ArrayPool<T> with:
- Automatic size normalization
- Return tracking to prevent double-free
- Usage statistics
- Integration with DotCompute pooling
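A short sketch of the wrapper in use, assuming it mirrors the `Rent`/`Return` surface of `System.Buffers.ArrayPool<T>` (that API shape is an assumption, not confirmed by this README):

```csharp
// Hypothetical usage, assuming an ArrayPool<T>-style Rent/Return surface.
var wrapper = new ArrayPoolWrapper<byte>(logger);

// Size normalization means the returned array may be larger than requested.
byte[] array = wrapper.Rent(10_000);
try
{
    // ... fill and use array ...
}
finally
{
    // Return tracking guards against double-free and enables usage statistics.
    wrapper.Return(array);
}
```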
Performance Metrics
MemoryStatistics
Comprehensive memory usage statistics:
- Total bytes allocated/freed
- Active allocation count
- Peak memory usage
- Allocation/deallocation rates
- Pool efficiency metrics
BufferPerformanceMetrics
Per-buffer performance tracking:
- Transfer count and total time
- Average transfer bandwidth
- Last access timestamp
- Cache hit/miss ratio
TransferStatistics
Memory transfer performance data:
- Bytes transferred (host-to-device, device-to-host)
- Transfer count and average time
- Bandwidth utilization
- Transfer efficiency metrics
Installation
```bash
dotnet add package DotCompute.Memory --version 0.6.2
```
Usage
Basic Memory Management
```csharp
using DotCompute.Memory;
using DotCompute.Abstractions;

// Create memory manager (typically done by accelerator)
var memoryManager = new UnifiedMemoryManager(accelerator, logger);

// Allocate a buffer
var buffer = await memoryManager.AllocateAsync<float>(1_000_000);

// Write data to buffer
var data = new float[1_000_000];
await buffer.CopyFromAsync(data);

// Use buffer in kernel execution
await kernel.ExecuteAsync(buffer);

// Read results back
var results = new float[1_000_000];
await buffer.CopyToAsync(results);

// Dispose when done (returns to pool if eligible)
await buffer.DisposeAsync();
```
CPU-Only Memory Manager
```csharp
using DotCompute.Memory;

// Create CPU-only memory manager
var memoryManager = new UnifiedMemoryManager(logger);

// Allocate CPU buffers
var buffer = await memoryManager.AllocateAsync<double>(10_000);

// All operations work on CPU memory
await buffer.CopyFromAsync(cpuData);
```
Zero-Copy Operations with Span<T>
```csharp
using DotCompute.Memory;

// Allocate buffer
var buffer = await memoryManager.AllocateAsync<float>(1000);

// Get host span for zero-copy access
var span = buffer.AsSpan();

// Direct manipulation without copying
for (int i = 0; i < span.Length; i++)
{
    span[i] = i * 2.0f;
}

// Mark buffer as dirty to sync to device
await buffer.SynchronizeAsync();
```
Memory Pooling
```csharp
using DotCompute.Memory;

// High-performance object pool
var pool = new HighPerformanceObjectPool<MyObject>(
    createFunc: () => new MyObject(),
    resetAction: obj => obj.Reset(),
    validateFunc: obj => obj.IsValid(),
    config: new PoolConfiguration
    {
        MaxPoolSize = 1000,
        MinPoolSize = 10,
        EvictionPolicy = EvictionPolicy.LRU
    },
    logger: logger
);

// Get object from pool
var obj = pool.Get();

// Use object
obj.DoWork();

// Return to pool for reuse
pool.Return(obj);

// Get pool statistics
var stats = pool.GetStatistics();
Console.WriteLine($"Pool hits: {stats.PoolHitRate:P2}");
Console.WriteLine($"Allocation reduction: {stats.AllocationReduction:P2}");
```
Advanced Transfer Engine
```csharp
using DotCompute.Memory;
using DotCompute.Memory.Types;

var transferEngine = new AdvancedMemoryTransferEngine(accelerator, logger);

// Concurrent transfers with options
var options = new ConcurrentTransferOptions
{
    MaxConcurrency = 4,
    UsePinnedMemory = true,
    EnablePipelining = true,
    ChunkSize = 1024 * 1024 // 1MB chunks
};

var transfers = new[]
{
    (source: data1, destination: deviceBuffer1),
    (source: data2, destination: deviceBuffer2),
    (source: data3, destination: deviceBuffer3)
};

var result = await transferEngine.ExecuteConcurrentTransfersAsync(transfers, options);
Console.WriteLine($"Total transferred: {result.TotalBytesTransferred:N0} bytes");
Console.WriteLine($"Transfer rate: {result.AverageBandwidth:F2} MB/s");
Console.WriteLine($"Failed transfers: {result.FailedTransfers}");
```
Memory Statistics and Monitoring
```csharp
using DotCompute.Memory;

// Get memory statistics
var stats = memoryManager.Statistics;
Console.WriteLine($"Total allocated: {stats.TotalBytesAllocated:N0} bytes");
Console.WriteLine($"Active allocations: {stats.ActiveAllocationCount}");
Console.WriteLine($"Peak usage: {stats.PeakMemoryUsage:N0} bytes");
Console.WriteLine($"Pool efficiency: {stats.PoolEfficiency:P2}");

// Get detailed buffer metrics
var bufferMetrics = buffer.GetPerformanceMetrics();
Console.WriteLine($"Transfer count: {bufferMetrics.TransferCount}");
Console.WriteLine($"Avg bandwidth: {bufferMetrics.AverageBandwidth:F2} MB/s");
Console.WriteLine($"Last accessed: {bufferMetrics.LastAccessTime}");
```
Buffer Slicing
```csharp
using DotCompute.Memory;

// Create buffer
var buffer = await memoryManager.AllocateAsync<int>(1000);

// Create slice (view) without copying data
var slice = buffer.Slice(100, 200); // Elements 100-299

// Operate on slice
await kernel.ExecuteAsync(slice);

// Slices share underlying memory with parent buffer
```
NUMA-Aware Allocation
```csharp
using DotCompute.Memory;

// Allocator with NUMA awareness
var allocator = new MemoryAllocator(logger)
{
    UseNumaAllocation = true,
    PreferredNumaNode = 0
};

// Allocate on specific NUMA node for optimal performance
var buffer = allocator.AllocateAligned<float>(1_000_000, alignment: 64);
```
Architecture
Memory Hierarchy
```text
UnifiedMemoryManager (Top-level coordinator)
├── MemoryPool (Buffer pooling and reuse)
├── AdvancedMemoryTransferEngine (Transfer orchestration)
├── MemoryAllocator (Low-level allocation)
└── Buffer Registry (Active buffer tracking)

Buffers:
├── OptimizedUnifiedBuffer<T> (General-purpose unified buffer)
├── UnifiedBufferView (Non-owning view of buffer data)
└── UnifiedBufferSlice (Slice/window into buffer)
```
Buffer Lifecycle
- Allocation: Request buffer from memory manager
- Initialization: Initialize host or device memory
- Data Transfer: Move data between host and device
- Execution: Use in kernel operations
- Synchronization: Ensure consistency across devices
- Disposal: Return to pool or free memory
Pooling Strategy
The memory system uses tiered pooling:
- Thread-Local Pools: Fast path for frequent allocations
- Global Pool: Shared across threads with lock-free access
- Size-Based Buckets: Categorize buffers by size for optimal reuse
- Automatic Eviction: Remove unused buffers based on LRU policy
Target metrics:
- 90%+ reduction in allocation calls
- Sub-microsecond pool access latency
- 95%+ pool hit rate for common sizes
Transfer Optimization
Memory transfers are optimized through:
- Pinned Memory: Page-locked memory for DMA transfers
- Asynchronous Transfers: Non-blocking operations
- Transfer Pipelining: Overlap compute and data movement
- Batching: Combine small transfers into larger operations
- Concurrent Streams: Parallel transfers when possible
Performance Benchmarks
Tested on RTX 2000 Ada with 16GB RAM:
| Operation | Standard | Optimized | Improvement |
|---|---|---|---|
| Buffer Allocation | 45μs | 4μs | 11.2x |
| Host-Device Copy (10MB) | 2.8ms | 2.4ms | 1.2x |
| Device-Host Copy (10MB) | 2.9ms | 2.5ms | 1.2x |
| Buffer Pool Hit | N/A | 120ns | N/A |
| Zero-Copy Access | 450ns | 45ns | 10x |
Memory usage:
- Standard allocation: every request hits the allocator (baseline)
- Pooled allocation: roughly 10% of requests hit the allocator (90% reduction)
System Requirements
- .NET 9.0 or later
- 4GB+ RAM recommended (8GB+ for large datasets)
- Native AOT compatible
Optional Features
- NUMA Support: Requires a NUMA-aware OS (Windows Server, Linux with NUMA)
- Huge Pages: Requires OS configuration (Linux: hugetlbfs; Windows: Large Pages privilege)
- GPU Acceleration: Requires a compatible accelerator (CUDA, Metal, OpenCL)
Configuration
Memory Manager Options
Configure memory manager behavior:
```csharp
var options = new MemoryManagerOptions
{
    EnablePooling = true,
    MaxPoolSize = 1024 * 1024 * 1024, // 1GB
    MinPoolSize = 64 * 1024 * 1024,   // 64MB
    EnableAutoDefragmentation = true,
    DefragmentationInterval = TimeSpan.FromMinutes(5),
    EnableStatistics = true
};

var memoryManager = new UnifiedMemoryManager(accelerator, options, logger);
```
Pool Configuration
```csharp
var poolConfig = new PoolConfiguration
{
    MaxPoolSize = 1000,
    MinPoolSize = 10,
    MaxObjectSize = 100 * 1024 * 1024, // 100MB
    EvictionPolicy = EvictionPolicy.LRU,
    MaintenanceInterval = TimeSpan.FromMinutes(1),
    EnableMetrics = true
};
```
Troubleshooting
Out of Memory Errors
- Check Statistics: Review memoryManager.Statistics for usage patterns
- Enable Cleanup: Ensure automatic cleanup is running
- Manual Cleanup: Call memoryManager.TrimExcessAsync()
- Adjust Pool Size: Reduce MaxPoolSize if memory constrained
Performance Issues
- Pool Efficiency: Check the PoolEfficiency metric (target: 90%+)
- Transfer Bandwidth: Monitor transfer statistics for bottlenecks
- Alignment: Ensure buffers are properly aligned (64-byte for SIMD)
- Pinned Memory: Enable pinned memory for large transfers
Memory Leaks
- Dispose Buffers: Always dispose buffers explicitly or use using
- Registry Check: Review the _activeBuffers count in the memory manager
- Weak References: Check _bufferRegistry for leaked references
- Enable Diagnostics: Use UnifiedBufferDiagnostics for leak detection
Advanced Topics
Custom Buffer Types
Implement IUnifiedMemoryBuffer<T> for custom buffer behavior:
```csharp
public class CustomBuffer<T> : IUnifiedMemoryBuffer<T> where T : unmanaged
{
    public int Length { get; }
    public long SizeInBytes { get; }

    public async ValueTask CopyFromAsync(ReadOnlyMemory<T> source)
    {
        // Custom implementation
    }

    // Implement other interface members...
}
```
Integration with External Libraries
```csharp
// Wrap external library buffers
var externalPointer = ExternalLibrary.AllocateBuffer(size);
var wrappedBuffer = MemoryAllocator.WrapUnmanagedPointer<float>(
    externalPointer,
    size,
    ownsMemory: false
);
```
Cross-Process Memory Sharing
```csharp
// Create shared memory buffer
var sharedBuffer = await memoryManager.AllocateSharedAsync<float>(
    size,
    name: "SharedComputeBuffer"
);

// Other process can open the same buffer
var remoteBuffer = await memoryManager.OpenSharedAsync<float>(
    "SharedComputeBuffer"
);
```
Dependencies
- DotCompute.Abstractions: Core abstractions
- DotCompute.Core: Core runtime components
- System.IO.Pipelines: High-performance I/O
- Microsoft.Toolkit.HighPerformance: Performance utilities
- System.Runtime.InteropServices: Platform invoke support
Design Principles
- Zero-Copy First: Minimize data copying through Span<T> and Memory<T>
- Pool Everything: Reduce allocations through aggressive pooling
- Async by Default: Non-blocking operations for scalability
- Monitor Everything: Comprehensive statistics for diagnostics
- Fail Fast: Immediate validation and error reporting
- Thread-Safe: All operations safe for concurrent access
Documentation & Resources
Comprehensive documentation is available for DotCompute:
Architecture Documentation
- Memory Management Architecture - Unified buffer abstraction and pooling (90% allocation reduction)
- Backend Integration - P2P memory transfer architecture
Developer Guides
- Memory Management Guide - Memory pooling best practices and zero-copy techniques
- Performance Tuning - Memory optimization strategies (11.2x speedup)
- Multi-GPU Programming - P2P transfers (12 GB/s measured)
Examples
- Basic Vector Operations - Buffer management examples
- Image Processing - Memory-efficient image operations
API Documentation
- API Reference - Complete API documentation
- UnifiedBuffer Documentation - Buffer API reference
Support
- Documentation: Comprehensive Guides
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Contributing
Contributions are welcome, particularly in:
- Platform-specific memory optimizations
- Additional pooling strategies
- Transfer optimization techniques
- Memory usage profiling tools
See CONTRIBUTING.md for guidelines.
License
MIT License - Copyright (c) 2025 Michael Ivertowski
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
net9.0
- DotCompute.Abstractions (>= 0.6.2)
- DotCompute.Core (>= 0.6.2)
- Microsoft.Extensions.Configuration.Binder (>= 10.0.2)
- Microsoft.Extensions.Hosting.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Logging (>= 10.0.2)
- Microsoft.NET.ILLink.Tasks (>= 9.0.12)
- Microsoft.Toolkit.HighPerformance (>= 7.1.2)
- System.IO.Pipelines (>= 10.0.2)
- System.Runtime.InteropServices (>= 4.3.0)
NuGet packages (4)
Showing the top 4 NuGet packages that depend on DotCompute.Memory:
| Package | Description |
|---|---|
| DotCompute.Backends.CPU | Production-ready CPU compute backend for DotCompute. Provides SIMD vectorization (3.7x faster) using AVX2/AVX512/NEON instructions, multi-threaded kernel execution, and Ring Kernel simulation. Benchmarked: Vector Add (100K elements) 2.14ms → 0.58ms. Native AOT compatible with sub-10ms startup. |
| DotCompute.Backends.OpenCL | Production-ready OpenCL backend for DotCompute. Cross-platform GPU acceleration for NVIDIA, AMD, Intel, ARM Mali, and Qualcomm Adreno GPUs. Supports OpenCL 1.2+, Ring Kernels with atomic message queues, runtime kernel compilation, and multi-device workload distribution. Works with nvidia-opencl-icd, ROCm, intel-opencl-icd, and vendor drivers. |
| DotCompute.Linq | GPU-accelerated LINQ extensions for DotCompute. Transparent GPU execution for LINQ queries with automatic kernel generation, fusion optimization, and Reactive Extensions support. |
| DotCompute.Algorithms | GPU-accelerated algorithms for DotCompute. Includes FFT, AutoDiff, sparse matrix operations, signal processing, and cryptographic primitives. |