vforteli.DataLakeClientExtensions
0.3.0
DataLakeFileSystemClientExtension ListPathsParallelAsync
Extension method for listing paths in parallel with the Azure DataLakeFileSystemClient. In Azure Data Lake Storage Gen2, a recursive listing with the SDK's GetPathsAsync method on the DataLakeFileSystemClient can take tens of minutes or even hours with as few as a few hundred thousand files spread across directories.
This extension method uses multiple threads to avoid calling the expensive recursive version of GetPathsAsync. This improves performance significantly, although the actual numbers vary depending on the directory structure.
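The general idea can be illustrated with a minimal sketch. This is not the package's actual implementation: the `Entry` record, the `ParallelListingSketch` class, and the `listDirectory` delegate are hypothetical names, and the in-memory tree stands in for a real storage account. It assumes that each directory is listed with one non-recursive call and that newly discovered subdirectories are handed to a pool of workers through a `Channel`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical in-memory directory tree standing in for a storage account.
var tree = new Dictionary<string, Entry[]>
{
    ["/"] = new[] { new Entry("/a", true), new Entry("/b", true), new Entry("/root.txt", false) },
    ["/a"] = new[] { new Entry("/a/1.txt", false) },
    ["/b"] = new[] { new Entry("/b/2.txt", false), new Entry("/b/c", true) },
    ["/b/c"] = Array.Empty<Entry>(),
};

var all = await ParallelListingSketch.ListAllAsync(
    "/",
    async directory =>
    {
        await Task.Delay(10); // simulate the latency of one listing request
        return tree[directory];
    },
    workerCount: 4);

Console.WriteLine($"{all.Count} paths found");

public record Entry(string Path, bool IsDirectory);

public static class ParallelListingSketch
{
    // listDirectory is an assumed stand-in for a single non-recursive listing call.
    public static async Task<List<Entry>> ListAllAsync(
        string root,
        Func<string, Task<IReadOnlyList<Entry>>> listDirectory,
        int workerCount)
    {
        var pending = Channel.CreateUnbounded<string>();
        var results = new ConcurrentBag<Entry>();
        var outstanding = 1; // directories that are queued or being listed
        await pending.Writer.WriteAsync(root);

        async Task WorkerAsync()
        {
            try
            {
                await foreach (var directory in pending.Reader.ReadAllAsync())
                {
                    foreach (var entry in await listDirectory(directory))
                    {
                        results.Add(entry);
                        if (entry.IsDirectory)
                        {
                            // Count the subdirectory before queueing it so the
                            // counter cannot reach zero while work remains.
                            Interlocked.Increment(ref outstanding);
                            await pending.Writer.WriteAsync(entry.Path);
                        }
                    }

                    // The last finished directory closes the channel,
                    // which lets every worker leave its read loop.
                    if (Interlocked.Decrement(ref outstanding) == 0)
                    {
                        pending.Writer.TryComplete();
                    }
                }
            }
            catch (Exception ex)
            {
                // Close the channel on failure so the other workers do not wait forever.
                pending.Writer.TryComplete(ex);
                throw;
            }
        }

        await Task.WhenAll(Enumerable.Range(0, workerCount).Select(_ => WorkerAsync()));
        return results.ToList();
    }
}
```

With a real storage account, `listDirectory` would wrap one non-recursive listing request per directory, so the worker count bounds how many requests are in flight at once. The order of the returned entries is not deterministic.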
Benchmarks
These informal benchmarks were run on a storage account with a single filesystem containing 32 folders. Each folder contains 1600 subfolders and one file, and each subfolder contains 10 files.
Total files and folders: 563234.
Tests were run on a MacBook Pro M2 with a 100/10 Mbit connection against an Azure Storage Account with the Standard SKU and hierarchical namespace enabled (Data Lake Gen2).
Test | Duration |
---|---|
SDK GetPathsAsync | 474 sec |
ListPathsParallelAsync 16 threads | 157 sec |
ListPathsParallelAsync 128 threads | 25 sec |
ListPathsParallelAsync 256 threads | 17 sec |
Installation
Build from source or download NuGet package: https://www.nuget.org/packages/vforteli.DataLakeClientExtensions
Target frameworks: .NET 6 and .NET Standard 2.1
Usage
List files in directory
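using Azure.Storage.Files.DataLake;
// Also add a using directive for the namespace that contains ListPathsParallelAsync.

// sourceConnection is the storage account URI, for example one that includes a SAS token.
var sourceFileSystemClient = new DataLakeServiceClient(new Uri(sourceConnection)).GetFileSystemClient("somefilesystem");

// List paths with IAsyncEnumerable
await foreach (var path in sourceFileSystemClient.ListPathsParallelAsync("/"))
{
    // do something with PathItem
}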
Dependencies
net6.0: Azure.Storage.Files.DataLake (>= 12.14.0)
Release notes
Switch from BlockingCollection to Channel