Imagibee.Gigantor 1.0.1

There is a newer version of this package available.
See the version list below for details.
.NET CLI:
dotnet add package Imagibee.Gigantor --version 1.0.1

Package Manager:
NuGet\Install-Package Imagibee.Gigantor -Version 1.0.1
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

PackageReference:
<PackageReference Include="Imagibee.Gigantor" Version="1.0.1" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.

Paket CLI:
paket add Imagibee.Gigantor --version 1.0.1

Script & Interactive:
#r "nuget: Imagibee.Gigantor, 1.0.1"
The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

Cake:
// Install Imagibee.Gigantor as a Cake Addin
#addin nuget:?package=Imagibee.Gigantor&version=1.0.1

// Install Imagibee.Gigantor as a Cake Tool
#tool nuget:?package=Imagibee.Gigantor&version=1.0.1

Gigantor

Boosts regular expression performance and adds support for gigantic files and streams.

It solves the following problems:

  • file exceeds the size of memory
  • CPUs are under-utilized
  • main thread is unresponsive
  • searching streams
  • searching compressed data

The approach is to partition the data into chunks, which are processed in parallel using a System.Threading.ThreadPool of background threads. Because the work happens on background threads, the main thread stays responsive. Because the chunks are reasonably sized, the whole file does not need to fit into memory.
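The chunking idea can be sketched with standard .NET calls alone. The example below is only an illustration under stated assumptions, not Gigantor's implementation: ChunkedCountSketch and CountByte are hypothetical names, and counting a byte value stands in for whatever per-chunk work is needed. Each thread pool work item opens its own stream and holds a single chunk in memory.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper for illustration; not part of Imagibee.Gigantor.
public static class ChunkedCountSketch
{
    // Counts occurrences of a byte in a file by processing fixed-size
    // chunks on thread pool (background) threads.
    public static long CountByte(string path, byte target, int chunkSize)
    {
        long fileLength = new FileInfo(path).Length;
        int chunkCount = (int)((fileLength + chunkSize - 1) / chunkSize);
        if (chunkCount == 0) return 0;

        long total = 0;
        using var done = new CountdownEvent(chunkCount);
        for (int i = 0; i < chunkCount; i++)
        {
            long offset = (long)i * chunkSize;
            ThreadPool.QueueUserWorkItem(_ =>
            {
                // Each worker opens its own stream and buffers only one chunk.
                using var fs = new FileStream(
                    path, FileMode.Open, FileAccess.Read, FileShare.Read);
                fs.Seek(offset, SeekOrigin.Begin);
                var buffer = new byte[chunkSize];
                int filled = 0, read;
                while (filled < chunkSize &&
                       (read = fs.Read(buffer, filled, chunkSize - filled)) > 0)
                {
                    filled += read;
                }
                long count = 0;
                for (int k = 0; k < filled; k++)
                {
                    if (buffer[k] == target) count++;
                }
                Interlocked.Add(ref total, count);
                done.Signal();
            });
        }

        // Wait in short intervals; a caller could report progress here.
        while (!done.Wait(100)) { }
        return total;
    }
}
```

Memory use in this sketch is bounded by the chunk size times the number of work items running at once, not by the file size. Error handling is omitted: a worker that throws would leave the wait loop spinning.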

RegexSearcher

RegexSearcher is the class that boosts regular expression performance for gigantic files or streams. Search was benchmarked at about 2.7 Gigabyte/s, which was roughly 4x faster than the single-threaded baseline. It depends on a System.Text.RegularExpressions.Regex to search the partitions, and it uses an overlap to handle matches that fall on partition boundaries. De-duping of the overlap regions is performed automatically at the end of the search, so the final results are free of duplicates. For use cases that involve several regular expressions or several files, performance can be further enhanced by searching them simultaneously.

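// Create a regular expression to match URLs
// (no leading or trailing "/": those are JavaScript regex delimiters
// and would be matched as literal slashes in .NET)
System.Text.RegularExpressions.Regex regex = new(
    @"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#()?&//=]*)",
    RegexOptions.Compiled);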

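// Create the searcher (progress is not defined in this snippet; it is
// assumed to be a System.Threading.AutoResetEvent created beforehand)
Imagibee.Gigantor.RegexSearcher searcher = new("myfile", regex, progress);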

// Do the search
Imagibee.Gigantor.Background.StartAndWait(
    searcher,
    progress,
    (_) => { Console.Write("."); },
    1000);

// Do something with the matches
foreach (var match in searcher.GetMatchData()) {
    ...
}
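The overlap and de-duping described above can be illustrated with a small in-memory sketch. It is an assumption-laden illustration rather than Gigantor's code: OverlapSearchSketch and Search are hypothetical names, the input is a string instead of a file, and it assumes every match is no longer than the overlap and that matching does not depend on where scanning starts (true for literals and many simple patterns, not for every regular expression).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of Imagibee.Gigantor.
public static class OverlapSearchSketch
{
    // Returns (index, value) pairs for the matches of regex in text,
    // searching fixed-size partitions in parallel.
    public static List<(int Index, string Value)> Search(
        string text, Regex regex, int chunkSize, int overlap)
    {
        var found = new ConcurrentBag<(int Index, string Value)>();
        int chunkCount = (text.Length + chunkSize - 1) / chunkSize;

        Parallel.For(0, chunkCount, i =>
        {
            int start = i * chunkSize;
            // Extend the partition by the overlap so that a match
            // straddling the boundary is still seen in one piece.
            int length = Math.Min(chunkSize + overlap, text.Length - start);
            foreach (Match m in regex.Matches(text.Substring(start, length)))
            {
                // Convert the partition-relative index to an absolute one.
                found.Add((start + m.Index, m.Value));
            }
        });

        // De-dupe: a match inside an overlap region is reported by two
        // partitions, so keep one entry (the longest) per start position.
        return found
            .GroupBy(m => m.Index)
            .Select(g => g.OrderByDescending(m => m.Value.Length).First())
            .OrderBy(m => m.Index)
            .ToList();
    }
}
```

A call such as OverlapSearchSketch.Search(text, new Regex("needle"), 10, 8) reports a match that lies in an overlap region only once, even though two partitions see it.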

LineIndexer

LineIndexer is the class that creates a mapping between line numbers and file positions for gigantic files. Once the mapping has been created, it can be used to quickly find the position at the start of a line or the line number that contains a position. Index creation was benchmarked at about 2.5 Gigabyte/s, which was roughly 4x faster than the single-threaded baseline.

// Create the indexer
LineIndexer indexer = new("myfile", progress);

// Do the indexing
Imagibee.Gigantor.Background.StartAndWait(
    indexer,
    progress,
    (_) => { Console.Write("."); },
    1000);

// Use indexer to print the middle line
using System.IO.FileStream fs = new("myfile", FileMode.Open);
Imagibee.Gigantor.StreamReader reader = new(fs);
fs.Seek(indexer.PositionFromLine(indexer.LineCount / 2), SeekOrigin.Begin);
Console.WriteLine(reader.ReadLine());
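To make the line-number-to-position mapping concrete, here is a minimal single-threaded sketch built only on standard .NET types. It is not how LineIndexer is implemented: LineIndexSketch is a hypothetical class, it assumes lines end with '\n', and it numbers lines from 1, which is a choice made for this sketch rather than a statement about Gigantor's convention.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical single-threaded index for illustration only.
public sealed class LineIndexSketch
{
    // starts[k] is the byte offset at which line k + 1 begins.
    private readonly List<long> starts = new List<long> { 0 };

    public LineIndexSketch(Stream stream)
    {
        var buffer = new byte[64 * 1024];
        long position = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // A new line starts right after each '\n' byte.
                if (buffer[i] == (byte)'\n') starts.Add(position + i + 1);
            }
            position += read;
        }
        // Drop the phantom empty line that follows a trailing newline.
        if (starts.Count > 1 && starts[starts.Count - 1] == position)
        {
            starts.RemoveAt(starts.Count - 1);
        }
    }

    public int LineCount => starts.Count;

    // Byte offset of the first character of the given 1-based line.
    public long PositionFromLine(int line) => starts[line - 1];

    // 1-based number of the line that contains the given byte offset.
    public int LineFromPosition(long position)
    {
        int i = starts.BinarySearch(position);
        // On a miss, ~i is the index of the next line start, which equals
        // the 1-based number of the line containing the position.
        return i >= 0 ? i + 1 : ~i;
    }
}
```

Looking up a position by line is an array access, and the reverse lookup is a binary search over the sorted line starts, so both stay fast once the index has been built.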

Input Data

The input data can be either uncompressed files or streams. Files should be used when possible because they were benchmarked to be faster than streams. However, one notable use case for streams is searching compressed data without decompressing it to disk first.
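The compressed-data use case can be sketched with System.IO.Compression alone. This example does not use Gigantor's stream support and makes no claim about its API: CompressedSearchSketch is a hypothetical helper that decompresses a gzip stream on the fly and searches it line by line, so nothing is written to disk.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Imagibee.Gigantor.
public static class CompressedSearchSketch
{
    // Yields the value of every match found in a gzip-compressed stream,
    // decompressing in memory as the data is read.
    public static IEnumerable<string> Search(Stream gzipData, Regex regex)
    {
        using var unzipped = new GZipStream(gzipData, CompressionMode.Decompress);
        using var reader = new StreamReader(unzipped);
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            foreach (Match m in regex.Matches(line))
            {
                yield return m.Value;
            }
        }
    }
}
```

Because the search runs line by line, this sketch cannot find matches that span lines, and it is single-threaded; it only shows that a decompressing stream can be searched directly.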

Examples

Benchmarks

Testing

Prior to running the tests, run Scripts/setup to prepare the test files. This script creates some large files in the temporary folder, which are deleted on reboot. Once setup has been completed, run Scripts/test.

License

MIT

Versioning

This package uses semantic versioning. Tags on the main branch indicate versions. It is recommended to use a tagged version. The latest version on the main branch should be considered under development when it is not tagged.

Issues

Report and track issues here.

Contributing

Minor changes such as bug fixes are welcome. Simply make a pull request. Please discuss more significant changes prior to making the pull request by opening a new issue that describes the change.

Compatible and additional computed target framework versions, by product:
.NET: net5.0 is compatible. net5.0-windows was computed. net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 is compatible. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed.
.NET Core: netcoreapp3.0 was computed. netcoreapp3.1 is compatible.
.NET Standard: netstandard2.1 is compatible.
MonoAndroid: monoandroid was computed.
MonoMac: monomac was computed.
MonoTouch: monotouch was computed.
Tizen: tizen60 was computed.
Xamarin.iOS: xamarinios was computed.
Xamarin.Mac: xamarinmac was computed.
Xamarin.TVOS: xamarintvos was computed.
Xamarin.WatchOS: xamarinwatchos was computed.


Version Downloads Last updated
3.0.0 477 4/5/2023
2.0.1 197 4/3/2023
2.0.0 191 4/3/2023
1.0.2 184 3/30/2023
1.0.1 202 3/25/2023
1.0.0 216 3/24/2023
0.8.2 246 3/8/2023
0.8.1 217 3/8/2023
0.8.0 239 3/6/2023
0.7.1 234 3/6/2023
0.7.0 232 3/5/2023
0.6.3 233 3/1/2023
0.6.2 232 2/21/2023
0.6.1 235 2/18/2023
0.6.0 260 2/18/2023
0.5.0 249 2/13/2023
0.4.1 256 2/10/2023
0.4.0 271 2/8/2023
0.3.5 374 2/7/2023
0.3.4 243 2/7/2023
0.3.3 251 2/6/2023
0.3.2 271 2/6/2023
0.3.1 261 2/6/2023