RapidCsv 0.0.1

.NET CLI:
dotnet add package RapidCsv --version 0.0.1

Package Manager Console in Visual Studio (uses the NuGet module's version of Install-Package):
NuGet\Install-Package RapidCsv -Version 0.0.1

PackageReference (for projects that support it, copy this XML node into the project file):
<PackageReference Include="RapidCsv" Version="0.0.1" />

Paket CLI:
paket add RapidCsv --version 0.0.1

#r directive (F# Interactive and Polyglot Notebooks; copy this into the interactive tool or the script's source code):
#r "nuget: RapidCsv, 0.0.1"

Cake:
// Install RapidCsv as a Cake Addin
#addin nuget:?package=RapidCsv&version=0.0.1

// Install RapidCsv as a Cake Tool
#tool nuget:?package=RapidCsv&version=0.0.1

Fast CSV Validator and Transformer

A .NET library for fast and efficient validation and transformation of CSV files.

Structural CSV validation rules adhere to RFC 4180.

Additional content validation rules can be configured by supplying an optional JSON validation profile. A validation profile can specify column names, data types, and column rules (e.g. whether data for that column is required, what the min/max length should be, and so on).

Performance

RFC 4180 validation on a 40 column, 100,000 row CSV file takes 235 ms and allocates a total of 100 MB of memory on an old Intel laptop CPU from the 2010s. See benchmark results for more.

You can run benchmarks using a special benchmarking project by navigating to tests/RapidCsv.Benchmarks and running dotnet run -c Release.

Basic Usage - Validate a CSV file against RFC 4180

  1. Add a reference to RapidCsv in your .csproj file:
<Project Sdk="Microsoft.NET.Sdk">

  <ItemGroup>
    <ProjectReference Include="..\..\src\RapidCsv\RapidCsv.csproj" />
  </ItemGroup>

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

</Project>
  2. Add a using RapidCsv; directive at the top of your class file.

  3. Create a CsvValidator object and call its Validate method, passing both a stream and a ValidationOptions object into that method.

using RapidCsv;

string csvContent = @"NAME,AGE,DOB
John,23,1/1/2012
Mary,34,1/1/1990
Jane,25,1/1/2010
Hana,55,1/1/1970";

CsvValidator validator = new CsvValidator();
var options = new ValidationOptions()
{
    Separator = ',',
    HasHeaderRow = true
};

Stream content = GenerateStreamFromString(csvContent);
ValidationResult result = validator.Validate(content: content, options: options);

Console.WriteLine($"Valid File = {result.IsValid}");

static Stream GenerateStreamFromString(string s)
{
    var stream = new MemoryStream();
    var writer = new StreamWriter(stream);
    writer.Write(s);
    writer.Flush();
    stream.Position = 0;
    return stream;
}

Examples

The examples folder contains example code that demonstrates how to use RapidCsv.

Simplest Example: .NET Console App

Let's look at the RapidCsv.ConsoleDemo project.

  1. Navigate to examples/demo-console/ in a terminal of your choice.
  2. Enter the following into the terminal:
dotnet run
  3. Observe the following output:
Valid File = True
 Data Rows         = 4
 Elapsed time (ms) = 3ms
 Columns           = 3
 Error count       = 0
 Warning count     = 0
 Headers = 
  Column 1 = NAME
  Column 2 = AGE
  Column 3 = DOB

That's all there is to it.

This console app includes a hard-coded CSV file in program.cs to make it as simple as possible to run the example. A CSV input file is therefore not required.

Architecture and Design Decisions

RapidCsv is meant to be used in situations where one needs speed and memory efficiency at scale. For instance, if you're required to process CSV files in near real-time at high volume, where validation results are viewable by clients almost instantly after file submission, then this is a library worth considering.

This use case is also why the library was built, and it shapes the design decisions behind how the code is written.

High performance and memory efficiency

The use of ReadOnlySpan<T> in the library is intentional. A simpler way of dealing with CSV files might be to use string.Split(','), but splitting copies the string's contents into new memory (the array of string fragments that Split() generates). This increases memory use, the extra allocations make the code slightly slower, and the garbage collector has more work to do cleaning up all that duplicated memory.

By using ReadOnlySpan<T>, a lower-level API in .NET, we can get a view into a subset of the string instead of creating copies. The trade-off is that spans are harder to work with in practice and make the code harder to read and maintain.
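To make the difference concrete, here is a minimal sketch (not taken from the RapidCsv source) that reads the first field of a line both ways. It only uses standard .NET APIs; the variable names are made up for illustration.

```csharp
using System;

string line = "John,23,1/1/2012";

// string.Split allocates a new array plus one new string per field.
string[] parts = line.Split(',');

// A span is a view over the original string's memory: slicing copies nothing.
ReadOnlySpan<char> span = line.AsSpan();
int firstComma = span.IndexOf(',');
ReadOnlySpan<char> firstField = span.Slice(0, firstComma);

// Compare the view against the copied string without allocating.
Console.WriteLine(firstField.SequenceEqual(parts[0].AsSpan())); // True
Console.WriteLine(firstField.Length);                           // 4
```

Both approaches see the same characters, but only the Split version creates new strings for the garbage collector to clean up later.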

A state machine-like algorithm is needed to parse each line in a CSV file. The algorithm goes character-by-character over the ReadOnlySpan<char> and must keep track of things like whether it's in a quoted field or not in order to know how to interpret the current character. Meanwhile, it must validate what it finds.
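As a rough illustration of that idea, the sketch below walks a single line character by character and tracks whether it is inside a quoted field. CountFields is a hypothetical helper written for this README, not RapidCsv's actual parser, and it is deliberately simplified: it handles one line at a time, so quoted fields that span multiple lines are not covered.

```csharp
using System;

Console.WriteLine(CountFields("John,23,1/1/2012", ','));           // 3
Console.WriteLine(CountFields("\"Doe, John\",23", ','));           // 2
Console.WriteLine(CountFields("\"He said \"\"hi\"\"\",x", ','));   // 2
Console.WriteLine(CountFields("Jo\"hn,23", ','));                  // -1

// Hypothetical helper (an assumption for illustration, not RapidCsv's API).
// Returns the number of fields in the line, or -1 if the line is malformed.
static int CountFields(ReadOnlySpan<char> line, char separator)
{
    int fields = 1;
    bool inQuotes = false;    // currently inside a quoted field
    bool fieldStart = true;   // no characters of the current field seen yet
    bool afterQuote = false;  // a quoted field just closed; only a separator may follow

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"')
            {
                if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    i++; // doubled quote: an escaped quote inside the field
                }
                else
                {
                    inQuotes = false;
                    afterQuote = true;
                }
            }
        }
        else if (c == separator)
        {
            fields++;
            fieldStart = true;
            afterQuote = false;
        }
        else if (afterQuote)
        {
            return -1; // text directly after a closing quote
        }
        else if (c == '"')
        {
            if (!fieldStart) return -1; // quote in the middle of an unquoted field
            inQuotes = true;
            fieldStart = false;
        }
        else
        {
            fieldStart = false;
        }
    }

    return inQuotes ? -1 : fields; // an unterminated quote is an error
}
```

The same character means different things depending on state: a comma inside quotes is data, while a comma outside quotes ends the field. That is why a simple split on the separator is not enough.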

No limits on file size

RapidCsv operates on streams. Unlike some competing libraries, it does not need to read the whole CSV file at once, and the fast performance means even larger files (e.g. 100k rows) can be validated in under 1 second.
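The general pattern behind stream-based processing looks something like the sketch below. It is not RapidCsv's internal code, just standard StreamReader usage. A MemoryStream stands in for a large file here; a FileStream or network stream would work the same way.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in for a large file: any Stream works here.
using Stream content = new MemoryStream(Encoding.UTF8.GetBytes("NAME,AGE\nJohn,23\nMary,34\n"));
using var reader = new StreamReader(content);

int rows = 0;
int longestLine = 0;

// Only the current line is held in memory; earlier lines can be collected.
while (reader.ReadLine() is { } line)
{
    rows++;
    longestLine = Math.Max(longestLine, line.Length);
}

Console.WriteLine(rows);        // 3
Console.WriteLine(longestLine); // 8
```

Memory use depends on the longest line rather than the size of the file, which is what makes very large inputs practical.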

Human-readable error messages

Readable and understandable error messages are critical. Detected errors produce messages that even users with low technical skills should be able to understand, within reason.

Ease of use by developers

The library is meant to be super easy to use by developers. It's one function call in one class:

CsvValidator validator = new CsvValidator();

var options = new ValidationOptions()
{
    Separator = ',',
    HasHeaderRow = true
};

ValidationResult result = validator.Validate(content: content, options: options);

In the code snippet above, we create a validator object, pass it some very basic options, and then call the validator's Validate method. Without more advanced options, this validates the file against the RFC 4180 specification.

The content in this case is of type Stream. You can then do useful things with the result type you get back, such as iterate over all the errors/warnings or read a boolean flag to see if the file is valid or invalid.

There are more advanced things you can do with the Validate method, such as specifying a JSON content validation configuration. This goes beyond RFC 4180 and checks field content against your supplied regular expressions, data type specifications, min/max values, and other rules. Supplying such a configuration is not required.

Few to no dependencies

The software supply chain is hard to secure today. RapidCsv currently has no dependencies.

Configurable content validation rules

Do you need to go beyond RFC 4180 rules for your real-time CSV validation needs? The validation rules allow you to specify some basic content validation checks, such as min/max length, regular expression checks, formatting checks, and data types. These show up as error type Content to distinguish them from RFC 4180 errors, which show up as error type Structural.
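To give a feel for what checks like these involve, here is a small standalone sketch. It does not use RapidCsv's validation profile format or API. The CheckAge helper, the rule values, and the message wording are all assumptions invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical rule values for an AGE column (not RapidCsv's profile format).
int minLength = 1;
int maxLength = 3;
var digitsOnly = new Regex("^[0-9]+$");

foreach (string value in new[] { "23", "", "abc", "1234" })
{
    Console.WriteLine($"'{value}': {CheckAge(value, minLength, maxLength, digitsOnly)}");
}

// Hypothetical helper: returns "ok" or a human-readable content error.
static string CheckAge(string field, int min, int max, Regex pattern)
{
    if (field.Length < min) return "Content error: a value is required";
    if (field.Length > max) return $"Content error: value is longer than {max} characters";
    if (!pattern.IsMatch(field)) return "Content error: value must contain only digits";
    return "ok";
}
```

Checks of this kind run on field content only after the line has passed structural parsing, which is why the two error types are reported separately.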

Target frameworks

  • net8.0

Dependencies

  • No dependencies.

Version Downloads Last updated
0.0.1 96 8/27/2024