Parquet.Net 4.5.4

There is a newer version of this package available.
See the version list below for details.
.NET CLI:
    dotnet add package Parquet.Net --version 4.5.4

Package Manager:
    NuGet\Install-Package Parquet.Net -Version 4.5.4
    This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

PackageReference:
    <PackageReference Include="Parquet.Net" Version="4.5.4" />
    For projects that support PackageReference, copy this XML node into the project file to reference the package.

Paket CLI:
    paket add Parquet.Net --version 4.5.4

Script & Interactive:
    #r "nuget: Parquet.Net, 4.5.4"
    The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or the source code of the script to reference the package.

Cake:
    // Install Parquet.Net as a Cake Addin
    #addin nuget:?package=Parquet.Net&version=4.5.4

    // Install Parquet.Net as a Cake Tool
    #tool nuget:?package=Parquet.Net&version=4.5.4

Apache Parquet for .NET

Fully portable, managed .NET library to 📖read and ✍️write Apache Parquet files. Targets .NET 7, .NET 6.0, .NET Core 3.1, .NET Standard 2.1 and .NET Standard 2.0.

Runs everywhere .NET runs: Linux, macOS, Windows, iOS, Android, Tizen, Xbox, PS4, Raspberry Pi, Samsung TVs and much more.

Quick Start

Why should I use this? I think you shouldn't. Go away and look at better alternatives, like PyArrow, which does it much better in Python. I'd also rather you used Apache Spark, with its native support for Parquet, or other commercial alternatives. Seriously. Compared to those, this library is just pure shite, developed in spare time by one person. Despite that, it's the de facto standard for .NET when it comes to reading and writing Parquet files. Why? Because:

  • It has zero dependencies - pure library that just works.
  • It's really fast. Faster than Python and Java implementations.
  • It's .NET native. Designed to utilise .NET and made for .NET developers.

Parquet is designed to handle complex data in bulk. It's column-oriented, meaning that data is physically stored in columns rather than rows. This matters a lot for big data systems: if you only need a subset of columns, reading just those columns is extremely efficient.

As a quick start, suppose we have the following data records we'd like to save to parquet:

  1. Timestamp.
  2. Event name.
  3. Meter value.

Or, to translate it to C# terms, this can be expressed as the following class:

class Record {
    public DateTime Timestamp { get; set; }
    public string EventName { get; set; }
    public double MeterValue { get; set; }
}

✍️Writing Data

Let's say you have around a million events like that to save to a .parquet file. There are three ways to do that with this library, from easiest to hardest.

🚤Class Serialisation

The first one is the easiest to work with, and the most straightforward. Let's generate those million fake records:

var data = Enumerable.Range(0, 1_000_000).Select(i => new Record {
    Timestamp = DateTime.UtcNow.AddSeconds(i),
    EventName = i % 2 == 0 ? "on" : "off",
    MeterValue = i 
}).ToList();

Now, to write these to a file at, say, /mnt/storage/data.parquet, you can use the following line of code:

await ParquetConvert.SerializeAsync(data, "/mnt/storage/data.parquet");

That's pretty much it! You can customise many things in addition to this magical process, but if you are a really lazy person, that will do just fine for today.

🌛Row Based API

Another way to serialise data is to use the row-based API. It looks at your data as a Table, which consists of a set of Rows. That's essentially the data viewed backwards from how the Parquet format sees it; however, it's how most people think about data. It's also useful when converting data between row-based formats and parquet. Anyway, use it, I won't judge you (very often).

Let's generate a million rows for our table, which is slightly more complicated. First, we need to declare the table and its schema:

var table = new Table(
    new DataField<DateTime>("Timestamp"),
    new DataField<string>("EventName"),
    new DataField<double>("MeterValue"));

The code above creates a new empty table with 3 fields, identical to the class serialisation example above. We are essentially declaring the table's schema here. The Parquet format is strongly typed: every row must have the same number of values, with the same types.
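
Strong typing doesn't mean every value must be present. A small sketch, assuming the generic DataField<T> accepts nullable value types (the namespaces in the using directives are the v4.x ones and are an assumption here):

```csharp
using System;
using Parquet.Rows;    // Table
using Parquet.Schema;  // DataField

// double? maps to an optional Parquet column; strings are nullable by default.
var nullableTable = new Table(
    new DataField<DateTime>("Timestamp"),
    new DataField<string>("EventName"),
    new DataField<double?>("MeterValue"));

// Rows may now carry null for the optional field:
nullableTable.Add(DateTime.UtcNow, "on", null);
```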

Now that empty table is ready, add a million rows to it:

for(int i = 0; i < 1_000_000; i++) {
    table.Add(
        DateTime.UtcNow.AddSeconds(i),
        i % 2 == 0 ? "on" : "off",
        (double)i);
}

The data will be identical to the example above. To write the table to a file:

await table.WriteAsync("/mnt/storage/data.parquet");

Of course this is a trivial example, and you can customise it further.
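
Before writing, you can also inspect what you've added. A minimal sketch, assuming Table exposes Count and row/cell indexers as a row-based container (the Parquet.Rows namespace is an assumption for v4.x):

```csharp
using System;
using Parquet.Rows;  // Table, Row

// `table` is the Table built above.
Console.WriteLine(table.Count);  // number of rows added so far
Row first = table[0];            // rows are addressable by index
Console.WriteLine(first[1]);     // second field of the first row, i.e. EventName
```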

⚙️Low Level API

And finally, the third method is the low level API. This is the most performant and the most Parquet-resembling way to work with data, but the least intuitive, and it requires some knowledge of Parquet data structures.

First of all, you need a schema. Always. Just like in the row-based example, the schema can be declared in the following way:

var schema = new ParquetSchema(
    new DataField<DateTime>("Timestamp"),
    new DataField<string>("EventName"),
    new DataField<double>("MeterValue"));

Then, data columns need to be prepared for writing. As Parquet is a column-based format, the low level API expects data as column slices. I'll just shut up and show you the code:

var column1 = new DataColumn(
    (DataField)schema[0],
    Enumerable.Range(0, 1_000_000).Select(i => DateTime.UtcNow.AddSeconds(i)).ToArray());

var column2 = new DataColumn(
    (DataField)schema[1],
    Enumerable.Range(0, 1_000_000).Select(i => i % 2 == 0 ? "on" : "off").ToArray());

var column3 = new DataColumn(
    (DataField)schema[2],
    Enumerable.Range(0, 1_000_000).Select(i => (double)i).ToArray());

An important thing to note here: each columnX variable represents an entire column, i.e. all the values in that column, independently from the other columns. Values across columns share the same row order. So we have created three columns with data identical to the two examples above.

Time to write it down:

using(Stream fs = System.IO.File.OpenWrite("/mnt/storage/data.parquet")) {
    using(ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fs)) {
        using(ParquetRowGroupWriter groupWriter = writer.CreateRowGroup()) {
            
            await groupWriter.WriteColumnAsync(column1);
            await groupWriter.WriteColumnAsync(column2);
            await groupWriter.WriteColumnAsync(column3);
            
        }
    }
}

What's going on?

  1. We create the output file stream. This will be the receiver of the parquet data. The stream needs to be writable and seekable. (You can probably use one of the overloads in the next step instead.)
  2. ParquetWriter is a low-level class and the root object to start writing from. It mostly performs coordination, checksumming and enveloping of other data.
  3. A row group is like a data partition inside the file. In this example we have just one, but you can create more if there are too many values to fit in memory.
  4. Three calls to the row group writer write out the columns. Note that they are performed sequentially, in the same order as the schema defines them.

Read more on writing here.
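
Row groups can also be created in a loop, writing one chunk of data at a time so that the full dataset is never in memory at once. A sketch using only the APIs shown above; the chunkSize value and the using directives (v4.x namespaces) are assumptions:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Parquet;        // ParquetWriter, ParquetRowGroupWriter
using Parquet.Data;   // DataColumn
using Parquet.Schema; // ParquetSchema, DataField

static async Task WriteChunkedAsync(string path, int chunkSize = 500_000) {
    var schema = new ParquetSchema(
        new DataField<DateTime>("Timestamp"),
        new DataField<string>("EventName"),
        new DataField<double>("MeterValue"));

    using Stream fs = File.OpenWrite(path);
    using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fs);

    for(int start = 0; start < 1_000_000; start += chunkSize) {
        int[] idx = Enumerable
            .Range(start, Math.Min(chunkSize, 1_000_000 - start))
            .ToArray();

        // One row group per chunk; disposed (and flushed) at end of each iteration.
        using ParquetRowGroupWriter groupWriter = writer.CreateRowGroup();
        await groupWriter.WriteColumnAsync(new DataColumn((DataField)schema[0],
            idx.Select(i => DateTime.UtcNow.AddSeconds(i)).ToArray()));
        await groupWriter.WriteColumnAsync(new DataColumn((DataField)schema[1],
            idx.Select(i => i % 2 == 0 ? "on" : "off").ToArray()));
        await groupWriter.WriteColumnAsync(new DataColumn((DataField)schema[2],
            idx.Select(i => (double)i).ToArray()));
    }
}
```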

📖Reading Data

Reading data also has three different approaches, so I'm going to walk through them in the same order as above.

🚤Class Serialisation

Provided that you have written the data, or have some external data with the same structure as above, you can read it back by simply doing the following:

Record[] data2 = await ParquetConvert.DeserializeAsync<Record>("/mnt/storage/data.parquet");

This will give us an array with one million class instances.

Of course class serialisation has more to it, and you can customise it further than that.

🌛Row Based API

A read counterpart to the write example above is also a simple one-liner:

Table tbl = await Table.ReadAsync("/mnt/storage/data.parquet");

This will do the magic behind the scenes and give you the table schema and rows.

As always, there's more to it.

⚙️Low Level API

And with the low level API, reading is even more flexible:

using(Stream fs = System.IO.File.OpenRead("/mnt/storage/data.parquet")) {
    using(ParquetReader reader = await ParquetReader.CreateAsync(fs)) {
        for(int i = 0; i < reader.RowGroupCount; i++) { 
            using(ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(i)) {

                foreach(DataField df in reader.Schema.GetDataFields()) {
                    DataColumn columnData = await rowGroupReader.ReadColumnAsync(df);

                    // do something to the column...
                }
            }
        }
    }
}

This is what's happening:

  1. Create read stream fs.
  2. Create ParquetReader - root class for read operations.
  3. The reader has RowGroupCount property which indicates how many row groups (like partitions) the file contains.
  4. Explicitly open row group for reading.
  5. Read each DataField from the row group, in the same order as they are declared in the schema.
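
Because the format is columnar, step 5 doesn't have to touch every column. A sketch that decodes only the MeterValue column (the field name from the Record class above) from each row group and sums it; the using directives (v4.x namespaces) are assumptions:

```csharp
using System.IO;
using System.Linq;
using Parquet;        // ParquetReader, ParquetRowGroupReader
using Parquet.Data;   // DataColumn
using Parquet.Schema; // DataField

using Stream fs = File.OpenRead("/mnt/storage/data.parquet");
using ParquetReader reader = await ParquetReader.CreateAsync(fs);

// Pick a single field from the schema by name.
DataField meterField = reader.Schema.GetDataFields()
    .First(f => f.Name == "MeterValue");

double total = 0;
for(int i = 0; i < reader.RowGroupCount; i++) {
    using ParquetRowGroupReader rg = reader.OpenRowGroupReader(i);
    DataColumn col = await rg.ReadColumnAsync(meterField); // only this column is decoded
    total += ((double[])col.Data).Sum();
}
```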

Hint: you can also use web based reader app to test your files, which was created using this library!

Choosing the API

If you have a choice, then the choice is easy: use the Low Level API. It is the fastest and the most flexible. But what if, for some reason, you don't have a choice? Then consider this:

Feature               | 🚤Class Serialisation  | 🌛Table API      | ⚙️Low Level API
----------------------|------------------------|------------------|-----------------
Performance           | high                   | very low         | very high
Developer Convenience | feels like C# (great!) | feels like Excel | close to Parquet
Row based access      | easy                   | easy             | hard
Column based access   | hard                   | hard             | easy

Contributing

Any contributions are welcome, in any form: documentation, code, tests, donations or anything else. I don't like processes, so anything goes. If you happen to get interested in parquet development, there are some interesting links.

Product Compatible and additional computed target framework versions.
.NET net5.0 was computed.  net5.0-windows was computed.  net6.0 is compatible.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 is compatible.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
.NET Core netcoreapp2.0 was computed.  netcoreapp2.1 was computed.  netcoreapp2.2 was computed.  netcoreapp3.0 was computed.  netcoreapp3.1 is compatible. 
.NET Standard netstandard2.0 is compatible.  netstandard2.1 is compatible. 
.NET Framework net461 was computed.  net462 was computed.  net463 was computed.  net47 was computed.  net471 was computed.  net472 was computed.  net48 was computed.  net481 was computed. 
MonoAndroid monoandroid was computed. 
MonoMac monomac was computed. 
MonoTouch monotouch was computed. 
Tizen tizen40 was computed.  tizen60 was computed. 
Xamarin.iOS xamarinios was computed. 
Xamarin.Mac xamarinmac was computed. 
Xamarin.TVOS xamarintvos was computed. 
Xamarin.WatchOS xamarinwatchos was computed. 

NuGet packages (26)

Showing the top 5 NuGet packages that depend on Parquet.Net:

Package Downloads
ChoETL.Parquet

Parquet extension to Cinchoo ETL framework

Microsoft.DataPrep

Microsoft Azure Machine Learning Data Preparation SDK.

Komodo.Core

Komodo core libraries for crawling (file, object, web, database), parsing (JSON, XML, SQL, Sqlite, HTML, text), postings (inverted index, token extraction), indexing (search), metadata generation, and integrating within your application. Komodo is an information search, metadata, storage, and retrieval platform.

Microsoft.ML.Parquet

ML.NET components for Apache Parquet support.

PCAxis.Serializers

Paxiom serializers for formats like Excel, jsonstat, jsonstat2, sdmx

GitHub repositories (8)

Showing the top 5 popular GitHub repositories that depend on Parquet.Net:

Repository Stars
dotnet/machinelearning
ML.NET is an open source and cross-platform machine learning framework for .NET.
ravendb/ravendb
ACID Document Database
Cinchoo/ChoETL
ETL framework for .NET (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)
mukunku/ParquetViewer
Simple Windows desktop application for viewing & querying Apache Parquet files
compomics/ThermoRawFileParser
Thermo RAW file parser that runs on Linux/Mac and all other platforms that support Mono
Version Downloads Last updated
5.0.1 24,138 10/14/2024
5.0.1-pre.1 497 10/3/2024
5.0.0 12,908 10/3/2024
5.0.0-pre.4 58 10/2/2024
5.0.0-pre.3 56 10/2/2024
5.0.0-pre.2 511 9/30/2024
5.0.0-pre.1 123 9/23/2024
4.25.0 71,977 9/9/2024
4.25.0-pre.3 126 9/6/2024
4.25.0-pre.2 2,156 6/11/2024
4.25.0-pre.1 97 6/7/2024
4.24.0 283,875 6/6/2024
4.24.0-pre.8 70 6/4/2024
4.24.0-pre.7 70 6/4/2024
4.24.0-pre.6 69 6/3/2024
4.24.0-pre.5 54 6/3/2024
4.24.0-pre.4 84 5/31/2024
4.24.0-pre.3 70 5/31/2024
4.24.0-pre.2 464 5/28/2024
4.24.0-pre.1 113 5/22/2024
4.23.5 218,866 4/4/2024
4.23.4 425,132 2/2/2024
4.23.3 19,508 1/25/2024
4.23.2 8,503 1/22/2024
4.23.1 6,701 1/19/2024
4.23.0 2,558 1/18/2024
4.22.1 8,188 1/17/2024
4.22.0 55,483 1/11/2024
4.20.1 4,779 1/10/2024
4.20.0 13,779 1/8/2024
4.19.0 23,042 1/5/2024
4.18.1 10,270 12/31/2023
4.18.0 23,189 12/22/2023
4.17.0 289,602 11/14/2023
4.16.4 502,122 9/11/2023
4.16.3 27,987 9/4/2023
4.16.2 95,174 8/22/2023
4.16.1 4,208 8/21/2023
4.16.0 2,601 8/17/2023
4.15.0 145,933 6/30/2023
4.14.0 28,294 6/28/2023
4.13.0 88,462 6/20/2023
4.12.0 132,613 5/22/2023
4.11.3 10,710 5/18/2023
4.11.2 49,771 5/16/2023
4.11.1 34,462 5/10/2023
4.11.0 4,905 5/9/2023
4.10.1 36,765 5/2/2023
4.10.0 30,714 4/26/2023
4.9.2 2,803 4/25/2023
4.9.1 3,421 4/21/2023
4.9.0 2,164 4/21/2023
4.8.1 10,499 4/19/2023
4.8.0 2,480 4/18/2023
4.8.0-alpha-00 1,553 4/17/2023
4.7.1 6,225 4/14/2023
4.7.0 7,377 4/13/2023
4.6.2 66,647 3/28/2023
4.6.1 5,498 3/23/2023
4.6.0 12,105 3/21/2023
4.5.4 322,440 2/23/2023
4.5.3 16,305 2/22/2023
4.5.2 10,916 2/20/2023
4.5.1 7,943 2/14/2023
4.5.0 7,303 2/13/2023
4.4.7 12,881 2/8/2023
4.4.6 47,735 1/31/2023
4.4.5 3,220 1/30/2023
4.4.4 2,529 1/27/2023
4.4.3 3,405 1/26/2023
4.4.2 1,726 1/26/2023
4.4.1 2,083 1/25/2023
4.4.0 3,116 1/24/2023
4.3.4 1,963 1/23/2023
4.3.3 2,146 1/20/2023
4.3.2 2,043 1/19/2023
4.3.1 1,682 1/19/2023
4.3.0 2,658 1/18/2023
4.2.3 6,565 1/16/2023
4.2.2 10,023 1/11/2023
4.2.1 7,374 1/10/2023
4.2.0 1,954 1/10/2023
4.1.3 64,518 12/21/2022
4.1.2 27,853 12/1/2022
4.1.1 117,190 11/10/2022
4.1.0 223,356 10/13/2022
4.0.2 10,497 10/12/2022
4.0.1 26,496 10/11/2022
4.0.0 56,838 9/22/2022
3.10.0 301,202 9/20/2022
3.9.1 1,529,580 10/14/2021
3.9.0 370,008 6/25/2021
3.8.6 456,036 3/5/2021
3.8.5 26,349 2/23/2021
3.8.4 270,426 12/13/2020
3.8.3 2,386 12/10/2020
3.8.2 1,939 12/10/2020
3.8.1 31,791 11/6/2020
3.8.0 4,309 11/6/2020
3.7.7 327,489 6/25/2020
3.7.6 25,583 6/16/2020
3.7.5 14,381 6/8/2020
3.7.4 187,071 5/19/2020
3.7.2 3,866 5/18/2020
3.7.1 48,729 4/21/2020
3.7.0 56,787 4/19/2020
3.6.0 5,430,634 1/23/2020
3.5.3 13,963 1/8/2020
3.5.2 3,518 1/3/2020
3.5.1 2,225 12/31/2019
3.5.0 9,599 12/18/2019
3.4.3 7,361 12/16/2019
3.4.2 3,768 12/13/2019
3.4.1 2,175 12/11/2019
3.4.0 2,981 12/11/2019
3.3.11 6,881 12/1/2019
3.3.10 48,329 11/6/2019
3.3.9 187,933 8/15/2019
3.3.8 8,772 8/1/2019
3.3.7 2,157 8/1/2019
3.3.6 2,271 7/31/2019
3.3.5 32,964 7/5/2019
3.3.4 183,449 3/11/2019
3.3.3 20,377 2/1/2019
3.3.2 26,422 1/21/2019
3.3.1 4,592 1/14/2019
3.3.0 3,701 1/11/2019
3.2.6 2,876 1/11/2019
3.2.5 4,540 1/3/2019
3.2.4 10,289 11/21/2018
3.2.3 50,621 11/7/2018
3.2.2 5,367 10/30/2018
3.2.1 2,459 10/30/2018
3.2.0 3,154 10/24/2018
3.1.4 2,492 10/15/2018
3.1.3 2,370 10/15/2018
3.1.2 46,615 10/11/2018
3.1.1 2,850 10/4/2018
3.1.0 2,787 10/3/2018
3.1.0-preview-390 2,067 10/3/2018
3.1.0-preview-373 2,328 10/2/2018
3.0.5 8,998 8/13/2018
3.0.4 2,525 7/25/2018
3.0.3 2,377 7/25/2018
3.0.2 2,829 7/24/2018
3.0.1 2,373 7/24/2018
3.0.0 3,320 7/19/2018
2.1.4 94,986 6/7/2018
2.1.3 260,557 3/30/2018
2.1.2 440,998 1/10/2018
2.1.1 115,645 12/1/2017
2.1.0 2,711 11/29/2017
2.0.1 2,513 11/27/2017
2.0.0 3,447 11/27/2017
1.5.1 3,323 11/14/2017
1.4.0 6,630 10/23/2017
1.3.0 5,346 9/12/2017
1.2.139 3,442 9/6/2017
1.1.128 3,313 8/15/2017
1.0.114 2,672 7/31/2017