Sep 0.2.0 - Even Faster Parsing (~10 GB/s on Zen 3) and More
Shortly after Introducing Sep - Possibly the World’s Fastest .NET CSV Parser, Mark Pflug, author of Sylvan.Data.Csv, adopted the PackUnsignedSaturate trick from Sep and added a Vector256/Avx2 path in version 1.3.2 for impressive improvements. This meant Sylvan was suddenly slightly faster than Sep at the very low-level parsing of rows only. This I, of course, had to address, and Sep 0.2.0 is the result. 😅
As before, the Sep README.md contains the latest description and benchmarks for the library. In this blog post I will quickly point out the changes made and go into the performance improvements.
Release Notes
- Replace the “char + position” indexing approach with a “straightforward” parse-one-row-at-a-time approach for 1.4x faster low-level parsing.
- Add line number range (from - to excl) to SepReader.Row, incl. tracking line endings inside quotes, so that it is easy to find a row’s lines in notepad or similar.
- Add SepReaderOptions.DisableColCountCheck option that allows disabling the exception thrown if a row does not have the expected col count (fixes issue #10).
- BREAKING CHANGE: Rename SepReaderOptions.UseFastFloat to DisableFastFloat to follow a Disable convention for features enabled by default.
- Minor breaking change to SepWriter, which now always writes a line ending for the last row to ensure the row count is consistent for empty columns.
- Expand README.md with options and debuggability.
Even Faster Parsing
For detailed benchmarks incl. new ARM64 benchmarks on Neoverse N1 see GitHub. Below I instead try to highlight the improvements from Sep 0.1.0 to 0.2.0 and Sylvan 1.3.1 to 1.3.2. Sylvan sees an impressive 1.66x speedup and Sep 1.39x at the Row level. A win-win scenario!
Sep now hits ~10 GB/s on an AMD 5950X Zen 3 CPU when counting 2 bytes per char. At 57.3 ns/row for a row with 25 columns in the package assets benchmark, this means Sep spends just 2.3 ns, or ~11 cycles at ~5 GHz, per column for parsing only.
Method | Scope | Old version | Old MB/s | Old ns/row | New version | New MB/s | New ns/row | Speedup |
---|---|---|---|---|---|---|---|---|
Sep | Row | 0.1.0 | 7335.3 | 79.6 | 0.2.0 | 10191.6 | 57.3 | 1.39 |
Sylvan | Row | 1.3.1 | 4952.1 | 117.9 | 1.3.2 | 8223.5 | 71.0 | 1.66 |
Sep | Cols | 0.1.0 | 6119.8 | 95.4 | 0.2.0 | 7815.7 | 74.7 | 1.28 |
Sylvan | Cols | 1.3.1 | 3546.0 | 164.6 | 1.3.2 | 4573.7 | 127.6 | 1.29 |
Sep | Asset | 0.1.0 | 775.0 | 753.3 | 0.2.0 | 810.6 | 720.2 | 1.05 |
Sylvan | Asset | 1.3.1 | 609.7 | 957.5 | 1.3.2 | 648.3 | 900.5 | 1.06 |
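The per-column figure and the Row speedup quoted above are simple arithmetic on the numbers in the table. As a quick sanity check, here is a minimal sketch that recomputes them. The 5 GHz clock is the approximate value mentioned in the text, not a measured one, and the variable names are only for illustration.

```csharp
using System;

// Figures taken from the Row scope lines of the table above.
const double oldNsPerRow = 79.6;     // Sep 0.1.0
const double newNsPerRow = 57.3;     // Sep 0.2.0
const int colsPerRow = 25;           // columns per row in the package assets benchmark
const double assumedClockGHz = 5.0;  // assumption: approximate clock, cycles per ns

double nsPerCol = newNsPerRow / colsPerRow;        // time spent per column
double cyclesPerCol = nsPerCol * assumedClockGHz;  // ns * (cycles / ns)
double speedup = oldNsPerRow / newNsPerRow;        // old time / new time

// Invariant formatting keeps '.' as the decimal separator on any machine.
Console.WriteLine(FormattableString.Invariant(
    $"{nsPerCol:F2} ns/col, {cyclesPerCol:F1} cycles/col, {speedup:F2}x speedup"));
```

This lands at roughly 2.3 ns and 11 to 12 cycles per column, and the 1.39x speedup in the table is simply the ratio of the old and new ns/row.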
SepReader Debuggability (Row Line Number Range)
Debuggability is an important part of any library, and while this is still a work in progress for Sep, SepReader does have a unique feature when looking at its row in a debug context. Given the example code below:
    using System.Diagnostics;       // Debugger
    using nietras.SeparatedValues;  // Sep

    var text = """
               Key;Value
               A;"1
               2
               3"
               B;"Apple
               Banana
               Orange
               Pear"
               """;
    using var reader = Sep.Reader().FromText(text);
    foreach (var row in reader)
    {
        // Hover over row when breaking here
        if (Debugger.IsAttached && row.RowIndex == 2) { Debugger.Break(); }
    }
and you are hovering over row when the break is triggered, then this will show something like:
2:[5..9] = 'B;"Apple\r\nBanana\r\nOrange\r\nPear"'
This has the format shown below.
<ROWINDEX>:[<LINENUMBERRANGE>] = '<ROW>'
Note how this shows the line number range as [FromIncl..ToExcl], as in a C# range expression, so that one can easily find the row in question in notepad or similar. This means Sep has to track line endings inside quotes, and it is an example of a feature that makes Sep a bit slower but whose price is considered worth paying.
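To make the idea of tracking line endings inside quotes concrete, here is a minimal scalar sketch. It is not Sep's implementation (which is vectorized and far more involved). `RowLineRanges` is a hypothetical helper written only for this illustration, and it assumes `"` as the quote character with `\n`, `\r` or `\r\n` as line endings.

```csharp
using System;
using System.Collections.Generic;

// Same content as the example above, with explicit \r\n line endings.
var text = "Key;Value\r\nA;\"1\r\n2\r\n3\"\r\nB;\"Apple\r\nBanana\r\nOrange\r\nPear\"";
var ranges = RowLineRanges(text);
for (int rowIndex = 0; rowIndex < ranges.Count; rowIndex++)
{
    var (from, toExcl) = ranges[rowIndex];
    Console.WriteLine($"{rowIndex}:[{from}..{toExcl}]");
}

// Hypothetical helper, not part of Sep: returns the 1-based line number range
// [From..ToExcl) of every row. Line endings inside quotes advance the line
// counter but do not end the row.
static List<(int From, int ToExcl)> RowLineRanges(string text)
{
    var result = new List<(int From, int ToExcl)>();
    int line = 1;          // current 1-based line number
    int rowStartLine = 1;  // line on which the current row started
    bool inQuotes = false;
    bool rowHasChars = false;
    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        if (c == '"')
        {
            // An escaped quote ("") toggles twice, leaving the state unchanged.
            inQuotes = !inQuotes;
            rowHasChars = true;
        }
        else if (c == '\r' || c == '\n')
        {
            // Treat "\r\n" as a single line ending.
            if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n') { i++; }
            line++;
            if (!inQuotes)
            {
                result.Add((rowStartLine, line));
                rowStartLine = line;
                rowHasChars = false;
            }
        }
        else
        {
            rowHasChars = true;
        }
    }
    // Last row when the text does not end with a line ending.
    if (rowHasChars) { result.Add((rowStartLine, line + 1)); }
    return result;
}
```

For the example text this yields the range [5..9] for the row with index 2, matching the debugger display shown earlier. The extra cost is the per-character quote state and line counter, which hints at why the feature is not entirely free.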