Sep 0.2.0 - Even Faster Parsing (~10 GB/s on Zen 3) and More

Shortly after Introducing Sep - Possibly the World’s Fastest .NET CSV Parser, Mark Pflug, author of Sylvan.Data.Csv, adopted the PackUnsignedSaturate trick from Sep and added a Vector256/Avx2 path in version 1.3.2 for impressive improvements. This meant Sylvan was suddenly slightly faster than Sep at the very low-level parsing of rows only. This I, of course, had to address, and Sep 0.2.0 is the result. 😅

As before, the Sep README.md contains the latest description and benchmarks for the library. In this blog post I will quickly point out the changes made and go into the performance improvements.

Release Notes

  • Replace the “char + position” indexing approach with a “straightforward” parse-one-row-at-a-time approach for 1.4x faster low-level parsing.
  • Add line number range (from incl. to excl.) to SepReader.Row, including tracking line endings inside quotes, so it is easy to find a row’s lines in Notepad or similar.
  • Add SepReaderOptions.DisableColCountCheck option that disables the exception thrown if a row does not have the expected column count (fixes issue #10).
  • BREAKING CHANGE: Rename SepReaderOptions.UseFastFloat to DisableFastFloat to follow a Disable convention for features enabled by default.
  • Minor breaking change to SepWriter, which now always writes a line ending for the last row to ensure the row count is consistent for empty columns.
  • Expand README.md with options and debuggability.
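
As a rough illustration of the new DisableColCountCheck option, the sketch below reads a text where the last row is one column short. It assumes the Sep NuGet package is referenced, that its namespace is nietras.SeparatedValues, that options are configured via Sep.Reader(o => o with { ... }), and that SepReader.Row exposes ColCount. Check these against the README for the version you use.

```csharp
using System;
using System.Collections.Generic;
using nietras.SeparatedValues; // assumed namespace of the Sep package

// Illustrative input: the last row has 2 columns, the header has 3.
var text = "A;B;C\n1;2;3\n4;5";

// Assumption: options are set with a `with` expression on the defaults.
using var reader = Sep.Reader(o => o with { DisableColCountCheck = true })
                      .FromText(text);

var colCounts = new List<int>();
foreach (var row in reader)
{
    // With the check disabled, the short row should not throw here.
    colCounts.Add(row.ColCount);
    Console.WriteLine($"Row {row.RowIndex}: {row.ColCount} cols");
}
```

With the check left enabled (the default), the release note above implies that reading the short row throws instead.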

Even Faster Parsing

For detailed benchmarks, including new ARM64 benchmarks on Neoverse N1, see GitHub. Below I instead try to highlight the improvements from Sep 0.1.0 to 0.2.0 and Sylvan 1.3.1 to 1.3.2. Sylvan sees an impressive 1.66x speedup and Sep 1.39x at the Row level. A win-win scenario!

Sep now hits ~10 GB/s on an AMD 5950X Zen 3 CPU when counting 2 bytes per char. At 57.3 ns/row, for rows with 25 columns in the package assets benchmark, this means Sep spends just 2.3 ns or ~11 cycles (at ~5 GHz) per column for parsing only.
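
The per-column figures follow from simple arithmetic, sketched below. The 5.0 GHz clock is the approximate value quoted above, not a measured one, and the variable names are only for illustration.

```csharp
using System;

// Figures quoted above for Sep 0.2.0 at Row scope.
const double nsPerRow = 57.3;
const int colsPerRow = 25;
const double assumedGHz = 5.0; // approximate clock, i.e. cycles per ns

var nsPerCol = nsPerRow / colsPerRow;     // 57.3 / 25 = 2.292 ns
var cyclesPerCol = nsPerCol * assumedGHz; // 2.292 * 5 = 11.46 cycles

Console.WriteLine($"{nsPerCol:F1} ns/col, ~{cyclesPerCol:F0} cycles/col");
```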

Method  Scope  Old version  MB/s    ns/row  New version  MB/s     ns/row  Speedup
Sep     Row    0.1.0        7335.3  79.6    0.2.0        10191.6  57.3    1.39
Sylvan  Row    1.3.1        4952.1  117.9   1.3.2        8223.5   71.0    1.66
Sep     Cols   0.1.0        6119.8  95.4    0.2.0        7815.7   74.7    1.28
Sylvan  Cols   1.3.1        3546.0  164.6   1.3.2        4573.7   127.6   1.29
Sep     Asset  0.1.0        775.0   753.3   0.2.0        810.6    720.2   1.05
Sylvan  Asset  1.3.1        609.7   957.5   1.3.2        648.3    900.5   1.06
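
The Speedup column appears to be the ratio of new to old throughput, rounded to two decimals. The sketch below, with a hypothetical Speedup helper, checks that reading against the Row-scope numbers in the table.

```csharp
using System;

// Hypothetical helper: speedup as new MB/s divided by old MB/s.
static double Speedup(double oldMBps, double newMBps) =>
    Math.Round(newMBps / oldMBps, 2);

var sepRow = Speedup(7335.3, 10191.6);   // Sep 0.1.0 -> 0.2.0
var sylvanRow = Speedup(4952.1, 8223.5); // Sylvan 1.3.1 -> 1.3.2

Console.WriteLine($"Sep Row: {sepRow}x, Sylvan Row: {sylvanRow}x");
```

The same ratios come out of the ns/row columns (old divided by new), since both describe the same measurements.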

SepReader Debuggability (Row Line Number Range)

Debuggability is an important part of any library, and while this is still a work in progress for Sep, SepReader does have a unique feature when looking at its row in a debug context. Given the below example code:

using System.Diagnostics;
using nietras.SeparatedValues;

var text = """
           Key;Value
           A;"1
           2
           3"
           B;"Apple
           Banana
           Orange
           Pear"
           """;
using var reader = Sep.Reader().FromText(text);
foreach (var row in reader)
{
    // Hover over row when breaking here
    if (Debugger.IsAttached && row.RowIndex == 2) { Debugger.Break(); }
}

and you are hovering over row when the break is triggered then this will show something like:

  2:[5..9] = 'B;"Apple\r\nBanana\r\nOrange\r\nPear"'

This has the format shown below.

<ROWINDEX>:[<LINENUMBERRANGE>] = '<ROW>'

Note how this shows the line number range [FromIncl..ToExcl], as in a C# range expression, so that one can easily find the row in question in Notepad or similar. This means Sep has to track line endings inside quotes, and it is an example of a feature that makes Sep a bit slower but which is a price considered worth paying.
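
To make the idea concrete, here is a minimal scalar sketch of computing such line number ranges. It is not Sep's actual implementation, which is vectorized. RowLineRanges is a hypothetical helper that counts every line ending, but only ends a row on a line ending outside quotes.

```csharp
using System;
using System.Collections.Generic;

// Same content as the example above, with explicit \n line endings.
var csv = "Key;Value\nA;\"1\n2\n3\"\nB;\"Apple\nBanana\nOrange\nPear\"";
var ranges = RowLineRanges(csv);
Console.WriteLine(ranges[2]); // (5, 9), matching 2:[5..9] above

// Hypothetical helper: 1-based [FromIncl..ToExcl) line range per row.
static List<(int FromIncl, int ToExcl)> RowLineRanges(string text)
{
    var result = new List<(int FromIncl, int ToExcl)>();
    var line = 1;         // current 1-based line number
    var rowStartLine = 1; // line on which the current row began
    var inQuotes = false;
    var rowHasChars = false;
    for (var i = 0; i < text.Length; i++)
    {
        var c = text[i];
        if (c == '\r' || c == '\n')
        {
            // Treat "\r\n" as a single line ending.
            if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n') { i++; }
            line++;
            if (!inQuotes)
            {
                // A line ending outside quotes terminates the row.
                result.Add((rowStartLine, line));
                rowStartLine = line;
                rowHasChars = false;
            }
        }
        else
        {
            // Escaped quotes ("") toggle twice, leaving the state unchanged.
            if (c == '"') { inQuotes = !inQuotes; }
            rowHasChars = true;
        }
    }
    // Last row without a trailing line ending.
    if (rowHasChars) { result.Add((rowStartLine, line + 1)); }
    return result;
}
```

The extra work is the line counter that has to keep running inside quoted columns, which is the bookkeeping a parser could otherwise skip.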

2023.08.07