Sep 0.6.0 - CSV Trim Support, .NET 9 and New Benchmarks incl. Apple M1

It’s been a while since the last update on Sep, but recently 0.6.0 was released with the following notable changes:

  • 🐛 Bug fix to SepWriter when selecting multiple columns by indices
  • ✂️ SepReader trim support
  • 🤖 .NET 9 ready
  • 🚀 Updated and new benchmarks incl. Apple M1

See v0.6.0 release for all changes and Sep README on GitHub for full details. Below I’ll go over the notable changes briefly. Keep reading for perf numbers!

Bug Fix to SepWriter

Yet another reminder that great code coverage (Sep has ~100% now) does not preclude any bugs… a functioning brain should. However, sometimes a usually reasonably well functioning brain has a day off and in that case wrote a test wrong which meant a bug snuck into SepWriter.Row when selecting multiple columns by indices e.g. for:

1
var cols = row[3, 2];

one would expect columns at indices 3 and 2 to be selected, but instead indices 0 and 1 were selected. A simple mistake fixed by using colIndices[i] instead of just i. I would guess no one have actually used this, or hope. At least there’s been no such usage at work.

Trim Support

Sep now supports trimming by the SepTrim flags enum, which has two options as documented in the code. To enable trimming set the option like below:

1
2
using var reader = Sep.Reader(o => 
    o with { Trim = SepTrim.All }).FromText(text);

Below the result of both trimming and unescaping is shown in comparison to CsvHelper. Note unescaping is enabled for all results shown. It is possible to trim without unescaping too, of course.

As can be seen Sep supports a simple principle of trimming before and after unescaping with trimming before unescaping being important for unescaping if there is a starting quote after spaces.

Input CsvHelper Trim CsvHelper InsideQuotes CsvHelper All¹ Sep Outer Sep AfterUnescape Sep All²
a a a a a a a
·a a ·a a a a a
a a a a a
·a· a ·a· a a a a
·a·a· a·a ·a·a· a·a a·a a·a a·a
"a" a a a a a a
"·a" ·a a a ·a a a
"a·" a a a a
"·a·" ·a· a a ·a· a a
"·a·a·" ·a·a· a·a a·a ·a·a· a·a a·a
·"a"· a ·"a"· a a "a" a
·"·a"· ·a ·"·a"· a ·a "·a" a
·"a·"· ·"a·"· a "a·" a
·"·a·"· ·a· ·"·a·"· a ·a· "·a·" a
·"·a·a·"· ·a·a· ·"·a·a·"· a·a ·a·a· "·a·a·" a·a

· (middle dot) is whitespace to make this visible

¹ CsvHelper with TrimOptions.Trim | TrimOptions.InsideQuotes

² Sep with SepTrim.All = SepTrim.Outer | SepTrim.AfterUnescape in SepReaderOptions

Trimming has a cost, of course, benchmarks below show this for AMD Ryzen 9 5950X when accessing the column Span for the package assets benchmark where most (but not all) columns have been prefixed and suffixed with ·"·, which then needs to be removed. For details on benchmarks see Sep on GitHub. As the numbers show Sep is about 11.20 / 1.44 = 7.78x to 11.07 / 1.69 = 6.54x faster than CsvHelper for this scenario. Sylvan does not appear to support trimming.

1
2
3
4
5
6
7
8
| Method                     | Scope | Rows  | Mean      | Ratio | MB | MB/s   | ns/row | Allocated | Alloc Ratio |
|--------------------------- |------ |------ |----------:|------:|---:|-------:|-------:|----------:|------------:|
| Sep_                       | Cols  | 50000 |  8.599 ms |  1.00 | 41 | 4857.5 |  172.0 |   1.04 KB |        1.00 |
| Sep_Trim                   | Cols  | 50000 | 12.402 ms |  1.44 | 41 | 3368.0 |  248.0 |   1.05 KB |        1.01 |
| Sep_TrimUnescape           | Cols  | 50000 | 13.201 ms |  1.54 | 41 | 3164.1 |  264.0 |   1.06 KB |        1.02 |
| Sep_TrimUnescapeTrim       | Cols  | 50000 | 14.568 ms |  1.69 | 41 | 2867.3 |  291.4 |   1.07 KB |        1.02 |
| CsvHelper_TrimUnescape     | Cols  | 50000 | 96.272 ms | 11.20 | 41 |  433.9 | 1925.4 | 451.52 KB |      432.51 |
| CsvHelper_TrimUnescapeTrim | Cols  | 50000 | 95.183 ms | 11.07 | 41 |  438.8 | 1903.7 | 445.86 KB |      427.09 |

These benchmarks were run using .NET 9, but there is no big difference to .NET 8, which brings us to Sep and .NET 9.

.NET 9 Ready

Sep now uses the latest .NET 9 SDK for development and besides still targeting net7.0 and net8.0, it now also targets net9.0. The main reason being to add allows ref struct annotations and support params ReadOnlySpan<>.

The former means SepReader now for .NET 9+ actually implements IEnumerable<SepReader.Row>. However, this isn’t particularly useful yet since .NET apparently hasn’t updated LINQ extension methods to have allows ref struct for TSource in any such methods. Someday perhaps. 🤷

The latter means for most indexing operations params is supported and one can write e.g. row[1, 2, 3] or row["A", "B", "C"]. Straight-forward and without allocation since the params is ReadOnlySpan<int> or ReadOnlySpan<string> and backed by stack allocated storage.

For more details on the API and usage, please refer to the README.

Updated and New Benchmarks

As part of updating for .NET 9 all benchmarks have been re-run and a new set of benchmark “machines” has been used as listed below. (Virtual) below means this machine is actually a GitHub CI agent machine, hence, it is subject to noisy neighbors and is only a subset of the cores of any full the CPU.

  • AMD EPYC 7763 (Virtual) X64 Platform Information
    1
    2
    
    OS=Ubuntu 22.04.5 LTS (Jammy Jellyfish)
    AMD EPYC 7763, 1 CPU, 4 logical and 2 physical cores
    
  • AMD Ryzen 7 PRO 7840U (Laptop on battery) X64 Platform Information
    1
    2
    3
    
    OS=Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
    AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics, 
    1 CPU, 16 logical and 8 physical cores
    
  • AMD 5950X (Desktop) X64 Platform Information
    1
    2
    
    OS=Windows 10 (10.0.19044.2846/21H2/November2021Update)
    AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
    
  • Apple M1 (Virtual) ARM64 Platform Information
    1
    2
    
    OS=macOS Sonoma 14.7.1 (23H222) [Darwin 23.6.0]
    Apple M1 (Virtual), 1 CPU, 3 logical and 3 physical cores
    

This means the previous Neoverse M1 ARM64 processor benchmarks have been replaced with the Apple M1 processor. Results for this on the floats benchmark show Sep is ~3x-7x faster than others from low level to top level, as seen below. At the lowest level of just parsing the CSV e.g. the rows, Sep hits ~5 GB/s vs around 1 GB/s for others.

Apple M1 Floats Benchmarks

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
| Method    | Scope  | Rows  | Mean       | Ratio | MB | MB/s   | ns/row | Allocated   | Alloc Ratio |
|---------- |------- |------ |-----------:|------:|---:|-------:|-------:|------------:|------------:|
| Sep______ | Row    | 25000 |   3.887 ms |  1.00 | 20 | 5215.5 |  155.5 |      1.2 KB |        1.00 |
| Sylvan___ | Row    | 25000 |  17.956 ms |  4.62 | 20 | 1129.0 |  718.2 |    10.62 KB |        8.87 |
| ReadLine_ | Row    | 25000 |  14.074 ms |  3.62 | 20 | 1440.4 |  563.0 | 73489.65 KB |   61,381.24 |
| CsvHelper | Row    | 25000 |  27.741 ms |  7.14 | 20 |  730.8 | 1109.6 |    20.28 KB |       16.94 |
|           |        |       |            |       |    |        |        |             |             |
| Sep______ | Cols   | 25000 |   4.726 ms |  1.00 | 20 | 4289.0 |  189.1 |      1.2 KB |        1.00 |
| Sylvan___ | Cols   | 25000 |  20.241 ms |  4.28 | 20 | 1001.5 |  809.6 |    10.62 KB |        8.84 |
| ReadLine_ | Cols   | 25000 |  14.976 ms |  3.17 | 20 | 1353.6 |  599.0 | 73489.65 KB |   61,181.63 |
| CsvHelper | Cols   | 25000 |  29.842 ms |  6.31 | 20 |  679.3 | 1193.7 |  21340.5 KB |   17,766.40 |
|           |        |       |            |       |    |        |        |             |             |
| Sep______ | Floats | 25000 |  24.511 ms |  1.00 | 20 |  827.1 |  980.4 |     8.34 KB |        1.00 |
| Sep_MT___ | Floats | 25000 |   9.422 ms |  0.38 | 20 | 2151.5 |  376.9 |    79.89 KB |        9.58 |
| Sylvan___ | Floats | 25000 |  69.902 ms |  2.85 | 20 |  290.0 | 2796.1 |    18.57 KB |        2.23 |
| ReadLine_ | Floats | 25000 |  79.015 ms |  3.22 | 20 |  256.6 | 3160.6 |  73493.2 KB |    8,816.43 |
| CsvHelper | Floats | 25000 | 104.811 ms |  4.28 | 20 |  193.4 | 4192.4 | 22063.34 KB |    2,646.77 |

.NET 9 Performance and DATAS Issue

If you are wondering whether .NET 9 provides any significant performance improvements over .NET 8 for Sep, then no. Some minor within 5-10% improvements have been observed, but also minor regressions. This is expected as Sep has been thoroughly optimized to get as good as possible machine code as possible, so while JIT improvements can improve this, there is not much left on the table.

However, as detailed in .NET 8.0.10 vs 9.0.0 RC2 GC Server Performance Regression in Sep (CSV Parser) Benchmark (due to DATAS default) the GC has switched to enable DATAS (Dynamically Adapting To Application Sizes) by default when using Server GC in .NET 9.

Generally, this means the GC is more aggressive with regards to running garbage collections. However, for bursty workloads like Sep’s CSV parsing and the package assets benchmark where 1 million instances of a parsed package asset rows is accumulated, this can hurt performance by up to 1.7x (for parallel enumeration) in my testing. Hence, for Sep benchmarks using GC Server mode, DATAS has been disabled, and I would recommend doing the same if you have lots of RAM, your own machine and are running machine learning pipelines like we are.

That’s all!

2024.12.07