Sep 0.6.0 - CSV Trim Support, .NET 9 and New Benchmarks incl. Apple M1
It’s been a while since the last update on Sep, but recently 0.6.0 was released with the following notable changes:
- 🐛 Bug fix to
SepWriter
when selecting multiple columns by indices - ✂️
SepReader
trim support - 🤖 .NET 9 ready
- 🚀 Updated and new benchmarks incl. Apple M1
See v0.6.0 release for all changes and Sep README on GitHub for full details. Below I’ll go over the notable changes briefly. Keep reading for perf numbers!
Bug Fix to SepWriter
Yet another reminder that great code coverage (Sep has ~100% now) does not
preclude any bugs… a functioning brain should. However, sometimes a usually
reasonably well functioning brain has a day off and in that case wrote a test
wrong which meant a bug snuck into SepWriter.Row
when selecting multiple
columns by indices e.g. for:
1
var cols = row[3, 2];
one would expect columns at indices 3 and 2 to be selected, but instead indices
0 and 1 were selected. A simple mistake fixed by using colIndices[i]
instead
of just i
. I would guess no one have actually used this, or hope. At least
there’s been no such usage at work.
Trim Support
Sep now supports trimming by the
SepTrim
flags
enum, which has two options as documented in the code. To enable trimming set
the option like below:
1
2
using var reader = Sep.Reader(o =>
o with { Trim = SepTrim.All }).FromText(text);
Below the result of both trimming and unescaping is shown in comparison to CsvHelper. Note unescaping is enabled for all results shown. It is possible to trim without unescaping too, of course.
As can be seen Sep supports a simple principle of trimming before and after unescaping with trimming before unescaping being important for unescaping if there is a starting quote after spaces.
Input | CsvHelper Trim | CsvHelper InsideQuotes | CsvHelper All¹ | Sep Outer | Sep AfterUnescape | Sep All² |
---|---|---|---|---|---|---|
a |
a |
a |
a |
a |
a |
a |
·a |
a |
·a |
a |
a |
a |
a |
a· |
a |
a· |
a |
a |
a |
a |
·a· |
a |
·a· |
a |
a |
a |
a |
·a·a· |
a·a |
·a·a· |
a·a |
a·a |
a·a |
a·a |
"a" |
a |
a |
a |
a |
a |
a |
"·a" |
·a |
a |
a |
·a |
a |
a |
"a·" |
a· |
a |
a |
a· |
a |
a |
"·a·" |
·a· |
a |
a |
·a· |
a |
a |
"·a·a·" |
·a·a· |
a·a |
a·a |
·a·a· |
a·a |
a·a |
·"a"· |
a |
·"a"· |
a |
a |
"a" |
a |
·"·a"· |
·a |
·"·a"· |
a |
·a |
"·a" |
a |
·"a·"· |
a· |
·"a·"· |
a |
a· |
"a·" |
a |
·"·a·"· |
·a· |
·"·a·"· |
a |
·a· |
"·a·" |
a |
·"·a·a·"· |
·a·a· |
·"·a·a·"· |
a·a |
·a·a· |
"·a·a·" |
a·a |
·
(middle dot) is whitespace to make this visible
¹ CsvHelper with TrimOptions.Trim | TrimOptions.InsideQuotes
² Sep with SepTrim.All = SepTrim.Outer | SepTrim.AfterUnescape
in
SepReaderOptions
Trimming has a cost, of course, benchmarks below show this for AMD Ryzen 9
5950X
when accessing the column Span
for the package assets benchmark where
most (but not all) columns have been prefixed and suffixed with ·"·
, which
then needs to be removed. For details on benchmarks see Sep on GitHub. As the
numbers show Sep is about 11.20 / 1.44 = 7.78x to 11.07 / 1.69 = 6.54x
faster than CsvHelper for this scenario. Sylvan does not appear to support
trimming.
1
2
3
4
5
6
7
8
| Method | Scope | Rows | Mean | Ratio | MB | MB/s | ns/row | Allocated | Alloc Ratio |
|--------------------------- |------ |------ |----------:|------:|---:|-------:|-------:|----------:|------------:|
| Sep_ | Cols | 50000 | 8.599 ms | 1.00 | 41 | 4857.5 | 172.0 | 1.04 KB | 1.00 |
| Sep_Trim | Cols | 50000 | 12.402 ms | 1.44 | 41 | 3368.0 | 248.0 | 1.05 KB | 1.01 |
| Sep_TrimUnescape | Cols | 50000 | 13.201 ms | 1.54 | 41 | 3164.1 | 264.0 | 1.06 KB | 1.02 |
| Sep_TrimUnescapeTrim | Cols | 50000 | 14.568 ms | 1.69 | 41 | 2867.3 | 291.4 | 1.07 KB | 1.02 |
| CsvHelper_TrimUnescape | Cols | 50000 | 96.272 ms | 11.20 | 41 | 433.9 | 1925.4 | 451.52 KB | 432.51 |
| CsvHelper_TrimUnescapeTrim | Cols | 50000 | 95.183 ms | 11.07 | 41 | 438.8 | 1903.7 | 445.86 KB | 427.09 |
These benchmarks were run using .NET 9, but there is no big difference to .NET 8, which brings us to Sep and .NET 9.
.NET 9 Ready
Sep now uses the latest .NET 9 SDK for development and besides still targeting
net7.0
and net8.0
, it now also targets net9.0
. The main reason being to
add allows ref struct
annotations and support params ReadOnlySpan<>
.
The former means SepReader
now for .NET 9+ actually implements
IEnumerable<SepReader.Row>
. However, this isn’t particularly useful yet since
.NET apparently hasn’t updated LINQ extension methods to have allows ref
struct
for TSource
in any such methods. Someday perhaps. 🤷
The latter means for most indexing operations params
is supported and one can
write e.g. row[1, 2, 3]
or row["A", "B", "C"]
. Straight-forward and without
allocation since the params
is ReadOnlySpan<int>
or ReadOnlySpan<string>
and backed by stack allocated storage.
For more details on the API and usage, please refer to the README.
Updated and New Benchmarks
As part of updating for .NET 9 all benchmarks have been re-run and a new set of
benchmark “machines” has been used as listed below. (Virtual)
below means this
machine is actually a GitHub CI agent machine, hence, it is subject to noisy
neighbors and is only a subset of the cores of any full the CPU.
AMD EPYC 7763
(Virtual) X64 Platform Information1 2
OS=Ubuntu 22.04.5 LTS (Jammy Jellyfish) AMD EPYC 7763, 1 CPU, 4 logical and 2 physical cores
AMD Ryzen 7 PRO 7840U
(Laptop on battery) X64 Platform Information1 2 3
OS=Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3) AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
AMD 5950X
(Desktop) X64 Platform Information1 2
OS=Windows 10 (10.0.19044.2846/21H2/November2021Update) AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
Apple M1
(Virtual) ARM64 Platform Information1 2
OS=macOS Sonoma 14.7.1 (23H222) [Darwin 23.6.0] Apple M1 (Virtual), 1 CPU, 3 logical and 3 physical cores
This means the previous Neoverse M1 ARM64 processor benchmarks have been replaced with the Apple M1 processor. Results for this on the floats benchmark show Sep is ~3x-7x faster than others from low level to top level, as seen below. At the lowest level of just parsing the CSV e.g. the rows, Sep hits ~5 GB/s vs around 1 GB/s for others.
Apple M1 Floats Benchmarks
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
| Method | Scope | Rows | Mean | Ratio | MB | MB/s | ns/row | Allocated | Alloc Ratio |
|---------- |------- |------ |-----------:|------:|---:|-------:|-------:|------------:|------------:|
| Sep______ | Row | 25000 | 3.887 ms | 1.00 | 20 | 5215.5 | 155.5 | 1.2 KB | 1.00 |
| Sylvan___ | Row | 25000 | 17.956 ms | 4.62 | 20 | 1129.0 | 718.2 | 10.62 KB | 8.87 |
| ReadLine_ | Row | 25000 | 14.074 ms | 3.62 | 20 | 1440.4 | 563.0 | 73489.65 KB | 61,381.24 |
| CsvHelper | Row | 25000 | 27.741 ms | 7.14 | 20 | 730.8 | 1109.6 | 20.28 KB | 16.94 |
| | | | | | | | | | |
| Sep______ | Cols | 25000 | 4.726 ms | 1.00 | 20 | 4289.0 | 189.1 | 1.2 KB | 1.00 |
| Sylvan___ | Cols | 25000 | 20.241 ms | 4.28 | 20 | 1001.5 | 809.6 | 10.62 KB | 8.84 |
| ReadLine_ | Cols | 25000 | 14.976 ms | 3.17 | 20 | 1353.6 | 599.0 | 73489.65 KB | 61,181.63 |
| CsvHelper | Cols | 25000 | 29.842 ms | 6.31 | 20 | 679.3 | 1193.7 | 21340.5 KB | 17,766.40 |
| | | | | | | | | | |
| Sep______ | Floats | 25000 | 24.511 ms | 1.00 | 20 | 827.1 | 980.4 | 8.34 KB | 1.00 |
| Sep_MT___ | Floats | 25000 | 9.422 ms | 0.38 | 20 | 2151.5 | 376.9 | 79.89 KB | 9.58 |
| Sylvan___ | Floats | 25000 | 69.902 ms | 2.85 | 20 | 290.0 | 2796.1 | 18.57 KB | 2.23 |
| ReadLine_ | Floats | 25000 | 79.015 ms | 3.22 | 20 | 256.6 | 3160.6 | 73493.2 KB | 8,816.43 |
| CsvHelper | Floats | 25000 | 104.811 ms | 4.28 | 20 | 193.4 | 4192.4 | 22063.34 KB | 2,646.77 |
.NET 9 Performance and DATAS Issue
If you are wondering whether .NET 9 provides any significant performance improvements over .NET 8 for Sep, then no. Some minor within 5-10% improvements have been observed, but also minor regressions. This is expected as Sep has been thoroughly optimized to get as good as possible machine code as possible, so while JIT improvements can improve this, there is not much left on the table.
However, as detailed in .NET 8.0.10 vs 9.0.0 RC2 GC Server Performance Regression in Sep (CSV Parser) Benchmark (due to DATAS default) the GC has switched to enable DATAS (Dynamically Adapting To Application Sizes) by default when using Server GC in .NET 9.
Generally, this means the GC is more aggressive with regards to running garbage collections. However, for bursty workloads like Sep’s CSV parsing and the package assets benchmark where 1 million instances of a parsed package asset rows is accumulated, this can hurt performance by up to 1.7x (for parallel enumeration) in my testing. Hence, for Sep benchmarks using GC Server mode, DATAS has been disabled, and I would recommend doing the same if you have lots of RAM, your own machine and are running machine learning pipelines like we are.
That’s all!