Sep 0.3.0 - Unescape Support (still the Most Efficient .NET CSV Parser)

Alright, alright, alright! This blog post announces the release of Sep 0.3.0 with unescape support. Sooner than I had imagined given we don’t really need it, and primarily as a consequence of pride and vanity. Not the least given that in Joel Verhagen’s updated blog post, The fastest CSV parser in .NET, there is a big fat caveat attached to Sep:

Fastest Csv Parseer In Dotnet

Well, now Sep can unescape if you ask it to with the Unescape option. Here I will focus on the new unescape support, but other releases after 0.2.3 contain a number of changes including some minor breaking changes to make the API more consistent with regards to naming. See Releases for notes on those changes. Notably the README now contains the full public API definition courtesy of the PublicApiGenerator library at the bottom.

Unescaping

To enable unescaping all you have to do is write something like:

1
2
3
using var reader = Sep
  .Reader(o => o with { Unescape = true })
  .FromText(text);

And Sep will then unquote and unescape quotes in columns upon access.

The algorithm employed by Sep is fairly simple. If a column starts with a quote then the two outermost quotes are removed and every second inner quote is removed. Note that unquote/unescape happens in-place, which means the SepReader.Row.Span will be modified and contain “garbage” state after unescaped cols before next col. This is for efficiency to avoid allocating secondary memory for unescaped columns. Header columns/names will also be unescaped.

This may not suite everyones needs. This mainly relates to how invalidly escaped columns are handled, so lets take a look at a comparison of unescaped output from CsvHelper, Sylvan and Sep as is also documented in the Sep README.

The below table tries to summarize the behavior of Sep vs CsvHelper and Sylvan. Note that all do the same for valid input. There are differences for how invalid input is handled. For Sep the design choice has been based on not wanting to throw exceptions and to use a principle that is both reasonably fast and simple.

Input Valid CsvHelper CsvHelper¹ Sylvan Sep²
a True a a a a
"" True        
"""" True " " " "
"""""" True "" "" "" ""
"a" True a a a a
"a""a" True a"a a"a a"a a"a
"a""a""a" True a"a"a a"a"a a"a"a a"a"a
a""a False EXCEPTION a""a a""a a""a
a"a"a False EXCEPTION a"a"a a"a"a a"a"a
·""· False EXCEPTION ·""· ·""· ·""·
·"a"· False EXCEPTION ·"a"· ·"a"· ·"a"·
·"" False EXCEPTION ·"" ·"" ·""
·"a" False EXCEPTION ·"a" ·"a" ·"a"
a"""a False EXCEPTION a"""a a"""a a"""a
"a"a"a" False EXCEPTION aa"a" a"a"a aa"a
""· False EXCEPTION · " ·
"a"· False EXCEPTION a"
"a"""a False EXCEPTION aa EXCEPTION a"a
"a"""a" False EXCEPTION aa" a"a<NULL> a"a"
""a" False EXCEPTION a" "a a"
"a"a" False EXCEPTION aa" a"a aa"
""a"a"" False EXCEPTION a"a"" "a"a" a"a"
""" False     EXCEPTION "
""""" False " " EXCEPTION ""

· (middle dot) is whitespace to make this visible

¹ CsvHelper with BadDataFound = null

² Sep with Unescape = true in SepReaderOptions

If the Sep output does not suite your needs consider using one of the other great CSV libraries out there, or use the Col.Span property and do your own unescaping.

Now how is perf then if unescaping is enabled? It’s still great, but of course there is a cost. For all the gory details and performance and platforms check the README, here I will focus on performance on ARM since that is often overlooked even in this day and age.

Neoverse.N1 - PackageAssets with Quotes Benchmark Results (Sep 0.3.0.0, Sylvan 1.3.5.0, CsvHelper 30.0.1.0)
Method Scope Rows Mean Ratio MB MB/s ns/row Allocated Alloc Ratio
Sep____ Row 50000 24.35 ms 1.00 33 1366.9 487.0 941 B 1.00
Sep_Unescape Row 50000 24.21 ms 0.99 33 1374.7 484.2 941 B 1.00
Sylvan___ Row 50000 39.92 ms 1.64 33 833.6 798.5 6288 B 6.68
ReadLine_ Row 50000 45.96 ms 1.89 33 724.1 919.3 111389508 B 118,373.55
CsvHelper Row 50000 106.99 ms 4.40 33 311.1 2139.8 21272 B 22.61
                   
Sep____ Cols 50000 26.96 ms 1.00 33 1234.3 539.3 869 B 1.00
Sep_Unescape Cols 50000 28.60 ms 1.06 33 1163.6 572.1 953 B 1.10
Sylvan___ Cols 50000 45.48 ms 1.69 33 731.8 909.6 6318 B 7.27
ReadLine_ Cols 50000 45.95 ms 1.70 33 724.4 918.9 111389521 B 128,181.27
CsvHelper Cols 50000 152.46 ms 5.65 33 218.3 3049.1 457328 B 526.27
                   
Sep____ Asset 50000 83.86 ms 1.00 33 396.9 1677.2 14139912 B 1.00
Sep_Unescape Asset 50000 79.99 ms 0.96 33 416.1 1599.9 14132628 B 1.00
Sylvan___ Asset 50000 110.29 ms 1.31 33 301.8 2205.9 14298320 B 1.01
ReadLine_ Asset 50000 178.89 ms 2.13 33 186.1 3577.7 125240480 B 8.86
CsvHelper Asset 50000 182.59 ms 2.18 33 182.3 3651.7 14308800 B 1.01

As can be seen above Sep is ~1.6 faster than Sylvan (the second most efficient CSV parser with many features Sep does not have) at the lowest level of parsing rows or just cols. For that case the impact of unescaping is a minor 6%, but once you then get to the top level test of parsing package assets information this speeds up the benchmark since removing quotes reduces hashing costs and therefore pooling cost employed in this benchmark. Note that Sep is 4-5x faster than CsvHelper. The read line approach does not unescape so is only there as a form of basic baseline. I hope this addresses that big fat caveat.

Efficiency vs Parallization

There is one big “issue” still, though, and that is that Sep isn’t #1 in the above ranking, since there now is a multi-threaded contender, RecordParser! Nice! I always prefer improving efficiency first before going to multi-threading, but parallizing CSV “parsing”, with parsing in quotes since it’s more related to the parsing of text after CSV parsing, for real work loads is a real need that we have at my company too, and it is relevant for machine learning where you can have very big CSV files with lots of floats.

I have tried reproducing the results on my own machine, but in my testing Sep still outperforms (significantly) RecordParser on a 16c/32t where Sep uses just 1 thread as shown below:

Sep Vs Recordparser

where RecPrs_MT is RecordParser in parallized mode. Of course, my benchmark uses fewer rows and is on a different machine. As always do your own benchmarks for your own specific scenario.

I have been working on parallization with a focus on float parsing for Sep, but it wasn’t as efficient as I wanted, so I tabled in. Given the competition I will revisit that and hopefully address that in the next Sep post. Until then I hope you are enjoying the new .NET 8 release, as well 🚀

2023.11.27