Sep 0.2.3 - .NET 8 and AVX-512 Preview

Since Sep 0.2.0 - Even Faster Parsing (~10 GB/s on Zen 3) and More, there have been a few minor releases of Sep. I will cover those in this post with a focus on preview .NET 8 AVX-512 support and related performance improvements.

Release Notes

Sep 0.2.1
- Reduce row count in comparison benchmarks for faster runs. Reduces memory bandwidth impact so better shows code optimization efforts.
Sep 0.2.2
- Add Reader(...)/Writer(...) extension method overloads with configure. Fixes issue How to create reader with separator and options? reported by nwiman, thanks!
Sep 0.2.3
- Add preview .NET 8 support with AVX-512 parsing and add benchmark results to git incl. Intel Xeon Silver 4316 with AVX-512 results.
- Performance improvements range from minor 1.05x-1.10x to some very impressive ~1.5x improvements for float parsing in .NET 8.

.NET 8 and AVX-512 Benchmark Results

For detailed benchmarks see GitHub and the new benchmarks directory that contains BenchmarkDotNet outputs for each hardware platform that I have access to and have benchmarked (only the 4316 has AVX-512):

AMD 5950X X64 Platform Information

OS=Windows 10 (10.0.19044.2846/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores

Intel Xeon Silver 4316 X64 Platform Information

OS=Windows 10 (10.0.17763.3287/1809/October2018Update/Redstone5)
Intel Xeon Silver 4316 CPU 2.30GHz, 1 CPU, 
40 logical and 20 physical cores

Neoverse N1 ARM64 Platform Information (cloud instance)

OS=ubuntu 22.04
Neoverse N1, ARM, 4 vCPU

And .NET runtimes:

.NET 7.0.10 (7.0.1023.36312)
.NET 8.0.0 Preview 7 (8.0.23.37506)

For this post I am going to focus on Sep results only and what improvements .NET 8 and AVX-512 appear to have. All details can be seen on GitHub incl. description of the different benchmarks.

The results below are a bit murky with some outliers that are likely just noise. However, overall I make the following observations:

On the low level scopes (Row and Cols) .NET provides a nice 1.05-1.10x improvements. Given I have been spending hours just shaving of a few percentages at the low level parsing that is quite significant.
On PackageAssets with just separators and line ends AVX-512 gives about 5% improvement. However, when there are quotes and about 3x more special characters to find there is a 20-30% improvement on top of .NET 8.
Finally and most significantly .NET 8 appears to significantly improve csFastFloat float parsing by 1.2-1.5x! 🚀 I guess this is due to JIT improvements in .NET 8 as there are no code changes. Perhaps pointing to that some code in csFastFloat could be improved on older runtimes.

PackageAssets

Hardware	Scope	.NET	MB/s	ns/row	.NET	MB/s	ns/row	Speedup
5950X	Row	7.0.10	11375.3	51.3	8.0 P7	11872.7	49.2	1.04
4316	Row	7.0.10	4905.8	119.0	8.0 P7	5363.8	108.8	1.09
N1	Row	7.0.10	2393.9	243.0	8.0 P7	2421.2	240.3	1.01

5950X	Cols	7.0.10	8769.4	66.6	8.0 P7	9153.5	63.8	1.04
4316	Cols	7.0.10	3942.7	148.0	8.0 P7	4366.8	133.7	1.11
N1	Cols	7.0.10	2002.8	290.5	8.0 P7	2042.8	284.8	1.02

5950X	Asset	7.0.10	949.6	614.6	8.0 P7	960.1	607.9	1.01
4316	Asset	7.0.10	497.1	1174.2	8.0 P7	524.2	1113.4	1.05
N1	Asset	7.0.10	414.5	1403.3	8.0 P7	429.5	1354.3	1.04

PackageAssets with Quotes

Hardware	Scope	.NET	MB/s	ns/row	.NET	MB/s	ns/row	Speedup
5950X	Row	7.0.10	4501.8	148.3	8.0 P7	4928.3	135.5	1.09
4316	Row	7.0.10	2206.5	302.5	8.0 P7	2705.0	246.8	1.23
N1	Row	7.0.10	1306.5	509.5	8.0 P7	1420.5	468.6	1.09

5950X	Cols	7.0.10	4321.6	154.5	8.0 P7	4557.6	146.5	1.05
4316	Cols	7.0.10	2112.6	316.0	8.0 P7	2140.6	311.8	ᵃ1.01
N1	Cols	7.0.10	1175.9	566.1	8.0 P7	1295.8	513.7	1.10

5950X	Asset	7.0.10	875.4	762.6	8.0 P7	965.1	691.7	1.10
4316	Asset	7.0.10	468.9	1423.8	8.0 P7	639.6	1043.7	1.36
N1	Asset	7.0.10	377.4	1764.0	8.0 P7	392.1	1697.7	1.04

ᵃ) No doubt an outlier. Haven’t had time to rerun the benchmark. Runs will have variations.

Floats

Hardware	Scope	.NET	MB/s	ns/row	.NET	MB/s	ns/row	Speedup
5950X	Row	7.0.10	10137.6	107.6	8.0 P7	10445.9	104.4	1.03
4316	Row	7.0.10	5033.3	216.7	8.0 P7	4667.7	233.6	ᵇ0.93
N1	Row	7.0.10	2359.4	461.4	8.0 P7	2339.4	465.3	0.99

5950X	Cols	7.0.10	8804.5	123.9	8.0 P7	8912.6	122.4	1.01
4316	Cols	7.0.10	3939.4	276.8	8.0 P7	4078.0	267.4	1.04
N1	Cols	7.0.10	2035.0	535.0	8.0 P7	1980.9	549.6	0.97

5950X	Floats	7.0.10	824.4	1322.8	8.0 P7	1210.6	900.8	1.47
4316	Floats	7.0.10	441.2	2471.6	8.0 P7	617.9	1764.9	1.40
N1	Floats	7.0.10	419.9	2592.6	8.0 P7	510.0	2134.7	1.21

ᵇ) No doubt an outlier.

AVX-512 Assembly Peek

Here a quick look at the assembly generated by the JIT for the AVX-512 code for the SepParserAvx512PackCmpOrMoveMaskTzcnt parser. The actual C# code is very similar to the AVX2 code, see GitHub for that. Main issue for Sep was some related code had to be extended to support 64-bit masks.

Below the assembly for the inner loop of the parser is shown. This loads two 512-bit vectors from memory, packs them into 8-bit values and permutes them to restore order as with AVX2. What is different here is the use of the mask registers (k1-k8) introduced with AVX-512. However, .NET 8 does not have explicit support for these and the code generation is a bit suboptimal, given mask register are moved to normal registers each time. And then back.

This should improve with .NET 9. Although, it is still the intent to not support explicit mask registers since this would balloon the number of overloads needed in the vector API. See post by Tanner Gooding on mastodon.

G_M000_IG03:                ;; offset=00C9H
       vmovups  zmm5, zmmword ptr [rdx]
       vpackuswb zmm5, zmm5, zmmword ptr [rdx+40H]
       vpermq   zmm5, zmm4, zmm5
       vpcmpeqb k1, zmm5, zmm0
       vpmovm2b zmm6, k1
       vpcmpeqb k1, zmm5, zmm1
       vpmovm2b zmm7, k1
       vpcmpeqb k1, zmm5, zmm2
       vpmovm2b zmm8, k1
       vpcmpeqb k1, zmm5, zmm3
       vpmovm2b zmm5, k1
       vpord    zmm16, zmm6, zmm7
       vpord    zmm16, zmm5, zmm16
       vpord    zmm17, zmm16, zmm8
       vpmovb2m k1, zmm17
       kmovq    r12, k1
       test     r12, r12
       je       G_M000_IG20
 
G_M000_IG04:                ;; offset=0135H
       vpmovb2m k1, zmm5
       kmovq    rbx, k1
       lea      r11, [r12+rdi]
       cmp      rbx, r11
       jne      SHORT G_M000_IG07
       mov      r11d, r8d
       align    [1 bytes for IG05]
 
G_M000_IG05:                ;; offset=0150H
       xor      r12d, r12d
       tzcnt    r12, rbx
       blsr     rbx, rbx
       add      r13, 4
       add      r12d, r11d
       mov      dword ptr [r13], r12d
       test     rbx, rbx
       jne      SHORT G_M000_IG05

2023.09.05