Sep 0.2.3 - .NET 8 and AVX-512 Preview
Since "Sep 0.2.0 - Even Faster Parsing (~10 GB/s on Zen 3) and More", there have been a few minor releases of Sep. I will cover those in this post, with a focus on preview .NET 8 AVX-512 support and the related performance improvements.
Release Notes
- Sep 0.2.1
  - Reduce row count in comparison benchmarks for faster runs. This reduces the impact of memory bandwidth, so the results better reflect code optimization efforts.
- Sep 0.2.2
  - Add `Reader(...)`/`Writer(...)` extension method overloads with `configure`. Fixes issue "How to create reader with separator and options?" reported by nwiman, thanks!
- Sep 0.2.3
  - Add preview .NET 8 support with AVX-512 parsing and add benchmark results to git incl. Intel Xeon Silver 4316 with AVX-512 results.
  - Performance improvements range from minor 1.05x-1.10x to some very impressive ~1.5x improvements for float parsing in .NET 8.
.NET 8 and AVX-512 Benchmark Results
For detailed benchmarks see GitHub and the new `benchmarks` directory, which contains BenchmarkDotNet outputs for each hardware platform that I have access to and have benchmarked (only the 4316 has AVX-512):
AMD 5950X
X64 Platform Information
OS=Windows 10 (10.0.19044.2846/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
Intel Xeon Silver 4316
X64 Platform Information
OS=Windows 10 (10.0.17763.3287/1809/October2018Update/Redstone5)
Intel Xeon Silver 4316 CPU 2.30GHz, 1 CPU, 40 logical and 20 physical cores
Neoverse N1
ARM64 Platform Information (cloud instance)
OS=ubuntu 22.04
Neoverse N1, ARM, 4 vCPU
And .NET runtimes:
- .NET 7.0.10 (7.0.1023.36312)
- .NET 8.0.0 Preview 7 (8.0.23.37506)
For this post I am going to focus on Sep results only and what improvements .NET 8 and AVX-512 appear to have. All details can be seen on GitHub incl. description of the different benchmarks.
The results below are a bit murky with some outliers that are likely just noise. However, overall I make the following observations:
- On the low-level scopes (Row and Cols), .NET 8 provides a nice 1.05-1.10x improvement. Given that I have spent hours shaving off just a few percent in the low-level parsing, that is quite significant.
- On PackageAssets with just separators and line ends, AVX-512 gives about a 5% improvement. However, when there are quotes and about 3x more special characters to find, there is a 20-30% improvement on top of .NET 8.
- Finally, and most significantly, .NET 8 appears to improve csFastFloat float parsing by 1.2-1.5x! 🚀 I guess this is due to JIT improvements in .NET 8, as there are no code changes. This perhaps suggests that some code in csFastFloat could be improved for older runtimes.
PackageAssets
Hardware | Scope | .NET | MB/s | ns/row | .NET | MB/s | ns/row | Speedup |
---|---|---|---|---|---|---|---|---|
5950X | Row | 7.0.10 | 11375.3 | 51.3 | 8.0 P7 | 11872.7 | 49.2 | 1.04 |
4316 | Row | 7.0.10 | 4905.8 | 119.0 | 8.0 P7 | 5363.8 | 108.8 | 1.09 |
N1 | Row | 7.0.10 | 2393.9 | 243.0 | 8.0 P7 | 2421.2 | 240.3 | 1.01 |
5950X | Cols | 7.0.10 | 8769.4 | 66.6 | 8.0 P7 | 9153.5 | 63.8 | 1.04 |
4316 | Cols | 7.0.10 | 3942.7 | 148.0 | 8.0 P7 | 4366.8 | 133.7 | 1.11 |
N1 | Cols | 7.0.10 | 2002.8 | 290.5 | 8.0 P7 | 2042.8 | 284.8 | 1.02 |
5950X | Asset | 7.0.10 | 949.6 | 614.6 | 8.0 P7 | 960.1 | 607.9 | 1.01 |
4316 | Asset | 7.0.10 | 497.1 | 1174.2 | 8.0 P7 | 524.2 | 1113.4 | 1.05 |
N1 | Asset | 7.0.10 | 414.5 | 1403.3 | 8.0 P7 | 429.5 | 1354.3 | 1.04 |
PackageAssets with Quotes
Hardware | Scope | .NET | MB/s | ns/row | .NET | MB/s | ns/row | Speedup |
---|---|---|---|---|---|---|---|---|
5950X | Row | 7.0.10 | 4501.8 | 148.3 | 8.0 P7 | 4928.3 | 135.5 | 1.09 |
4316 | Row | 7.0.10 | 2206.5 | 302.5 | 8.0 P7 | 2705.0 | 246.8 | 1.23 |
N1 | Row | 7.0.10 | 1306.5 | 509.5 | 8.0 P7 | 1420.5 | 468.6 | 1.09 |
5950X | Cols | 7.0.10 | 4321.6 | 154.5 | 8.0 P7 | 4557.6 | 146.5 | 1.05 |
4316 | Cols | 7.0.10 | 2112.6 | 316.0 | 8.0 P7 | 2140.6 | 311.8 | ᵃ1.01 |
N1 | Cols | 7.0.10 | 1175.9 | 566.1 | 8.0 P7 | 1295.8 | 513.7 | 1.10 |
5950X | Asset | 7.0.10 | 875.4 | 762.6 | 8.0 P7 | 965.1 | 691.7 | 1.10 |
4316 | Asset | 7.0.10 | 468.9 | 1423.8 | 8.0 P7 | 639.6 | 1043.7 | 1.36 |
N1 | Asset | 7.0.10 | 377.4 | 1764.0 | 8.0 P7 | 392.1 | 1697.7 | 1.04 |
ᵃ) No doubt an outlier. Haven’t had time to rerun the benchmark. Runs will have variations.
Floats
Hardware | Scope | .NET | MB/s | ns/row | .NET | MB/s | ns/row | Speedup |
---|---|---|---|---|---|---|---|---|
5950X | Row | 7.0.10 | 10137.6 | 107.6 | 8.0 P7 | 10445.9 | 104.4 | 1.03 |
4316 | Row | 7.0.10 | 5033.3 | 216.7 | 8.0 P7 | 4667.7 | 233.6 | ᵇ0.93 |
N1 | Row | 7.0.10 | 2359.4 | 461.4 | 8.0 P7 | 2339.4 | 465.3 | 0.99 |
5950X | Cols | 7.0.10 | 8804.5 | 123.9 | 8.0 P7 | 8912.6 | 122.4 | 1.01 |
4316 | Cols | 7.0.10 | 3939.4 | 276.8 | 8.0 P7 | 4078.0 | 267.4 | 1.04 |
N1 | Cols | 7.0.10 | 2035.0 | 535.0 | 8.0 P7 | 1980.9 | 549.6 | 0.97 |
5950X | Floats | 7.0.10 | 824.4 | 1322.8 | 8.0 P7 | 1210.6 | 900.8 | 1.47 |
4316 | Floats | 7.0.10 | 441.2 | 2471.6 | 8.0 P7 | 617.9 | 1764.9 | 1.40 |
N1 | Floats | 7.0.10 | 419.9 | 2592.6 | 8.0 P7 | 510.0 | 2134.7 | 1.21 |
ᵇ) No doubt an outlier.
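To make the Speedup column easier to read, here is a minimal sketch of how such a ratio can be derived from either the MB/s or the ns/row columns. It assumes Speedup is simply the .NET 8 result relative to the .NET 7 result, which is consistent with the numbers in the tables but is not taken from the benchmark code itself. The function names are hypothetical and the input values are copied from the 5950X Floats row above.

```csharp
using System;
using System.Globalization;

// Values copied from the 5950X "Floats" scope row in the Floats table.
const double net7MBPerSec = 824.4, net8MBPerSec = 1210.6;
const double net7NsPerRow = 1322.8, net8NsPerRow = 900.8;

// Higher throughput is better, so the new result goes in the numerator.
static double SpeedupFromThroughput(double baselineMBPerSec, double newMBPerSec) =>
    newMBPerSec / baselineMBPerSec;

// Lower time per row is better, so the baseline goes in the numerator.
static double SpeedupFromTime(double baselineNsPerRow, double newNsPerRow) =>
    baselineNsPerRow / newNsPerRow;

var culture = CultureInfo.InvariantCulture;
Console.WriteLine(SpeedupFromThroughput(net7MBPerSec, net8MBPerSec).ToString("F2", culture)); // 1.47
Console.WriteLine(SpeedupFromTime(net7NsPerRow, net8NsPerRow).ToString("F2", culture));       // 1.47
```

Both columns give the same ratio after rounding, as expected when the input size per row is the same for both runtimes.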
AVX-512 Assembly Peek
Here is a quick look at the assembly generated by the JIT for the AVX-512 code in the `SepParserAvx512PackCmpOrMoveMaskTzcnt` parser. The actual C# code is very similar to the AVX2 code, see GitHub for that. The main issue for Sep was that some related code had to be extended to support 64-bit masks.
Below, the assembly for the inner loop of the parser is shown. This loads two 512-bit vectors from memory, packs them into 8-bit values and permutes them to restore order, as with AVX2. What is different here is the use of the mask registers (`k0-k7`) introduced with AVX-512. However, .NET 8 does not have explicit support for these and the code generation is a bit suboptimal, given that mask registers are moved to normal registers and then back each time.
This should improve with .NET 9. However, the intent is still not to support explicit mask registers, since this would balloon the number of overloads needed in the vector API. See the post by Tanner Gooding on Mastodon.
G_M000_IG03: ;; offset=00C9H
vmovups zmm5, zmmword ptr [rdx]
vpackuswb zmm5, zmm5, zmmword ptr [rdx+40H]
vpermq zmm5, zmm4, zmm5
vpcmpeqb k1, zmm5, zmm0
vpmovm2b zmm6, k1
vpcmpeqb k1, zmm5, zmm1
vpmovm2b zmm7, k1
vpcmpeqb k1, zmm5, zmm2
vpmovm2b zmm8, k1
vpcmpeqb k1, zmm5, zmm3
vpmovm2b zmm5, k1
vpord zmm16, zmm6, zmm7
vpord zmm16, zmm5, zmm16
vpord zmm17, zmm16, zmm8
vpmovb2m k1, zmm17
kmovq r12, k1
test r12, r12
je G_M000_IG20
G_M000_IG04: ;; offset=0135H
vpmovb2m k1, zmm5
kmovq rbx, k1
lea r11, [r12+rdi]
cmp rbx, r11
jne SHORT G_M000_IG07
mov r11d, r8d
align [1 bytes for IG05]
G_M000_IG05: ;; offset=0150H
xor r12d, r12d
tzcnt r12, rbx
blsr rbx, rbx
add r13, 4
add r12d, r11d
mov dword ptr [r13], r12d
test rbx, rbx
jne SHORT G_M000_IG05
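For readers who prefer C# to assembly, the following is a rough sketch of what the loop above does. It is not Sep's actual code. It uses the portable `Vector512` API from .NET 8 rather than the AVX-512 intrinsics, and it clamps each char to 255 before narrowing as a stand-in for the saturating pack (`vpackuswb`) and permute (`vpermq`) seen in the assembly. The function names `SpecialCharsMask` and `SpecialCharPositions` are hypothetical, and the set of special characters (separator, `\n`, `\r` and `"`) is an assumption based on the four compares in the assembly.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Returns a 64-bit mask where bit i is set if chars[i] is a special character.
static ulong SpecialCharsMask(ReadOnlySpan<char> chars, char separator)
{
    if (chars.Length < 64) throw new ArgumentException("Need at least 64 chars.", nameof(chars));
    if (separator > 0x7F) throw new ArgumentException("Sketch assumes an ASCII separator.", nameof(separator));

    ReadOnlySpan<ushort> utf16 = MemoryMarshal.Cast<char, ushort>(chars);

    // Clamp to 255 so that chars outside the byte range cannot alias a
    // special character when truncated to 8 bits (emulates saturation).
    Vector512<ushort> max = Vector512.Create((ushort)byte.MaxValue);
    Vector512<ushort> lower = Vector512.Min(Vector512.Create<ushort>(utf16), max);
    Vector512<ushort> upper = Vector512.Min(Vector512.Create<ushort>(utf16.Slice(32)), max);

    // Narrow 2 x 32 16-bit values into 64 bytes, keeping the original order.
    Vector512<byte> bytes = Vector512.Narrow(lower, upper);

    // Compare against each special character and OR the results together.
    Vector512<byte> special =
        Vector512.Equals(bytes, Vector512.Create((byte)separator)) |
        Vector512.Equals(bytes, Vector512.Create((byte)'\n')) |
        Vector512.Equals(bytes, Vector512.Create((byte)'\r')) |
        Vector512.Equals(bytes, Vector512.Create((byte)'"'));

    // One bit per byte, which is why 64-bit masks are needed for 512-bit vectors.
    return Vector512.ExtractMostSignificantBits(special);
}

// Walks the set bits of the mask, lowest first, like the tzcnt/blsr loop.
static List<int> SpecialCharPositions(ulong mask, int baseIndex)
{
    var positions = new List<int>();
    while (mask != 0)
    {
        positions.Add(baseIndex + BitOperations.TrailingZeroCount(mask)); // tzcnt
        mask &= mask - 1; // clear lowest set bit, what blsr does
    }
    return positions;
}

string text = "a;b\n".PadRight(64, 'x');
ulong found = SpecialCharsMask(text, ';');
Console.WriteLine(string.Join(",", SpecialCharPositions(found, 0))); // 1,3
```

On hardware without AVX-512 the `Vector512` calls still work through a software fallback, just slower, so the sketch can be run anywhere .NET 8 is available.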