Outperforming Rayon with OpenMP
Replacing Rayon with OpenMP for additional gains
Nov 16, 2021
rust
c
perf
openmp
blockchain
kzg-proofs
bls12-381
Introduction
For the Blockchain Technologies course, students were paired into groups and assigned to produce the fastest Rust library (crate) implementing the KZG10 scheme. Two teams used the same backend, blst (implemented largely in assembly, with direct bindings for both Rust and C). The first team, blst-from-scratch, used said Rust bindings to produce an interface closer to c-kzg, whereas the ckzg team, which I was part of, was responsible for porting the latter to Rust.
Choosing the right tool for the job
It’s kind of obvious for Rust programmers to pick Rayon out of the box, because there aren’t any other viable options for writing parallel code except `std::thread`s, but who wants to manually create and manage threads when simpler solutions exist, anyway? I had to decide which technique to go with:

- using pthreads manually
- some random guy’s threadpool library from GitHub with the most stars
- OpenMP
I picked OpenMP because, while experimenting, I found it to produce the best results, and it was a no-brainer to use, except for figuring out how to seamlessly integrate it with Rust so that it would work on multiple platforms. In the end, I came up with a simple Bash script to automate the whole building and packaging of the shared libraries, and, luckily enough, OpenMP was easily integrated into Rust by either:
- exporting the RUSTFLAGS environment variable pointing to the correct libomp LLVM runtime
```shell
# Linux
apt install libomp-dev
export LIBOMP_PATH=$(find /usr/lib/llvm* -name libiomp5.so | head -n 1)

# macOS
brew install libomp
ln -s /usr/local/opt/libomp/lib/libomp.dylib /usr/local/lib
ln -s /usr/local/opt/libomp/include/omp.h /usr/local/include
export LIBOMP_PATH=/usr/local/lib/libomp.dylib

# And finally
export RUSTFLAGS="-C link-arg=$LIBOMP_PATH"
```
- or creating a .cargo/config.toml file inside the project directory and mentioning it there
```toml
[build]
rustflags = [
    "-C", "link-arg=PATH_TO_LIBOMP_SO_OR_DYLIB"
]
```
Simple!
Searching for bottlenecks
The general procedure for finding what to parallelize is to use CPU profiling tools such as Perf to produce flamegraphs, which are a really nice visual way to represent where a program spends its CPU time. Below is the flamegraph generated by running c-kzg’s fft_g1 benchmark, which, as I noticed, had a huge impact on performance, with the majority of calls tracing down to assembly code. go-kzg, a similar KZG implementation library that was taken as a reference a few times while producing unit tests and benchmarks, likewise showed that its fft_g1 benchmark took the longest to execute.
So, we have 7 groups of benchmarks:
- das
- fft
- fk20
- kzg
- poly
- recover
- zero_poly
Let’s pick fft_g1 from the fft group to parallelize!
Parallelizing fft_g1
The fft_g1 function in all cases calls the fft_g1_fast function, which applies the divide-and-conquer principle to split a large problem into smaller subproblems, recursively solving each of them. The general procedure here is to distribute the work (the recursive fft_g1_fast calls) among worker threads.
The blst-from-scratch team implemented it as follows:
```rust
let (lo, hi) = ret.split_at_mut(half);
rayon::join(
    || fft_g1_fast(lo, data, stride * 2, roots, roots_stride * 2),
    || fft_g1_fast(hi, &data[stride..], stride * 2, roots, roots_stride * 2)
);

for i in 0..half {
    let y_times_root = ret[i + half].mul(&roots[i * roots_stride]);
    ret[i + half] = ret[i].sub(&y_times_root);
    ret[i] = ret[i].add_or_dbl(&y_times_root);
}
```
As a side note, rayon::join doesn’t necessarily spawn new threads: it runs the two closures, potentially in parallel, on Rayon’s thread pool, where an idle worker may steal one of them.
The C equivalent, on the other hand, was as follows:
```c
#pragma omp parallel sections
{
    #pragma omp section
    {
        fft_g1_fast(out, in, stride * 2, roots, roots_stride * 2, half);
    }
    #pragma omp section
    {
        fft_g1_fast(out + half, in + stride, stride * 2, roots, roots_stride * 2, half);
    }
}
#pragma omp parallel
#pragma omp for
for (uint64_t i = 0; i < half; i++) {
    g1_t y_times_root;
    g1_mul(&y_times_root, &out[i + half], &roots[i * roots_stride]);
    g1_sub(&out[i + half], &out[i], &y_times_root);
    g1_add_or_dbl(&out[i], &out[i], &y_times_root);
}
```
In addition to parallel sections, I also utilized OpenMP’s parallel for loop, because I noticed it yielded about 5% better performance on my personal machine. Considering that the ubuntu-latest runner in GitHub Actions CI had only two available threads, the two halves of the problem were split between those two threads, each of which then ran the for loop to do arithmetic operations on the polynomial’s G1 points.
In the above code snippets, fft_g1 calls fft_g1_fast, which at scale 16 would recursively call itself up to 1 << 15 times, with each such call distributed between the two threads. Since we’re computing fft_g1 up to scale 8, there should be (1 << 7) + 1 tasks (not to be confused with OpenMP’s task pragma directive!) for fft_g1_fast, that is, 129 such tasks running in parallel!
Local c-kzg benchmark
Running on my personal laptop with an i5-7300HQ (4 threads, overclocked @ 3.50GHz), all mitigations turned off, and a custom Liquorix kernel, I was able to produce the following results:
(Table: side-by-side benchmark screenshots of the original vs. parallelized c-kzg library.)
That’s twice as fast with as little effort as putting in a few pragmas!
GitHub Actions CI benchmarks
The fft_g1 benchmark was limited to scale 7 because benchmarking it up to scale 16 would push the overall run time of the job past the 6-hour limit: Criterion runs each iteration a couple of hundred times to produce more accurate results, and jobs submitted to GitHub Actions are capped at 360 minutes, after which the other running tasks used to be automatically cancelled.
Benchmarking blst-from-scratch
From the above screenshot, we can see that the parallelized version of the produced library ran 1m 28s shorter than its sequential version. Below are the results of the sequential fft_g1 algorithm:
```
Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 6.6364 s (200 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [33.423 ms 33.785 ms 34.150 ms]
```
Its parallel counterpart cut the average run time for scale 7 by 38.926%:
```
Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 6.3282 s (300 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [20.432 ms 20.634 ms 20.843 ms]
                        change: [-39.822% -38.926% -38.001%] (p = 0.00 < 0.05)
                        Performance has improved.
```
Benchmarking ckzg
The sequential version of the ckzg library ran 2m 7s faster than the same version of blst-from-scratch because its other benchmarks performed faster; the parallelized version, in turn, ran 1m 2s faster than its own sequential version. Below are the results of the sequential fft_g1 algorithm:
```
Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 6.8313 s (200 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [32.194 ms 32.471 ms 32.760 ms]
```
Yet the parallel version of the fft_g1 algorithm performed much faster than blst-from-scratch’s, even though both teams’ unparallelized versions performed evenly:
```
Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 5.0701 s (300 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [16.854 ms 17.107 ms 17.439 ms]
                        change: [-48.216% -47.318% -46.306%] (p = 0.00 < 0.05)
                        Performance has improved.
```
To cut a long story short, ckzg outperformed blst-from-scratch in all of the 7 benchmark groups.
Summary
- OpenMP lets you quickly prototype what is feasible to parallelize with the help of CPU profiling tools such as Perf
- Criterion is actually a really nice benchmarking tool to measure performance, especially when integrated into CI
- ckzg has surpassed blst-from-scratch to become the fastest Rust library (yet) for KZG10 commitments
