Outperforming Rayon with OpenMP

Replacing Rayon with OpenMP for additional gains

Nov 16, 2021 rust c perf openmp blockchain kzg-proofs bls12-381

[Image: rip-craberino]

Introduction

For the Blockchain Technologies course, students were paired into groups and assigned to produce the fastest Rust library (crate) implementing the KZG10 scheme. Two teams used the same backend, blst (implemented in assembly, with direct bindings for Rust and C). The first team, blst-from-scratch, used those Rust bindings to produce an interface closer to c-kzg, whereas the ckzg team, which I was part of, was responsible for porting the latter over to Rust.

Choosing the right tool for the job

It’s kind of obvious for Rust programmers to pick Rayon out of the box, because there aren’t many other viable options for writing parallel code, except for std::thread, but who wants to manually create and manage threads when simpler solutions exist, anyway? Still, I had to decide which technique to go with.

I picked OpenMP because, while experimenting, I found it to produce the best results, and it was a no-brainer to use, apart from figuring out how to seamlessly integrate it with Rust so that it would work on multiple platforms. In the end, I came up with a simple Bash script to automate the building and packaging of the shared libraries, and, luckily enough, OpenMP was easily integrated into Rust by either installing libomp and exporting the required linker flag through RUSTFLAGS:

# Linux
apt install libomp-dev
export LIBOMP_PATH=$(find /usr/lib/llvm* -name libiomp5.so | head -n 1)

# MacOS
brew install libomp
ln -s /usr/local/opt/libomp/lib/libomp.dylib /usr/local/lib
ln -s /usr/local/opt/libomp/include/omp.h /usr/local/include
export LIBOMP_PATH=/usr/local/lib/libomp.dylib

# And finally
export RUSTFLAGS="-C link-arg=$LIBOMP_PATH"
or by putting the same flag into the Cargo configuration file (.cargo/config.toml):

[build]
rustflags = [
  "-C", "link-arg=PATH_TO_LIBOMP_SO_OR_DYLIB"
]

Simple!
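Alternatively, the same linker argument could be emitted from a Cargo build script instead of an environment variable or a config file. Below is a minimal sketch of that approach (not what we actually shipped), assuming the LIBOMP_PATH variable from the snippet above is set:

// build.rs -- hypothetical alternative: forward LIBOMP_PATH to the linker.
use std::env;

fn main() {
    // Re-run the build script whenever the path to libomp changes.
    println!("cargo:rerun-if-env-changed=LIBOMP_PATH");
    if let Ok(path) = env::var("LIBOMP_PATH") {
        // Pass the shared library straight to the final link step.
        println!("cargo:rustc-link-arg={}", path);
    }
}

The upside of this route would be that users of the crate wouldn't have to remember to set RUSTFLAGS themselves.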

Searching for bottlenecks

The general procedure for finding what to parallelize is to use CPU profiling tools such as perf, which produce flamegraphs, a really nice visual way to represent where a program spends its CPU time. Below is the flamegraph generated by running c-kzg’s fft_g1 benchmark:

[Flamegraph of the fft_g1 benchmark]

As I noticed, fft_g1 had a huge impact on performance, with the majority of calls tracing down to assembly code. A similar KZG implementation, go-kzg, which we used as a reference a few times while producing unit tests and benchmarks, likewise showed that its fft_g1 benchmark took the longest to execute.

So, we have 7 groups of benchmarks, of which let’s pick fft_g1 from the fft group to parallelize!

Parallelizing fft_g1

The fft_g1 function in all cases calls the fft_g1_fast function, which applies the divide-and-conquer principle to split a large problem into smaller subproblems and solve each of them recursively. The general procedure here is to distribute the work (the recursive fft_g1_fast calls) among worker threads.

The blst-from-scratch team implemented it as follows:

let (lo, hi) = ret.split_at_mut(half);
rayon::join(
  || fft_g1_fast(lo, data, stride * 2, roots, roots_stride * 2),
  || fft_g1_fast(hi, &data[stride..], stride * 2, roots, roots_stride * 2)
);

for i in 0..half {
  let y_times_root = ret[i + half].mul(&roots[i * roots_stride]);
  ret[i + half] = ret[i].sub(&y_times_root);
  ret[i] = ret[i].add_or_dbl(&y_times_root);
}

As a side note, rayon::join runs the two closures potentially in parallel: it doesn’t spawn new threads on each call, but hands the work to Rayon’s thread pool, where an idle worker may steal one of the closures.
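To make the pattern concrete, here’s a tiny self-contained sketch (illustrative only, not the team’s code) of the same split-and-join recursion, with plain integers standing in for G1 points:

// Hypothetical toy example: recursively square every element of a slice,
// letting rayon::join run the two halves potentially in parallel.
use rayon::join;

fn square_all(data: &mut [u64]) {
    if data.len() <= 1 {
        if let Some(x) = data.first_mut() {
            *x *= *x;
        }
        return;
    }
    let half = data.len() / 2;
    let (lo, hi) = data.split_at_mut(half);
    // One closure runs on the current thread; an idle worker from
    // Rayon's pool may steal the other.
    join(|| square_all(lo), || square_all(hi));
}

fn main() {
    let mut v: Vec<u64> = (1..=8).collect();
    square_all(&mut v);
    println!("{:?}", v); // [1, 4, 9, 16, 25, 36, 49, 64]
}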

The OpenMP-based C equivalent, on the other hand, was as follows:

#pragma omp parallel sections
{
  #pragma omp section
  {
    fft_g1_fast(out, in, stride * 2, roots, roots_stride * 2, half);
  }
  #pragma omp section
  {
    fft_g1_fast(out + half, in + stride, stride * 2, roots, roots_stride * 2, half);
  }
}
#pragma omp parallel
#pragma omp for
for (uint64_t i = 0; i < half; i++) {
  g1_t y_times_root;
  g1_mul(&y_times_root, &out[i + half], &roots[i * roots_stride]);
  g1_sub(&out[i + half], &out[i], &y_times_root);
  g1_add_or_dbl(&out[i], &out[i], &y_times_root);
}

In addition to parallel sections, I also utilized OpenMP’s parallel for loop, because I noticed it yielded about 5% better performance on my personal machine. Since the ubuntu-latest runner in GitHub Actions CI had only two available threads, the two halves of the problem were shared among those two threads, each of which then ran the for loop to perform arithmetic operations on the G1 points of the polynomial.

In the above code snippets, fft_g1 calls fft_g1_fast, which, at scale 16, would recursively call itself at most 1 << 15 times, with each such call distributed among the 2 threads. Since we’re computing fft_g1 up to scale 8, there should be (1 << 7) + 1 = 129 such fft_g1_fast tasks (not to be confused with OpenMP’s task pragma directive!) running in parallel!
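For comparison, here’s a rough, self-contained sketch of what the same butterfly loop could look like with Rayon’s parallel iterators, using wrapping integer arithmetic as a stand-in for the G1 point operations (illustrative only, not code from either team):

// Hypothetical stand-in for the butterfly step: the lo/hi halves are updated
// in parallel, with integer ops replacing g1_mul / g1_sub / g1_add_or_dbl.
use rayon::prelude::*;

fn butterfly(ret: &mut [u64], roots: &[u64], roots_stride: usize) {
    let half = ret.len() / 2;
    let (lo, hi) = ret.split_at_mut(half);
    lo.par_iter_mut()
        .zip(hi.par_iter_mut())
        .enumerate()
        .for_each(|(i, (lo_i, hi_i))| {
            let y_times_root = hi_i.wrapping_mul(roots[i * roots_stride]);
            *hi_i = lo_i.wrapping_sub(y_times_root);
            *lo_i = lo_i.wrapping_add(y_times_root);
        });
}

fn main() {
    let mut ret: Vec<u64> = (0..8).collect();
    let roots: Vec<u64> = (1..=4).collect();
    butterfly(&mut ret, &roots, 1);
    println!("{:?}", ret);
}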

Local c-kzg benchmark

Running on my personal laptop with an i5-7300HQ (4 threads, overclocked @ 3.50GHz), all mitigations turned off, and a custom Liquorix kernel, I was able to produce the following results:

Original c-kzg library:

$ ./fft_g1_bench
*** Benchmarking FFT_g1, 1 second per test.
fft_g1/scale_4 1729769 ns/op
fft_g1/scale_5 4935085 ns/op
fft_g1/scale_6 12897731 ns/op
fft_g1/scale_7 32022026 ns/op
fft_g1/scale_8 76552852 ns/op
fft_g1/scale_9 184970057 ns/op
fft_g1/scale_10 418273808 ns/op
fft_g1/scale_11 919499032 ns/op
fft_g1/scale_12 2025633037 ns/op
fft_g1/scale_13 4479830518 ns/op
fft_g1/scale_14 9754557496 ns/op
fft_g1/scale_15 21125613058 ns/op

Parallelized c-kzg library:

$ OMP_NUM_THREADS=4 ./fft_g1_bench
*** Benchmarking FFT_g1, 1 second per test.
fft_g1/scale_4 839454 ns/op
fft_g1/scale_5 2378457 ns/op
fft_g1/scale_6 6404191 ns/op
fft_g1/scale_7 16325966 ns/op
fft_g1/scale_8 38141754 ns/op
fft_g1/scale_9 90948810 ns/op
fft_g1/scale_10 204757690 ns/op
fft_g1/scale_11 457509973 ns/op
fft_g1/scale_12 1006089135 ns/op
fft_g1/scale_13 2240095284 ns/op
fft_g1/scale_14 4879448286 ns/op
fft_g1/scale_15 10650876381 ns/op

That’s twice as fast with as little effort as putting in a few pragmas!

GitHub Actions CI benchmarks

The fft_g1 benchmark was limited to scale 7 because the overall run time for the job would exceed the 6-hour limit if I were to benchmark up to scale 16: Criterion runs each iteration hundreds of times to produce more accurate results, and jobs submitted to GitHub Actions are capped at 360 minutes, so hitting the limit used to automatically cancel the other running jobs.
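For context, a Criterion benchmark of this kind is typically wired up roughly like the sketch below (the closure body is a placeholder, not the exact harness either team used):

// Hypothetical Criterion harness sketch; the closure body stands in for a
// real call to fft_g1 over 2^scale points.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_fft_g1(c: &mut Criterion) {
    for scale in 4..=7u32 {
        c.bench_function(&format!("bench_fft_g1 scale: '{}'", scale), |b| {
            let n = 1usize << scale;
            // Criterion decides the sample and iteration counts itself,
            // which is why a full run up to scale 16 gets so expensive.
            b.iter(|| black_box(n));
        });
    }
}

criterion_group!(benches, bench_fft_g1);
criterion_main!(benches);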

Benchmarking blst-from-scratch

[Screenshot: blst-from-scratch benchmark jobs in GitHub Actions]

From the above screenshot we can see that the parallelized version of the produced library finished 1m 28s sooner than its sequential version. Below are the results of the sequential fft_g1 algorithm:

Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 6.6364 s (200 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [33.423 ms 33.785 ms 34.150 ms]

The average run time at scale 7 was then cut down by 38.926% by its parallel counterpart:

Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 6.3282 s (300 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [20.432 ms 20.634 ms 20.843 ms]
                        change: [-39.822% -38.926% -38.001%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking ckzg

[Screenshot: ckzg benchmark jobs in GitHub Actions]

The sequential version of the ckzg library finished 2m 7s faster than the sequential version of blst-from-scratch because its other benchmarks performed faster; the parallelized version, in turn, ran 1m 2s faster than its own sequential version. Below are the results of the sequential fft_g1 algorithm:

Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 6.8313 s (200 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [32.194 ms 32.471 ms 32.760 ms]

Yet the parallel version of the fft_g1 algorithm performed much better here than it did for blst-from-scratch, even though the unparallelized versions from both teams performed about evenly:

Benchmarking bench_fft_g1 scale: '7'
Benchmarking bench_fft_g1 scale: '7': Warming up for 3.0000 s
Benchmarking bench_fft_g1 scale: '7': Collecting 100 samples in estimated 5.0701 s (300 iterations)
Benchmarking bench_fft_g1 scale: '7': Analyzing
bench_fft_g1 scale: '7' time:   [16.854 ms 17.107 ms 17.439 ms]
                        change: [-48.216% -47.318% -46.306%] (p = 0.00 < 0.05)
                        Performance has improved.

To cut a long story short, ckzg outperformed blst-from-scratch in all of the 7 benchmark groups.

Summary

Profiling showed that fft_g1 dominated c-kzg’s run time, and a handful of OpenMP pragmas roughly halved it on my 4-thread laptop. In CI, the OpenMP-parallelized ckzg cut the scale-7 fft_g1 time by about 47%, versus roughly 39% for the Rayon-based blst-from-scratch, and ckzg ended up ahead in all 7 benchmark groups.