Profile-Guided Optimization

Profile-guided optimization (PGO) and feedback-driven optimization (FDO) are two names for the same concept. The idea is that a binary is compiled with built-in instrumentation which, when executed, generates a profile. This profile can then be used as input for the compiler in a subsequent build of the same binary, where it serves as a guide for further optimizations.
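As a point of reference, the instrumentation-based flow described above can be sketched with GCC’s -fprofile-generate and -fprofile-use options. The program demo.c, the temporary directory and the output names below are made up for illustration; a real project would pass these flags through its own build system.

```shell
# Sketch: instrumentation-based PGO on a throwaway program (demo.c is hypothetical).
workdir=$(mktemp -d)

cat > "$workdir/demo.c" <<'EOF'
#include <stdio.h>

int main(void)
{
    unsigned long sum = 0;
    /* A data-dependent branch for the profile to describe. */
    for (unsigned long i = 0; i < 10000000UL; i++)
        sum += (i % 3 == 0) ? i : 1;
    printf("%lu\n", sum);
    return 0;
}
EOF

if command -v gcc >/dev/null 2>&1; then
    (
        cd "$workdir" || exit 1
        # 1. Build with instrumentation.
        gcc -O2 -fprofile-generate -c demo.c -o demo.o
        gcc -fprofile-generate demo.o -o demo-instrumented
        # 2. Run the workload; on exit the program writes demo.gcda.
        ./demo-instrumented > /dev/null
        # 3. Rebuild; GCC reads demo.gcda and uses it to guide optimization.
        gcc -O2 -fprofile-use -c demo.c -o demo.o
        gcc demo.o -o demo-pgo
    )
else
    echo "gcc not found, skipping the build steps" >&2
fi
```

The rest of this guide replaces step 2 with external sampling, so the binary that runs the workload does not need to be instrumented.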

Profiling real-world applications can be hard. Ideally, the profile should be generated by a representative workload of the program, but it’s not always possible to simulate one. Moreover, the built-in instrumentation introduces a performance penalty while the binary runs.

To address these problems, we can use tools like perf to “observe” what the binary is doing from the outside, by sampling hardware events that the Linux kernel exposes from the CPU’s Performance Monitoring Unit (PMU). This makes the process more suitable for production environments. This technique works better than the regular built-in instrumentation, but it still has a few drawbacks that we will expand on later.

Caveats

  • The purpose of this guide is to provide some basic information about what PGO is and how it works. In order to do that, we will look at a simple example using OpenSSL (more specifically, the openssl speed command) and learn how to do basic profiling. We will not go into a deep dive on how to build the project, and it is assumed that the reader is comfortable with compilation, compiler flags and using the command line.

  • Despite being a relatively popular technique, PGO is not always the best approach to optimize a program. The profiling data is closely tied to the workload that generated it, which means that the optimized program might actually perform worse when other types of workloads are executed. There is no one-size-fits-all solution for this problem, and sometimes the best approach might be to not use PGO after all.

  • If you plan to follow along, we recommend setting up a test environment for this experiment. The ideal setup is a bare metal machine because it’s the most direct way to collect the performance metrics. If you would like to use a virtual machine (created using QEMU/libvirt, LXD, Multipass, etc.), it will likely only work on Intel-based processors due to how the Virtual Performance Monitoring Unit (vPMU) works.

perf and AutoFDO

Using perf to monitor a process and obtain data about its runtime workload produces data files in a specific binary format that we will call perfdata. Unfortunately, GCC doesn’t understand this file format; instead, it expects a profile file in a format called gcov. To convert a perfdata file into a gcov one, we need a tool called autofdo. This tool expects the binary being profiled to obey certain constraints:

  • The binary cannot be stripped of its debug symbols. autofdo does not support separate debug information files (i.e., it can’t work with Ubuntu’s .ddeb packages), and virtually all Ubuntu packages run strip during their build in order to generate the .ddeb packages.

  • The debug information file(s) cannot be processed by dwz. This tool’s purpose is to compress the debug information generated when building a binary, and again, virtually all Ubuntu packages use it. For this reason, it is currently not possible to profile most Ubuntu packages without first rebuilding them with dwz disabled.

  • We must be mindful of the options we pass to perf, particularly when it comes to recording branch prediction events. The options will likely vary depending on whether you are using an Intel or AMD processor, for example.

On top of that, the current autofdo version in Ubuntu (0.19-3build3, at the time of this writing) is not recent enough to process the perfdata files we will generate. There is a PPA with a newer version of the autofdo package for Ubuntu Noble. If you are running another version of Ubuntu and want to install a newer version of autofdo, you will need to build the software manually (please refer to the upstream repository for further instructions).

A simple PGO scenario: openssl speed

PGO makes more sense when your software is CPU-bound, i.e., when it performs CPU-intensive work and is not mostly waiting on I/O. Even if your software spends time waiting on I/O, using PGO might still be helpful; its effects would be less noticeable, though.

OpenSSL has a built-in benchmark command called openssl speed, which tests the performance of its cryptographic algorithms. At first sight this seems excellent for PGO because there is practically no I/O involved, and we are only constrained by how fast the CPU can run. It is possible, however, to encounter cases where the code being benchmarked has already been highly optimized, which limits what PGO can add; we will discuss this problem later. In the end, the real benefit will come after you get comfortable with our example and apply similar methods to your own software stack.

Running OpenSSL tests

To measure the performance impact of PGO, it is important to perform several tests using an OpenSSL binary without PGO and then with it. We repeat the tests many times in order to eliminate outliers in the final results. It’s also important to disable as much background load as possible on the machine where the tests are being performed, since we don’t want it to influence the results.

The first thing we have to do is run openssl speed several times before enabling PGO, so that we have a baseline to compare with. As explained in the sections above, autofdo handles neither stripped binaries nor dwz-processed debug information, so we need to make sure the resulting binary obeys these restrictions. See the section below for more details on that.

After confirming that everything looks OK, we are ready to start the benchmark. Let’s run the command:

$ openssl speed -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc

This will run benchmark tests for md5, sha512, rsa2048 and aes-256-cbc. Each test lasts 60 seconds and processes as many N-byte chunks of data as possible (with N varying from 16 to 16384) with each algorithm. The exception is rsa2048, whose performance is measured in signatures and verifications per second. By the end, you should see a report like the following:

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha512           91713.74k   366632.96k   684349.30k  1060512.22k  1260359.00k  1273277.92k
aes-256-cbc    1266018.36k  1437870.33k  1442743.14k  1449933.84k  1453336.84k  1453484.99k
md5             140978.58k   367562.53k   714094.14k   955267.21k  1060969.81k  1068296.87k
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000173s 0.000012s   5776.8  86791.9
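Since the comparison relies on repeated runs, it helps to save each report and average a column afterwards. The sketch below assumes the human-readable report format shown above; the function names run_baseline and mean_throughput, and the baseline-N.txt file names, are made up for this example.

```shell
# run_baseline N: run the benchmark N times, saving each report to baseline-<i>.txt.
run_baseline() {
    runs=$1
    i=1
    while [ "$i" -le "$runs" ]; do
        openssl speed -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc \
            > "baseline-$i.txt" 2>&1
        i=$((i + 1))
    done
}

# mean_throughput ALGO FILE...: average the last column (16384-byte chunks)
# of ALGO's row across the given reports, in 1000s of bytes per second.
mean_throughput() {
    algo=$1
    shift
    awk -v algo="$algo" '
        $1 == algo { value = $NF; sub(/k$/, "", value); sum += value; n++ }
        END { if (n == 0) exit 1; printf "%.2f\n", sum / n }
    ' "$@"
}
```

For instance, run_baseline 5 followed by mean_throughput sha512 baseline-*.txt would print the mean sha512 throughput over five runs. The rsa row uses a different layout and is not handled by this helper.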

Building OpenSSL for profiling

Before we are able to profile OpenSSL, we need to make sure that we compile the software in a way that the generated program meets the requirements. In our case, you can use the file command after building the binary to make sure that it has not been stripped:

$ file /usr/bin/openssl
/usr/bin/openssl: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=42a5ac797b6981bd45a9ece3f67646edded56bdb, for GNU/Linux 3.2.0, with debug_info, not stripped

Note how file reports the binary as being with debug_info, not stripped. This is the output you need to look for when inspecting your own binaries.

You can use the eu-readelf command (from the elfutils package) to determine whether any compression was performed with dwz:

$ eu-readelf -S /usr/bin/openssl
[...]
Section Headers:
[Nr] Name                 Type         Addr             Off      Size     ES Flags Lk Inf Al
[...]
[31] .debug_abbrev        PROGBITS     0000000000000000 001bcbf3 000110c4  0        0   0  1
[...]

If the Flags field has C (as in compressed) in it, this means that dwz was used. You will then need to figure out how to disable its execution during the build; for this particular build, this was done by overriding the dh_dwz rule in debian/rules, setting it to an empty value, and executing the build again. The particularities of how to skip build steps depend on the build system being used, and as such this guide will not go into further details.
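The two checks above can be wrapped in small helpers so they are easy to repeat after every rebuild. This is only a sketch: the function names are invented, the first helper pattern-matches the file output shown earlier, and the second looks for the C flag in the .debug_* rows of the section table, assuming the column layout shown above.

```shell
# is_profilable_file_output TEXT: succeed if TEXT (the output of `file`)
# reports both debug info and an unstripped binary.
is_profilable_file_output() {
    case "$1" in
        *"with debug_info"*"not stripped"*) return 0 ;;
        *) return 1 ;;
    esac
}

# has_compressed_debug_sections: read `eu-readelf -S` output on stdin and
# succeed if any .debug_* section carries the C flag.
has_compressed_debug_sections() {
    awk '
        # Drop the "[Nr]" index so that the section name becomes field 1.
        { sub(/^ *\[ *[0-9]+\] */, "") }
        # Rows with a non-empty Flags column have 10 fields; Flags is field 7.
        $1 ~ /^\.debug_/ && NF == 10 && $7 ~ /C/ { found = 1 }
        END { exit !found }
    '
}
```

Typical use would be is_profilable_file_output "$(file /usr/bin/openssl)" and eu-readelf -S /usr/bin/openssl | has_compressed_debug_sections, checking the exit status of each.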

Using perf to obtain profile data

With OpenSSL prepared and ready to be profiled, it’s now time to use perf to obtain profiling data from our workload. First, let’s make sure we can actually access profiling data in the system. As root, run:

# echo -1 > /proc/sys/kernel/perf_event_paranoid

This should allow all users to monitor events in the system.
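Because this setting is system-wide and relaxes a security restriction, you may want to note the previous value and restore it once profiling is done. A minimal sketch, in which the helper name needs_relaxing is invented for illustration:

```shell
paranoid_file=/proc/sys/kernel/perf_event_paranoid

# needs_relaxing VALUE: succeed if VALUE is stricter than -1.
needs_relaxing() {
    [ "$1" -gt -1 ]
}

if [ -r "$paranoid_file" ]; then
    previous=$(cat "$paranoid_file")
    if needs_relaxing "$previous"; then
        echo "perf_event_paranoid is $previous; set it to -1 as root before profiling"
    fi
    # Print the command that puts the old value back (to be run as root).
    echo "to restore later: echo $previous > $paranoid_file"
fi
```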

Then, if you are on an Intel system, you can invoke perf using:

$ sudo perf record \
	-e br_inst_retired.near_taken \
	--branch-any \
	--all-cpus \
	--output openssl-nonpgo.perfdata \
	-- openssl speed -mr -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc

On an AMD system, use the following invocation:

$ sudo perf record \
	-e 'cpu/event=0xc4,umask=0x0,name=ex_ret_brn_tkn/' \
	--branch-any \
	--all-cpus \
	--output openssl-nonpgo.perfdata \
	-- openssl speed -mr -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc

After the command has finished running, you should see a file named openssl-nonpgo.perfdata in your current directory.

Note that the only thing that differs between the Intel and AMD variants is the PMU event to be monitored. Also, note how we are using sudo to invoke perf record. This is necessary in order to obtain full access to Linux kernel symbols and relocation information. It also means that the openssl-nonpgo.perfdata file will be owned by root, so its ownership needs to be adjusted (replace ubuntu:ubuntu with your own user and group):

$ sudo chown ubuntu:ubuntu openssl-nonpgo.perfdata
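If the same script has to run on both kinds of machines, the event can be chosen from the CPU vendor string. The sketch below uses the two events shown above; the function names are invented, and it assumes an x86 Linux system where /proc/cpuinfo has a vendor_id line.

```shell
# branch_event_for_vendor VENDOR: print the taken-branch event for that vendor.
branch_event_for_vendor() {
    case "$1" in
        GenuineIntel) echo 'br_inst_retired.near_taken' ;;
        AuthenticAMD) echo 'cpu/event=0xc4,umask=0x0,name=ex_ret_brn_tkn/' ;;
        *) echo "unsupported CPU vendor: $1" >&2; return 1 ;;
    esac
}

# detect_vendor: print the vendor_id of the first CPU listed in /proc/cpuinfo.
detect_vendor() {
    awk -F': ' '/^vendor_id/ { print $2; exit }' /proc/cpuinfo
}
```

The perf record invocation would then take -e "$(branch_event_for_vendor "$(detect_vendor)")" in place of the hard-coded event.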

Now, we need to convert the file to gcov.

Converting perfdata to gcov with autofdo

With autofdo installed, you can convert the perfdata generated in the last step to gcov by doing:

$ create_gcov \
	--binary /usr/bin/openssl \
	--gcov openssl-nonpgo.gcov \
	--profile openssl-nonpgo.perfdata \
	--gcov_version 2 \
	-use_lbr false

create_gcov is verbose and will display several messages that may look like something is wrong. For example, the following output is actually from a successful run of the command:

[WARNING:[...]/perf_reader.cc:1322] Skipping 200 bytes of metadata: HEADER_CPU_TOPOLOGY
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_ID_INDEX
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_EVENT_UPDATE
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_CPU_MAP
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event UNKNOWN_EVENT_17
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event UNKNOWN_EVENT_18
[...] many more lines [...]

Nevertheless, you should inspect its output and check the $? shell variable to confirm that it finished successfully. Even if create_gcov exits with status 0, though, the generated openssl-nonpgo.gcov file might not be valid, so it is a good idea to run the dump_gcov tool on it and make sure that it actually contains profile information. For example (the function names and counts in your output will differ):

$ dump_gcov openssl-nonpgo.gcov
cpu_get_tb_cpu_state total:320843603 head:8557544
  2: 8442523
  7: 8442518
  7.1: 8442518
  7.4: 9357851
  8: 9357851
  10: 9357854
  19: 0
  21: 0
[...] many more lines [...]

If the command does not print anything, there is a problem with your gcov file.
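This check is easy to automate. The helper below is a sketch with an invented name; it treats a profile as usable only if the file is non-empty and the dump tool prints at least one non-blank line. The DUMP_GCOV variable is a hypothetical override added here so a different command can be substituted; it is not something autofdo itself reads.

```shell
# gcov_profile_looks_valid FILE: succeed if FILE is non-empty and the dump
# tool (dump_gcov unless DUMP_GCOV is set) prints something for it.
gcov_profile_looks_valid() {
    profile=$1
    dump_tool=${DUMP_GCOV:-dump_gcov}
    [ -s "$profile" ] || return 1
    "$dump_tool" "$profile" 2>/dev/null | grep -q .
}
```

For example, gcov_profile_looks_valid openssl-nonpgo.gcov || echo "profile looks empty" would flag a bad conversion before it reaches the compiler.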

If everything went well, we can now use the gcov file and feed it back to GCC when recompiling OpenSSL.

Rebuilding OpenSSL with PGO

We have everything we need to rebuild OpenSSL and make use of our profile data. The most important thing now is to set the CFLAGS environment variable properly so that GCC can find and use the profile we generated in the previous step.

How you perform this CFLAGS adjustment depends on how you built the software in the first place, so we won’t cover this part here. The resulting CFLAGS variable should have the following GCC option, though:

-fauto-profile=/path/to/openssl-nonpgo.gcov

Make sure to adjust the path to the openssl-nonpgo.gcov file.
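As one possible way to do this for a build that reads CFLAGS from the environment, the option can be appended with an absolute path. This sketch assumes openssl-nonpgo.gcov sits in the current directory. The commented-out line is an alternative for packages built through dpkg-buildflags, which may or may not apply to your build.

```shell
# Assumption: the profile is in the current directory.
profile="$PWD/openssl-nonpgo.gcov"

# Append to any existing CFLAGS instead of replacing them.
CFLAGS="${CFLAGS:+$CFLAGS }-fauto-profile=$profile"
export CFLAGS
echo "CFLAGS=$CFLAGS"

# For Debian-style builds that honor dpkg-buildflags, one alternative is:
# export DEB_CFLAGS_APPEND="-fauto-profile=$profile"
```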

Also note that enabling PGO will make GCC automatically enable several optimization passes that are usually disabled, even when using -O2. This can lead to new warnings/errors in the code, which might require either the code to be adjusted or the use of extra compilation flags to suppress these warnings.

After the software is built, you can install this new version and run the benchmark tests again to compare results.

Our results

At first sight it might appear that our results don’t have much to show for themselves, but they are actually very interesting and allow us to discuss some more things about PGO.

First, it will not work for every scenario, and there might be cases when it is not justified to use it, for example if you are dealing with a very generic program that doesn’t have a single representative workload. PGO is great when applied to specialized cases, but may negatively impact the performance of programs that can be used in a variety of ways.

It is crucial to be rigorous when profiling the program you want to optimize. Here are a few tips that might help you to obtain more reliable results:

  • Perform multiple runs of the same workload. This helps to eliminate possible outliers and decrease the contribution of external factors (like memory and cache times). In our example, we performed 5 consecutive runs without PGO and 5 consecutive runs with PGO. Performing multiple runs is useful for benchmarking the software as well as for profiling it.

  • Try to profile your software in a regular, production-like setup. For example, don’t try to artificially minimize your system’s background load because that will produce results that will likely not reflect workloads found in the real world. On the other hand, be careful about non-representative background load that might interfere with the measurements.

  • Consider whether you are dealing with code that has already been highly optimized. In fact, this is exactly the case with our example: OpenSSL is a ubiquitous cryptographic library that has received a lot of attention from many developers, and cryptographic libraries are expected to be extremely optimized. This is the major factor that explains why we have seen minimal performance gains in our experiment: the library is already very performant as it is.

  • We have also written a blog post detailing a PGO case study which yielded performance improvements on the order of 5% to 10%. It is a great example of how powerful this optimization technique can be when dealing with specialized workloads.
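When comparing the averaged results of the non-PGO and PGO runs, a relative figure is usually easier to read than raw throughput. The helper below is a small sketch with an invented name; the numbers in the usage note are placeholders, not measurements from our experiment.

```shell
# percent_change BASE NEW: print the change from BASE to NEW as a percentage.
percent_change() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        if (base == 0) exit 1
        printf "%.2f\n", (new - base) / base * 100
    }'
}
```

For example, percent_change 200 210 prints 5.00, and a negative result means the PGO build was slower for that test.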

Alternative scenario: profiling the program using other approaches

We do not always have well behaved programs that can be profiled like openssl speed. Sometimes, the program might take a long time to finish running the workload, which ends up generating huge perfdata files that can prove very challenging for autofdo to process. For example, we profiled QEMU internally and, when executing a workload that took around 3 hours to complete, the perfdata generated had a size of several gigabytes.

To reduce the size of the perfdata that is collected, you might want to experiment with perf record’s -F option, which specifies the sampling frequency. Using -F 99 (i.e., sampling at 99 hertz) is recommended by some perf experts because it avoids accidentally sampling the program in lockstep with some other periodic activity.

Another useful trick is to profile in batches instead of running perf record for the entire duration of your program’s workload. To do that, use the -p option to pass a PID to perf record while also specifying sleep as the program to be executed during the profiling. For example, this is the command line we would use to profile a program whose PID is 1234 for 2 minutes:

$ sudo perf record \
	-e br_inst_retired.near_taken \
	--branch-any \
	--all-cpus \
	-p 1234 \
	--output myprogram.perfdata \
	-- sleep 2m

The idea is that you would run the command above several times while the program being profiled is still running (ideally with an interval between the invocations, and taking care not to overwrite the perfdata). After obtaining all the perfdata files, you would then convert them to gcov and use profile_merger to merge all gcov files into one.
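The batching described above could be scripted along these lines. This is a sketch only: the function names and the numbered file naming scheme are invented, the Intel event from the earlier example is assumed, and converting and merging the resulting files is left out.

```shell
# batch_output_name PREFIX INDEX: e.g. "myprogram.003.perfdata", so that
# successive batches never overwrite each other.
batch_output_name() {
    printf '%s.%03d.perfdata\n' "$1" "$2"
}

# profile_in_batches PID BATCHES PAUSE: record BATCHES two-minute profiles
# of PID, sleeping PAUSE seconds between them and stopping early if the
# process goes away.
profile_in_batches() {
    pid=$1
    batches=$2
    pause=$3
    i=1
    while [ "$i" -le "$batches" ] && [ -d "/proc/$pid" ]; do
        sudo perf record \
            -e br_inst_retired.near_taken \
            --branch-any \
            --all-cpus \
            -p "$pid" \
            --output "$(batch_output_name myprogram "$i")" \
            -- sleep 2m
        i=$((i + 1))
        sleep "$pause"
    done
}
```

Calling profile_in_batches 1234 5 600 would collect up to five two-minute profiles of PID 1234, ten minutes apart.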

Conclusion

PGO is an interesting and complex topic that catches the attention of experienced software developers looking to extract every bit of performance gain. It can help optimize programs, especially those that perform specialized work. It is not a one-size-fits-all solution, though, and as with every complex technology, its use needs to be studied and justified, and its impact on the final application should be monitored and analyzed.