Profile-Guided Optimization¶
Profile-guided optimization (PGO) and feedback-driven optimization (FDO) are synonyms for the same concept. The idea is that a binary is compiled with built-in instrumentation which, when executed, generates a profile. This profile can then be used as input for the compiler in a subsequent build of the same binary, serving as a guide for further optimizations.
Profiling real-world applications can be hard. Ideally, the profile should be generated by a representative workload of the program, but it is not always possible to simulate one. Moreover, the built-in instrumentation itself affects the behavior of the binary and introduces a performance penalty.
In order to address these problems, we can use tools like perf to “observe” what the binary is doing externally, sampling it by monitoring events using the Linux kernel’s Performance Monitoring Unit (PMU). This makes the process more suitable for production environments. This technique works better than the regular built-in instrumentation, but it still has a few drawbacks that we will expand on later.
Caveats¶
The purpose of this guide is to provide some basic information about what PGO is and how it works. In order to do that, we will look at a simple example using OpenSSL (more specifically, the openssl speed command) and learn how to do basic profiling. We will not go into a deep dive on how to build the project, and it is assumed that the reader is comfortable with compilation, compiler flags and using the command line.

Despite being a relatively popular technique, PGO is not always the best approach to optimizing a program. The profiling data is extremely tied to the workload that generated it, which means that the optimized program might actually perform worse when other types of workloads are executed. There is no one-size-fits-all solution for this problem, and sometimes the best approach might be to not use PGO after all.
If you plan to follow along, we recommend setting up a test environment for this experiment. The ideal setup involves using a bare metal machine because it’s the most direct way to collect performance metrics. If you would like to use a virtual machine (created using QEMU/libvirt, LXD, Multipass, etc.), it will likely only work on Intel-based processors due to how the Virtual Performance Monitoring Unit (vPMU) works.
perf and AutoFDO¶
Using perf to monitor a process and obtain data about its runtime workload produces data files in a specific binary format that we will call perfdata. Unfortunately, GCC doesn’t understand this file format; instead, it expects a profile file in a format called gcov. To convert a perfdata file into a gcov one, we need to use a software called autofdo. This software expects the binary being profiled to obey certain constraints:
- The binary cannot be stripped of its debug symbols. autofdo does not support separate debug information files (i.e., it can’t work with Ubuntu’s .ddeb packages), and virtually all Ubuntu packages run strip during their build in order to generate the .ddeb packages.
- The debug information file(s) cannot be processed by dwz. This tool’s purpose is to compress the debug information generated when building a binary, and again, virtually all Ubuntu packages use it. For this reason, it is currently not possible to profile most Ubuntu packages without first rebuilding them to prevent dwz from running.
- We must be mindful of the options we pass to perf, particularly when it comes to recording branch prediction events. The options will likely vary depending on whether you are using an Intel or AMD processor, for example.
On top of that, the current autofdo version in Ubuntu (0.19-3build3, at the time of this writing) is not recent enough to process the perfdata files we will generate. There is a PPA with a newer version of the autofdo package for Ubuntu Noble. If you are running another version of Ubuntu and want to install a newer version of autofdo, you will need to build the software manually (please refer to the upstream repository for further instructions).
A simple PGO scenario: openssl speed¶
PGO makes more sense when your software is CPU-bound, i.e., when it performs CPU-intensive work rather than mostly waiting on I/O. Even if your software spends time waiting on I/O, using PGO might still be helpful; its effects will be less noticeable, though.
OpenSSL has a built-in benchmark command called openssl speed, which tests the performance of its cryptographic algorithms. At first sight this seems excellent for PGO because there is practically no I/O involved, and we are only constrained by how fast the CPU can run. It is possible, however, to encounter cases where the built-in benchmark has already been highly optimized and could cause issues; we will discuss more about this problem later. In the end, the real benefit will come after you get comfortable with our example and apply similar methods against your own software stack.
Running OpenSSL tests¶
In order to measure the performance impact of PGO, it is important to perform several tests using an OpenSSL binary without PGO and then with it. The reason for the high number of repeated tests is that we want to eliminate outliers from the final results. It’s also important to disable as much background load as possible on the machine where the tests are being performed, since we don’t want it to influence the results.
The first thing we have to do is perform several runs of openssl speed before enabling PGO, so that we have a baseline to compare with. As explained in the sections above, autofdo does not handle stripped binaries or dwz-processed debug information, so we need to make sure the resulting binary obeys these restrictions. See the section below for more details on that.
After confirming that everything looks OK, we are ready to start the benchmark. Let’s run the command:
$ openssl speed -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc
This will run benchmark tests for md5, sha512, rsa2048 and aes-256-cbc. Each test will last 60 seconds and will involve calculating as many cryptographic hashes as possible from N-byte chunks of data (with N varying from 16 to 16384) with each algorithm (with the exception of rsa2048, whose performance is measured in signatures/verifications per second). By the end, you should see a report like the following:
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha512 91713.74k 366632.96k 684349.30k 1060512.22k 1260359.00k 1273277.92k
aes-256-cbc 1266018.36k 1437870.33k 1442743.14k 1449933.84k 1453336.84k 1453484.99k
md5 140978.58k 367562.53k 714094.14k 955267.21k 1060969.81k 1068296.87k
sign verify sign/s verify/s
rsa 2048 bits 0.000173s 0.000012s 5776.8 86791.9
Building OpenSSL for profiling¶
Before we are able to profile OpenSSL, we need to make sure that we compile the software in a way that the generated program meets the requirements listed above. In our case, you can use the file command after building the binary to make sure that it has not been stripped:
$ file /usr/bin/openssl
/usr/bin/openssl: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=42a5ac797b6981bd45a9ece3f67646edded56bdb, for GNU/Linux 3.2.0, with debug_info, not stripped
Note how file reports the binary as being with debug_info, not stripped. This is the output you need to look for when inspecting your own binaries.
You can use the eu-readelf command (from the elfutils package) to determine whether any compression was performed with dwz:
$ eu-readelf -S /usr/bin/openssl
[...]
Section Headers:
[Nr] Name Type Addr Off Size ES Flags Lk Inf Al
[...]
[31] .debug_abbrev PROGBITS 0000000000000000 001bcbf3 000110c4 0 0 0 1
[...]
If the Flags field has C (as in compressed) in it, this means that dwz was used. You will then need to figure out how to disable its execution during the build; for this particular build, this was done by overriding the dh_dwz rule in debian/rules, setting it to an empty value, and executing the build again. The particularities of how to skip build steps depend on the build system being used, so this guide will not go into further details.
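For a Debian/Ubuntu package specifically, one way to do this is to give the dh_dwz step an empty recipe in debian/rules. This is a sketch following debhelper conventions; adapt it to your package’s rules file:

```make
# debian/rules excerpt: override debhelper's dh_dwz step with an empty
# recipe so the debug information is left uncompressed for autofdo.
override_dh_dwz:
```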
Using perf to obtain profile data¶
With OpenSSL prepared and ready to be profiled, it’s now time to use perf to obtain profiling data from our workload. First, let’s make sure we can actually access profiling data on the system. As root, run:
# echo -1 > /proc/sys/kernel/perf_event_paranoid
This should allow all users to monitor events in the system.
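You can confirm the setting took effect by reading the value back:

```shell
# Print the current restriction level; -1 (the value we wrote above)
# allows all users to monitor all events.
cat /proc/sys/kernel/perf_event_paranoid
```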
Then, if you are on an Intel system, you can invoke perf using:
$ sudo perf record \
-e br_inst_retired.near_taken \
--branch-any \
--all-cpus \
--output openssl-nonpgo.perfdata \
-- openssl speed -mr -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc
On an AMD system, use the following invocation:
$ sudo perf record \
-e 'cpu/event=0xc4,umask=0x0,name=ex_ret_brn_tkn/' \
--branch-any \
--all-cpus \
--output openssl-nonpgo.perfdata \
-- openssl speed -mr -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc
After the command has finished running, you should see a file named openssl-nonpgo.perfdata in your current directory.
Note that the only thing that differs between the Intel and AMD variants is the PMU event being monitored. Also, note how we are using sudo to invoke perf record. This is necessary in order to obtain full access to Linux kernel symbols and relocation information. It also means that the ownership of the openssl-nonpgo.perfdata file will need to be adjusted:
$ sudo chown ubuntu:ubuntu openssl-nonpgo.perfdata
Now, we need to convert the file to gcov.
Converting perfdata to gcov with autofdo¶
With autofdo installed, you can convert the perfdata file generated in the last step to gcov by doing:
$ create_gcov \
--binary /usr/bin/openssl \
--gcov openssl-nonpgo.gcov \
--profile openssl-nonpgo.perfdata \
--gcov_version 2 \
-use_lbr false
create_gcov is verbose and will display several messages that may look like something is wrong. For example, the following output is actually from a successful run of the command:
[WARNING:[...]/perf_reader.cc:1322] Skipping 200 bytes of metadata: HEADER_CPU_TOPOLOGY
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_ID_INDEX
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_EVENT_UPDATE
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_CPU_MAP
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event UNKNOWN_EVENT_17
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event UNKNOWN_EVENT_18
[...] many more lines [...]
Nevertheless, you should inspect its output and also check the $? shell variable to verify that the command finished successfully. However, even if create_gcov exits with status 0, the generated openssl-nonpgo.gcov file might not be valid, so it is a good idea to run the dump_gcov tool on it and confirm that it actually contains valid information. For example:
$ dump_gcov openssl-nonpgo.gcov
cpu_get_tb_cpu_state total:320843603 head:8557544
2: 8442523
7: 8442518
7.1: 8442518
7.4: 9357851
8: 9357851
10: 9357854
19: 0
21: 0
[...] many more lines [...]
If the command does not print anything, there is a problem with your gcov file.
If everything went well, we can now feed the gcov file back to GCC when recompiling OpenSSL.
Rebuilding OpenSSL with PGO¶
We have everything we need to rebuild OpenSSL and make use of our profile data. The most important thing now is to set the CFLAGS environment variable properly so that GCC can find and use the profile we generated in the previous step.
How you perform this CFLAGS adjustment depends on how you built the software in the first place, so we won’t cover this part here. The resulting CFLAGS variable should contain the following GCC option, though:
-fauto-profile=/path/to/openssl-nonpgo.gcov
Make sure to adjust the path to the openssl-nonpgo.gcov file.
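For a build system that reads CFLAGS from the environment, the adjustment could be sketched like this (the profile path is the placeholder from above; adjust it to the real location):

```shell
# Hypothetical sketch: append the PGO option to any existing CFLAGS
# before invoking the build, so GCC picks up the profile.
export CFLAGS="${CFLAGS:-} -fauto-profile=/path/to/openssl-nonpgo.gcov"
echo "$CFLAGS"
```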
Also note that enabling PGO will make GCC automatically enable several optimization passes that are usually disabled, even when using -O2. This can lead to new warnings/errors in the code, which might require either the code to be adjusted or the use of extra compilation flags to suppress these warnings.
After the software is built, you can install this new version and run the benchmark tests again to compare results.
Our results¶
At first sight it might appear that our results don’t have much to show for themselves, but they are actually very interesting and allow us to discuss some more things about PGO.
First, PGO will not work for every scenario, and there are cases where its use is not justified, such as a very generic program that doesn’t have a single representative workload. PGO is great when applied to specialized cases, but it may negatively impact the performance of programs that are used in a variety of ways.
It is crucial to be rigorous when profiling the program you want to optimize. Here are a few tips that might help you to obtain more reliable results:
Perform multiple runs of the same workload. This helps to eliminate possible outliers and decrease the contribution of external factors (like memory and cache effects). In our example, we performed 5 consecutive runs without PGO and 5 consecutive runs with PGO. Performing multiple runs is useful for benchmarking the software as well as for profiling it.
Try to profile your software in a regular, production-like setup. For example, don’t try to artificially minimize your system’s background load because that will produce results that will likely not reflect workloads found in the real world. On the other hand, be careful about non-representative background load that might interfere with the measurements.
Consider whether you are dealing with code that has already been highly optimized. In fact, this is exactly the case in our example: OpenSSL is a ubiquitous cryptographic library that has received a lot of attention from many developers, and cryptographic libraries are expected to be extremely optimized. This is the major factor explaining why we saw minimal performance gains in our experiment: the library is already pretty performant as it is.
We also have written a blog post detailing a PGO case study which rendered performance improvements in the order of 5% to 10%. It is a great example of how powerful this optimization technique can be when dealing with specialized workloads.
Alternative scenario: profiling the program using other approaches¶
We do not always have well-behaved programs that can be profiled like openssl speed. Sometimes the program might take a long time to finish running the workload, which ends up generating huge perfdata files that can prove very challenging for autofdo to process. For example, we profiled QEMU internally and, when executing a workload that took around 3 hours to complete, the perfdata generated had a size of several gigabytes.
In order to reduce the size of the perfdata that is collected, you might want to play with perf record’s -F option, which specifies the frequency at which the profiling will be done. Using -F 99 (i.e., sampling at 99 hertz) is recommended by some perf experts because it avoids accidentally sampling the program in lockstep with some other periodic activity.
Another useful trick is to profile in batches instead of running perf record for the entire duration of your program’s workload. To do that, use the -p option to specify a PID to perf record while also specifying sleep as the program to be executed during the profiling. For example, this is the command we would use if we were to profile a program whose PID is 1234 for 2 minutes:
$ sudo perf record \
-e br_inst_retired.near_taken \
--branch-any \
--all-cpus \
-p 1234 \
--output myprogram.perfdata \
-- sleep 2m
The idea is that you would run the command above several times while the program being profiled is still running (ideally with an interval between the invocations, and taking care not to overwrite previous perfdata files). After obtaining all the perfdata files, you would then convert each of them to gcov and use profile_merger to merge all the gcov files into one.
Conclusion¶
PGO is an interesting and complex topic that certainly catches the attention of experienced software developers looking to extract every bit of performance gain. It can certainly help optimize programs, especially those that perform specialized work. It is not a one-size-fits-all solution, though, and as with every complex technology, its use needs to be studied and justified, and its impact on the final application should be monitored and analyzed.