(perf-pgo)=
# Profile-Guided Optimization

[Profile-guided optimization](https://en.wikipedia.org/wiki/Profile-guided_optimization) (PGO) and feedback-driven optimization (FDO) are two names for the same concept. The idea is that a binary is compiled with built-in instrumentation which, when executed, generates a profile. This profile can then be used as input for the compiler in a subsequent build of the same binary, where it serves as a guide for further optimizations.

Profiling real-world applications can be hard. Ideally, the profile should be generated by a representative workload of the program, but it is not always possible to simulate a representative workload. Moreover, the built-in instrumentation impacts the overall performance of the binary, which introduces a performance penalty.

To address these problems, we can use tools like `perf` to "observe" what the binary is doing externally (sampling it by monitoring events with the Performance Monitoring Unit (PMU) through the Linux kernel), which makes the process more suitable for production environments. This technique works better than the regular built-in instrumentation, but it still has a few drawbacks that we will expand on later.

## Caveats

* The purpose of this guide is to provide some basic information about what PGO is and how it works. To do that, we will look at a simple example using OpenSSL (more specifically, the `openssl speed` command) and learn how to do basic profiling. We will not go into a deep dive on how to build the project, and it is assumed that the reader is comfortable with compilation, compiler flags and using the command line.

* Despite being a relatively popular technique, PGO is not always the best approach to optimize a program. The profiling data generated by the workload will be extremely tied to it, which means that the optimized program might actually perform worse when running other types of workloads. There is no one-size-fits-all solution for this problem, and sometimes the best approach might be to **not** use PGO after all.

* If you plan to follow along, we recommend setting up a test environment for this experiment. The ideal setup involves using a bare metal machine, because that is the most direct way to collect performance metrics. If you would like to use a virtual machine (created using QEMU/libvirt, LXD, Multipass, etc.), it will likely only work on Intel-based processors due to how the Virtual Performance Monitoring Unit (vPMU) works.

## `perf` and AutoFDO

Using `perf` to monitor a process and obtain data about its runtime workload produces data files in a specific binary format that we will call `perfdata`. Unfortunately, [GCC](https://gcc.gnu.org) doesn't understand this file format; instead, it expects a profile file in a format called `gcov`. To convert a `perfdata` file into a `gcov` one, we need to use a software called [`autofdo`](https://github.com/google/autofdo). This software expects the binary being profiled to obey certain constraints:

* The binary **cannot** be stripped of its debug symbols. `autofdo` does not support separate debug information files (i.e., it can't work with Ubuntu's `.ddeb` packages), and virtually all Ubuntu packages run `strip` during their build in order to generate the `.ddeb` packages.

* The debug information file(s) **cannot** be processed by `dwz`. This tool's purpose is to compress the debug information generated when building a binary, and again, virtually all Ubuntu packages use it. For this reason, it is currently not possible to profile most Ubuntu packages without first rebuilding them to prevent `dwz` from running.

* We must be mindful of the options we pass to `perf`, particularly when it comes to recording branch prediction events. The options will likely vary depending on whether you are using an Intel or AMD processor, for example.

On top of that, the current `autofdo` version in Ubuntu (`0.19-3build3`, at the time of this writing) is not recent enough to process the `perfdata` files we will generate. There is a PPA with a newer version of the `autofdo` package [for Ubuntu Noble](https://launchpad.net/~sergiodj/+archive/ubuntu/autofdo). If you are running another version of Ubuntu and want to install a newer version of `autofdo`, you will need to build the software manually (please refer to the [upstream repository](https://github.com/google/autofdo) for further instructions).

## A simple PGO scenario: `openssl speed`

PGO makes more sense when your software is CPU-bound, i.e., when it performs CPU-intensive work and is not mostly waiting on I/O, for example. Even if your software spends time waiting on I/O, using PGO might still be helpful; its effects would be less noticeable, though.

OpenSSL has a built-in benchmark command called `openssl speed`, which tests the performance of its cryptographic algorithms. At first sight this seems excellent for PGO because there is practically no I/O involved, and we are only constrained by how fast the CPU can run. It is possible, however, to encounter cases where the built-in benchmark has already been highly optimized and could cause issues; we will discuss more about this problem later. In the end, the real benefit will come after you get comfortable with our example and apply similar methods to your own software stack.

### Running OpenSSL tests

In order to measure the performance impact of PGO, it is important to perform several tests using an OpenSSL binary without PGO and then with it. The reason for the high number of repeated tests is that we want to eliminate outliers in the final results; a scripted example of such repeated runs is shown after the sample report below. It is also important to disable as much background load as possible on the machine where the tests are being performed, since we don't want it to influence the results.

The first thing we have to do is perform several runs of `openssl speed` *before* enabling PGO, so that we have a baseline to compare against. As explained in the sections above, `autofdo` handles neither stripped binaries nor `dwz`-processed debug information, so we need to make sure the resulting binary obeys these restrictions. See the section below for more details on that.

After confirming that everything looks OK, we are ready to start the benchmark. Let's run the command:

```bash
$ openssl speed -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc
```

This will run benchmark tests for `md5`, `sha512`, `rsa2048` and `aes-256-cbc`. Each test will last 60 seconds, and will involve calculating as many cryptographic hashes as possible from `N`-byte chunks of data (with `N` varying from 16 to 16384) with each algorithm (with the exception of `rsa2048`, whose performance is measured in signatures/verifications per second). By the end, you should see a report like the following:

```text
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
sha512           91713.74k    366632.96k   684349.30k   1060512.22k  1260359.00k  1273277.92k
aes-256-cbc      1266018.36k  1437870.33k  1442743.14k  1449933.84k  1453336.84k  1453484.99k
md5              140978.58k   367562.53k   714094.14k   955267.21k   1060969.81k  1068296.87k
                   sign      verify      sign/s  verify/s
rsa 2048 bits    0.000173s  0.000012s    5776.8   86791.9
```
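If you want to script the baseline runs, a minimal sketch could look like the following (the log file names and the count of five runs are our own illustrative choices):

```bash
# Save each baseline run to its own log file so the numbers can later be
# compared against the PGO-enabled runs.
for i in 1 2 3 4 5; do
    openssl speed -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc \
        > openssl-nonpgo-run-${i}.log 2>&1
done
```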
### Building OpenSSL for profiling

Before we are able to profile OpenSSL, we need to make sure that we compile the software in a way that the generated program meets the requirements described above. You can use the `file` command after building the binary to confirm that it has not been stripped:

```text
$ file /usr/bin/openssl
/usr/bin/openssl: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=42a5ac797b6981bd45a9ece3f67646edded56bdb, for GNU/Linux 3.2.0, with debug_info, not stripped
```

Note how `file` reports the binary as being `with debug_info, not stripped`. This is the output you need to look for when inspecting your own binaries.

You can use the `eu-readelf` command (from the `elfutils` package) to determine whether any compression was performed with `dwz`:

```text
$ eu-readelf -S /usr/bin/openssl
[...]
Section Headers:
[Nr] Name              Type      Addr             Off      Size     ES Flags Lk Inf Al
[...]
[31] .debug_abbrev     PROGBITS  0000000000000000 001bcbf3 000110c4  0        0  0  1
[...]
```

If the `Flags` field has a `C` (as in `compressed`) in it, this means that `dwz` was used. You will then need to figure out how to disable its execution during the build; for this particular build, this was done by overriding the `dh_dwz` rule in `debian/rules`, setting it to an empty value, and executing the build again (see the sketch below). The particularities of how to skip build steps depend on the build system being used, and as such this guide will not go into further details.
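As a sketch of the override mentioned above, assuming a debhelper-based package build, an empty `override_dh_dwz` target in `debian/rules` is enough to keep `dwz` from running:

```text
# debian/rules (fragment): an empty override target makes debhelper skip
# the dwz step, keeping the debug information uncompressed.
override_dh_dwz:
```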
### Using `perf` to obtain profile data

With OpenSSL prepared and ready to be profiled, it's now time to use `perf` to obtain profiling data from our workload.

First, let's make sure we can actually access profiling data in the system. As `root`, run:

```text
# echo -1 > /proc/sys/kernel/perf_event_paranoid
```

This should allow all users to monitor events in the system. Then, if you are on an Intel system, you can invoke `perf` using:

```bash
$ sudo perf record \
    -e br_inst_retired.near_taken \
    --branch-any \
    --all-cpus \
    --output openssl-nonpgo.perfdata \
    -- openssl speed -mr -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc
```

On an AMD system, use the following invocation:

```bash
$ sudo perf record \
    -e 'cpu/event=0xc4,umask=0x0,name=ex_ret_brn_tkn/' \
    --branch-any \
    --all-cpus \
    --output openssl-nonpgo.perfdata \
    -- openssl speed -mr -seconds 60 -evp md5 sha512 rsa2048 aes-256-cbc
```

After the command has finished running, you should see a file named `openssl-nonpgo.perfdata` in your current directory. Note that the only thing that differs between the Intel and AMD variants is the PMU event to be monitored.

Also, note how we are using `sudo` to invoke `perf record`. This is necessary in order to obtain full access to Linux kernel symbols and relocation information. It also means that the `openssl-nonpgo.perfdata` file ownership will need to be adjusted:

```bash
$ sudo chown ubuntu:ubuntu openssl-nonpgo.perfdata
```
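Before moving on, it can be worth a quick sanity check (an optional extra step, not strictly required) to confirm that the recording actually contains samples for the `openssl` binary:

```bash
# Optional: summarize the recorded samples; openssl symbols should appear
# near the top of the report if the recording worked as expected.
$ perf report --input openssl-nonpgo.perfdata --stdio | head -n 25
```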
Now, we need to convert the file to `gcov`.

### Converting `perfdata` to `gcov` with `autofdo`

With `autofdo` installed, you can convert the `perfdata` file generated in the last step to `gcov` by doing:

```bash
$ create_gcov \
    --binary /usr/bin/openssl \
    --gcov openssl-nonpgo.gcov \
    --profile openssl-nonpgo.perfdata \
    --gcov_version 2 \
    -use_lbr false
```

`create_gcov` is verbose and will display several messages that may look like something is wrong. For example, the following output is actually from a successful run of the command:

```text
[WARNING:[...]/perf_reader.cc:1322] Skipping 200 bytes of metadata: HEADER_CPU_TOPOLOGY
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_ID_INDEX
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_EVENT_UPDATE
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event PERF_RECORD_CPU_MAP
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event UNKNOWN_EVENT_17
[WARNING:[...]/perf_reader.cc:1069] Skipping unsupported event UNKNOWN_EVENT_18
[...] many more lines [...]
```

Nevertheless, you should inspect its output and check the `$?` shell variable to confirm that the command finished successfully. However, even if `create_gcov` finishes with exit status `0`, the generated `openssl-nonpgo.gcov` file might not be valid. It is also a good idea to run the `dump_gcov` tool on it and make sure that it actually contains valid information. For example:

```text
$ dump_gcov openssl-nonpgo.gcov
cpu_get_tb_cpu_state total:320843603 head:8557544
  2: 8442523
  7: 8442518
  7.1: 8442518
  7.4: 9357851
  8: 9357851
  10: 9357854
  19: 0
  21: 0
[...] many more lines [...]
```

If the command does not print anything, there is a problem with your `gcov` file. If everything went well, we can now feed the `gcov` file back to GCC when recompiling OpenSSL.

### Rebuilding OpenSSL with PGO

We have everything we need to rebuild OpenSSL and make use of our profile data. The most important thing now is to set the `CFLAGS` environment variable properly so that GCC can find and use the profile we generated in the previous step. How you perform this `CFLAGS` adjustment depends on how you built the software in the first place, so we won't cover this part here. The resulting `CFLAGS` variable should contain the following GCC option, though:

```text
-fauto-profile=/path/to/openssl-nonpgo.gcov
```

Make sure to adjust the path to the `openssl-nonpgo.gcov` file.
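As a minimal sketch, assuming a build that honours the `CFLAGS` environment variable (package builds will have their own mechanism for injecting compiler flags), this could look like:

```bash
# Illustrative: append the PGO flag to the usual optimization flags before
# re-running the build; the path must point to your generated gcov file.
export CFLAGS="-O2 -fauto-profile=/path/to/openssl-nonpgo.gcov"
```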
Also note that enabling PGO makes GCC automatically enable several optimization passes that are usually disabled, even when using `-O2`. This can lead to new warnings/errors in the code, which might require either adjusting the code or using extra compilation flags to suppress the warnings.

After the software is built, you can install this new version and run the benchmark tests again to compare results.

## Our results

At first sight it might appear that our results don't have much to show for themselves, but they are actually very interesting and allow us to discuss a few more things about PGO.

First, PGO will not work for every scenario, and there might be cases where its use is not justified: for example, if you are dealing with a very generic program that doesn't have a single representative workload. PGO is great when applied to specialized cases, but may negatively impact the performance of programs that can be used in a variety of ways.

It is crucial to be rigorous when profiling the program you want to optimize. Here are a few tips that might help you obtain more reliable results:

* Perform multiple runs of the same workload. This helps to eliminate possible outliers and decrease the contribution of external factors (like memory and cache effects). In our example, we performed 5 consecutive runs without PGO and 5 consecutive runs with PGO. Performing multiple runs is useful for benchmarking the software as well as for profiling it.

* Try to profile your software in a regular, production-like setup. For example, don't try to artificially minimize your system's background load, because that will produce results that likely won't reflect workloads found in the real world. On the other hand, be careful about non-representative background load that might interfere with the measurements.

* Consider whether you are dealing with code that has already been highly optimized. In fact, this is exactly the case with our example: OpenSSL is a ubiquitous cryptographic library that has received *a lot* of attention from several developers, and cryptographic libraries are expected to be extremely optimized. This is the major factor that explains why we saw minimal performance gains in our experiment: the library is already pretty performant as it is.

* We have also written a [blog post](https://ubuntu.com/blog/profile-guided-optimization-a-case-study) detailing a PGO case study which rendered performance improvements in the order of 5% to 10%. It is a great example of how powerful this optimization technique can be when dealing with specialized workloads.

## Alternative scenario: profiling the program using other approaches

We do not always have well-behaved programs that can be profiled like `openssl speed`. Sometimes, the program might take a long time to finish running the workload, which ends up generating huge `perfdata` files that can prove very challenging for `autofdo` to process. For example, we [profiled QEMU internally](https://ubuntu.com/blog/profile-guided-optimization-a-case-study) and, when executing a workload that took around 3 hours to complete, the `perfdata` generated was several gigabytes in size.

In order to reduce the size of the collected `perfdata`, you might want to play with `perf record`'s `-F` option, which specifies the frequency at which the profiling will be done. Using `-F 99` (i.e., profile at 99 hertz) is recommended by some `perf` experts because it avoids accidentally sampling the program in lockstep with some other periodic activity.

Another useful trick is to profile in batches instead of running `perf record` for the entire duration of your program's workload. To do that, use the `-p` option to pass a PID to `perf record` while specifying `sleep` as the program to be executed during the profiling. For example, this is the command we would use if we were to profile a program whose PID is `1234` for 2 minutes:

```bash
$ sudo perf record \
    -e br_inst_retired.near_taken \
    --branch-any \
    --all-cpus \
    -p 1234 \
    --output myprogram.perfdata -- sleep 2m
```

The idea is that you would run the command above several times while the program being profiled is still running (ideally with an interval between the invocations, and taking care not to overwrite the `perfdata` file). After obtaining all the `perfdata` files, you would then convert them to `gcov` and use `profile_merger` to merge all the `gcov` files into one.
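A rough sketch of that batching idea, reusing the Intel event from earlier (the four-batch count, the 10-minute interval and the file names are illustrative choices):

```bash
# Take four 2-minute samples of PID 1234, each written to its own perfdata
# file, pausing between invocations so the batches are spread out.
for i in 1 2 3 4; do
    sudo perf record \
        -e br_inst_retired.near_taken \
        --branch-any \
        --all-cpus \
        -p 1234 \
        --output myprogram-batch-${i}.perfdata -- sleep 2m
    sleep 10m
done
```

Each resulting `perfdata` file would then be converted with `create_gcov` before merging the results with `profile_merger`.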
## Conclusion

PGO is an interesting and complex topic that certainly catches the attention of experienced software developers looking to extract every bit of performance gain. It can help optimize programs, especially those that perform specialized work. It is not a one-size-fits-all solution, though, and as with every complex technology, its use needs to be studied and justified, and its impact on the final application should be monitored and analyzed.