June 17, 2015 - Tilmann Scheller
A Conclusion to Accelerating Your Build with Clang
This is the second part of a series that explores improving the build process using Clang. This post will take a look at the remaining methods for speeding up the build and will conclude with an overall summary of the improved speeds. To read an introduction to this experiment as well as the build system we are using, take a look at part one of this series.
Here is our list of ideas again:
- Build with Clang rather than GCC
- Use a faster linker, such as GNU gold
- Optimize our host compiler binary aggressively (LTO, PGO, LTO+PGO)
- Move parts of the debug info into separate files: split DWARF
- Use Ninja rather than GNU make in CMake-based projects
- Cache object files with ccache
- Use a shared library build instead of a static build for incremental debug builds
- Optimize TableGen for debug builds
- Build less: e.g. build only the backends we are actually interested in rather than all of them
Note: The last five items will be covered in this post; the first four were covered in part one.
Let’s give the remaining ideas a try to see how each of them works out in practice.
CMake Ninja Generator
Since CMake is a meta build system, it can generate Makefiles for GNU make as well as build files for the Ninja build system. Ninja is heavily optimized for speed and shines particularly during incremental builds. However, as you can see in the following chart, it also helps to decrease the build time of clean builds:
The performance increase is small but noticeable; however, it is unclear why the debug build gets a stronger boost. Independent of performance, using Ninja for the build provides additional advantages, including some build options that only work in conjunction with Ninja.
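As a minimal sketch, switching generators comes down to passing `-G Ninja` when configuring. The source path below is a hypothetical placeholder, and the commands are printed rather than executed so the snippet is safe to paste; run the printed lines from an empty build directory.

```shell
# Hypothetical location of the LLVM checkout; adjust to your setup.
LLVM_SRC="${LLVM_SRC:-../llvm}"

# -G selects the CMake generator: "Ninja" emits build.ninja files,
# while "Unix Makefiles" emits Makefiles for GNU make.
CONFIGURE_CMD="cmake -G Ninja -DCMAKE_BUILD_TYPE=Release $LLVM_SRC"
BUILD_CMD="ninja"

# Print the commands instead of running them.
printf '%s\n' "$CONFIGURE_CMD" "$BUILD_CMD"
```

Ninja picks a parallel job count on its own, so no `-j` argument is needed for a first try.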
Caching Object Files with ccache
Another way to speed up the build is to cache object files. The idea is to avoid redundant work for the compiler: object files are stored in a cache, and whenever the same compiler invocation occurs again, the object file is copied from the cache rather than generated from scratch. On Linux, ccache is a very popular compiler cache. The following chart shows the results of a Clang build with GCC 4.9.2 as the host compiler, both with and without ccache (the ccache numbers are based on a fully populated cache, i.e. 100% cache hits during the build):
With ccache we can see a large speedup both for debug and release builds. The speedup for release builds is an order of magnitude higher as the object files in a release build are much smaller and there’s also significantly less time spent in the linker. The ccache debug build is effectively bottlenecked on I/O bandwidth and replacing the HDD in the machine with a faster SSD would help to decrease the build time even further.
Given the above results, it is generally recommended to have ccache enabled all the time. There’s one drawback though: ccache doesn’t handle -gsplit-dwarf yet, which is quite unfortunate. However, for people willing to build their own ccache binary, there is a patch adding the required functionality, which will hopefully land upstream soon.
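One way to put ccache in front of the compiler is sketched below. It assumes CMake 3.4 or newer, which provides the `CMAKE_<LANG>_COMPILER_LAUNCHER` variables; this is not necessarily how the measurements in this post were set up. A common alternative with older CMake versions is to configure with `CC="ccache gcc"` and `CXX="ccache g++"`. The source path is a hypothetical placeholder and the commands are only printed.

```shell
# Hypothetical location of the LLVM checkout; adjust to your setup.
LLVM_SRC="${LLVM_SRC:-../llvm}"

# The launcher variables prefix every C and C++ compile command with ccache.
# Assumption: CMake 3.4 or newer.
CCACHE_FLAGS="-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
CONFIGURE_CMD="cmake -G Ninja -DCMAKE_BUILD_TYPE=Release $CCACHE_FLAGS $LLVM_SRC"

# "ccache -z" resets the statistics counters and "ccache -s" prints them,
# so the hit rate of a build can be inspected afterwards.
printf '%s\n' "ccache -z" "$CONFIGURE_CMD" "ninja" "ccache -s"
```

The first build populates the cache; only a second clean build of the same sources should show hit rates close to the fully populated case described above.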
Incremental Debug Builds
Link time is a big issue for incremental debug builds, especially since LLVM defaults to static builds. Changing a single source file usually results in a large number of binaries being relinked. Fortunately, the build system also allows for shared library builds, which limit the amount of relinking to the shared library that contains the source file being edited.
The following chart shows the incremental build time after changing the modification date of a file in the ARM backend (ARMISelLowering.cpp). This is measured with a debug build using GCC 4.9.2 as a host compiler.
As you can see, a shared library build yields a big speedup for incremental changes. The positive effect of -gsplit-dwarf is also much more visible with incremental builds as a large portion of time is spent in the linker. We highly recommend Clang/LLVM developers always use shared debug builds rather than static builds.
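A rough sketch of how such an incremental measurement could be reproduced follows. It uses the standard `BUILD_SHARED_LIBS` CMake option to request shared libraries; the source path is a hypothetical placeholder, the location of `ARMISelLowering.cpp` inside the tree is an assumption about the checkout layout, and the commands are only printed.

```shell
# Hypothetical location of the LLVM checkout; adjust to your setup.
LLVM_SRC="${LLVM_SRC:-../llvm}"

# BUILD_SHARED_LIBS=ON builds the libraries as shared objects, so an edit
# only requires relinking the library that contains the changed file.
SHARED_CONFIGURE="cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DBUILD_SHARED_LIBS=ON $LLVM_SRC"

# Updating the modification date marks the file as changed without editing it.
# Assumed path of the file inside the checkout.
TOUCH_CMD="touch $LLVM_SRC/lib/Target/ARM/ARMISelLowering.cpp"

# Timing the second build measures only the incremental work.
printf '%s\n' "$SHARED_CONFIGURE" "ninja" "$TOUCH_CMD" "time ninja"
```

Running the same sequence without `-DBUILD_SHARED_LIBS=ON` gives the static baseline to compare against.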
Optimized TableGen for Debug Builds
TableGen is a domain-specific language that is used for various purposes within LLVM. In particular, the target-independent code generator relies heavily on TableGen. Specifically, the instruction formats of native backend instructions are specified with TableGen, in addition to the patterns that map target-independent instructions to native instructions. Among other things, these instruction descriptions and patterns are used to generate an instruction selector automatically. During a Clang/LLVM build, the TableGen binary gets invoked many times to automatically create all kinds of source files based on the particular TableGen backend and description.
The TableGen binary is built in the early stages of a Clang/LLVM build, and by default it’s built with the same compilation flags the rest of the sources use. This means when performing a debug build it takes longer to process all the TableGen descriptions because the TableGen binary is built at -O0. Luckily, for those who are not interested in debugging TableGen itself, but simply want it to execute as fast as possible, there’s the -DLLVM_OPTIMIZED_TABLEGEN=ON flag which ensures that TableGen is built at -O3 in a debug build.
The following chart shows the performance improvement when doing a debug build with an optimized TableGen binary (using a Clang host compiler compiled with Clang at -O3):
We don’t see a huge improvement, but overall there is still a nice 4% speed increase in combination with -gsplit-dwarf.
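For illustration, a debug configuration with an optimized TableGen might look like the sketch below. Only the `-DLLVM_OPTIMIZED_TABLEGEN=ON` flag comes from this post; the source path is a hypothetical placeholder and the command is printed rather than executed.

```shell
# Hypothetical location of the LLVM checkout; adjust to your setup.
LLVM_SRC="${LLVM_SRC:-../llvm}"

# The rest of the tree keeps its debug flags; only the TableGen binary
# is built with optimizations enabled.
TABLEGEN_FLAG="-DLLVM_OPTIMIZED_TABLEGEN=ON"
CONFIGURE_CMD="cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug $TABLEGEN_FLAG $LLVM_SRC"

printf '%s\n' "$CONFIGURE_CMD"
```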
Build Less
Another way to decrease the build time of Clang/LLVM is to build only the backends you actually care about. For example, in my daily development work I’m only interested in the ARM, AArch64 and x86 backends, so I usually don’t build any of the others. In the following chart you can see the impact of building only the host backend (building Clang with GCC 4.9.2 as the host compiler):
We get a nice 27% speedup for debug builds and a 25% speedup for release builds.
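The backend selection can be sketched with LLVM's `LLVM_TARGETS_TO_BUILD` CMake option, which takes a semicolon-separated list of target names. The list below mirrors the three backends mentioned above rather than the host-only configuration that was measured; the source path is a hypothetical placeholder and the command is only printed.

```shell
# Hypothetical location of the LLVM checkout; adjust to your setup.
LLVM_SRC="${LLVM_SRC:-../llvm}"

# Target names are case-sensitive and separated by semicolons.
TARGETS="ARM;AArch64;X86"

# The single quotes in the printed command keep the shell from treating
# the semicolons as command separators when the line is pasted.
CONFIGURE_CMD="cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug '-DLLVM_TARGETS_TO_BUILD=$TARGETS' $LLVM_SRC"

printf '%s\n' "$CONFIGURE_CMD"
```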
Why You Should Use Clang
The measurements send a very clear message: always build with Clang if you care about compile time. Simply moving to Clang can speed up a debug build by about 50%. Many Linux distributions ship Clang packages, so it’s quite easy to install. Moving to GNU gold is also highly recommended because it is very stable and seems to have very few issues. If you really want to squeeze the maximum performance out of your toolchain, we also recommend you build your own Clang host compiler with GCC in PGO mode. For Clang/LLVM developers, the single most important recommendation is to always use shared library builds for debug builds.
Let’s have a look at the overall speedup we’ve accomplished. We started out with GCC 4.9.2 and GNU ld as shipped with Fedora 21, and we ended up with the fastest configuration being a PGO-compiled Clang and GNU gold. This gives us an overall speedup of 58% for release builds and 109% for debug builds. That’s not bad at all, and it proves that you leave a lot of performance on the table if you just stick to the default build setup!
Obviously, not everyone is willing to go through the effort of creating a PGO build of Clang to use for their daily development work, but moving from GCC and GNU ld to Clang and GNU gold can be done in as little as a minute, assuming your distribution ships the respective packages. Fedora, Ubuntu, openSUSE, and Debian are among the distributions that support these tools, and they alone yield a respectable speedup.
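As a minimal sketch of that switch, the configuration below selects Clang as the host compiler and asks the compiler driver to link with gold via `-fuse-ld=gold`. It assumes `clang`, `clang++` and gold are already installed and on the `PATH`; the source path is a hypothetical placeholder and the command is only printed.

```shell
# Hypothetical location of the LLVM checkout; adjust to your setup.
LLVM_SRC="${LLVM_SRC:-../llvm}"

# Use Clang instead of the default GCC as the host compiler.
COMPILER_FLAGS="-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++"

# -fuse-ld=gold makes the compiler driver invoke gold instead of GNU ld,
# both for executables and for shared libraries.
LINKER_FLAGS="-DCMAKE_EXE_LINKER_FLAGS=-fuse-ld=gold -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=gold"

CONFIGURE_CMD="cmake -G Ninja -DCMAKE_BUILD_TYPE=Release $COMPILER_FLAGS $LINKER_FLAGS $LLVM_SRC"

printf '%s\n' "$CONFIGURE_CMD"
```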
In summary: Try out Clang today and enjoy the free speed boost to your build!