WARNING: WORK IN PROGRESS

Introduction

In this document, I will try to explain various optimization techniques that can be applied to software development to improve its performance, power efficiency, and security. These techniques include Profile-Guided Optimization (PGO), Link-Time Optimization (LTO), LLVM BOLT, Compiler Flags, and other optimization techniques such as dead code elimination, constant propagation, register allocation, loop unrolling, inlining, instruction scheduling and vectorization, and interprocedural optimization (IPO). I will explain the concepts, working mechanism, benefits, and limitations of each optimization technique, along with examples of how they can be applied in practice. By the end of this document, I hope that the readers will have a comprehensive understanding of the different optimization techniques available and be able to choose the appropriate optimization technique for their specific use case.

Link-Time Optimization (LTO)

Definition and concept

How LTO works

LLVM BOLT

Definition and concept

LLVM BOLT (Binary Optimization and Analysis Tool) is a tool for optimizing and analyzing binary code. It is built on top of the LLVM compiler infrastructure, and it works by analyzing the control flow graph (CFG) of the binary code and applying a variety of optimization techniques to improve performance.

How BOLT works

BOLT runs passes with either code transformations or analyses, similar to a compiler. BOLT is also equipped with a data-flow analysis framework to feed information to passes that need it. Some passes are architecture-independent while others are not. In this section, we discuss the passes applied to the Intel x86-64 target.

Pass NameDescription
1. strip-rep-retStrip repz from repz retq instructions used for legacy AMD processors
2. icfIdentical code folding
3. icpIndirect call promotion
4. peepholesSimple peephole optimizations
5. simplify-ro-loadsFetch constant data in .rodata whose address is known statically and mutate a load into a mov
6. icfIdentical code folding (second run)
7. pltRemove indirection from PLT calls
8. reorder-bbsReorder basic blocks and split hot/cold blocks into separate sections (layout optimization)
9. peepholesSimple peephole optimizations (second run)
10. uceEliminate unreachable basic blocks
11. fixup-branchesFix basic block terminator instructions to match the CFG and the current layout (redone by reorder-bbs)
12. reorder-functionsApply HFSort to reorder functions (layout optimization)
13. sctcSimplify conditional tail calls
14. frame-optsRemoves unnecessary caller-saved register spilling
15. shrink-wrappingMoves callee-saved register spills closer to where they are needed, if profiling data shows it is better to do so

Some parts of bellow information is rewritten and some are directly taken from the paper.

strip-rep-ret:

In favor of I-cache space, such as removing alignment NOPs and AMD-friendly REPZ bytes, or using shorter versions of instructions. Our findings show that, for large applications, it is better to aggressively reduce I-cache occupation, except if the change incurs D-cache overhead since the cache is one of the most constrained resources in the data-center space. This explains BOLT’s policy of discarding all NOPs after reading the input binary. Even though compiler-generated alignment NOPs are generally useful, the extra space required by them does not pay off and simply stripping them from the binary provides a small but measurable performance improvement.

icf:

BOLT features identical code folding (ICF) to complement the ICF optimization done by the linker. An additional benefit of doing ICF at the binary level is the ability to optimize functions that were compiled without the -ffunction-sections flag and functions that contain jump tables. As a result, BOLT is able to fold more identical functions than the linkers. We have measured the reduction of code size for the HHVM binary [19] to be about 3% on top of the linker’s ICF pass.

icp and plt:

Passes 3 and 7 leverage call frequency information to optimize function calls. Pass 3, indirect call promotion, mutates a function call into a more performant version, while pass 7, PLT call optimization, removes indirection from PLT calls. BOLT also has the ability to do function inlining, but it is a limited version compared to what compilers perform at higher levels. The inlining opportunities that BOLT can take advantage of typically come from more accurate profile data, BOLT’s own indirect-call promotion optimization, cross-module nature, or a combination of these factors.

simplify-ro-loads:

Simplification of load instructions, explores a tricky tradeoff by fetching data from statically known values (in read-only sections). In these cases, BOLT may convert loads into immediate-loading instructions, relieving pressure from the D-cache but possibly increasing pressure on the I-cache, since the data is now encoded in the instruction stream. BOLT’s policy, in this case, is to abort the promotion if the new instruction encoding is larger than the original load instruction, even if it means avoiding an arguably more computationally expensive load instruction. However, we found that such opportunities are not very frequent in our workloads.

reorder-bbs:

Pass 8, reorder and split hot/cold basic blocks, reorders basic blocks according to the most frequently executed paths, so the hottest successor will most likely be a fall-though, reducing taken branches and relieving pressure from the branch predic- tor unit

Profile-Guided Optimization (PGO)

I recommend checking out the following document written by my friend which can be found at https://misile00.github.io/notes/PGO. This document provides a detailed explanation of PGO and how it can be applied in practice.

Compiler Flags

Overview of compiler flags

Commonly used flags for optimization

LLVM POLLY

Definition and concept

How POLLY works

Other Optimization Techniques

Dead Code Elimination

Constant Propagation

Register Allocation

Loop Unrolling

Inlining

Instruction Scheduling

Vectorization

Interprocedural Optimization (IPO)