Cranelift, Part 3: Correctness in Register Allocation

This post is the last in a three-part series about Cranelift. In the first post, I covered overall context and the instruction-selection problem; in the second post, I took a deep dive into compiler performance via careful algorithmic design.

In this post, I want to dive into how we engineer for and work to ensure correctness, which is perhaps the most important aspect of any compiler project. A compiler is usually a complex beast: to obtain reasonable performance, one must perform quite complex analyses and carefully transform an arbitrary program in ways that preserve its meaning. It is likely that one will make mistakes and miss subtle corner cases, especially in the cracks and crevices between components. Despite all of that, correct code generation is vital because the consequences of miscompilation are potentially so severe: basically any guarantee (security-related or otherwise) that we make at a higher level of the system stack relies on the (quite reasonable!) assumption that the computer will execute the source code we have written faithfully. If the compiler translates our code to something else, then all bets are off.

There are ways that one can apply good engineering principles to reduce this risk. An extremely powerful technique derives from the insight that checking a result is usually easier than computing it, and if we randomly generate many inputs, run our compiler (or other program) on these inputs, and check its output, we can get to a statistical approximation of the claim "for all inputs, the compiler generates the correct output". The more random inputs we try, the stronger this statement becomes. This technique is known as fuzzing with a program-specific oracle, and I could write a lengthy ode to its uncanny power to find bugs (many others have, already).

In this post, I will cover how we worked to ensure correctness in our register allocator, regalloc.rs, by developing a symbolic checker that uses abstract interpretation to prove correctness for a specific register allocation result. By using this checker as a fuzzing oracle, and driving just the register allocator with a focused fuzzing target, we have been able to uncover some very interesting and subtle bugs, and achieve a fairly high confidence in the allocator's robustness.

What is Register Allocation?

Before we dive in, we need to cover a few basics. Most importantly: what is the register allocation problem, and what makes it hard?

In a typical programming language, a program can have an arbitrary number of variables or values in scope. This is a very useful abstraction: it is easiest to describe an algorithm when one does not have to worry about where to store the values.

For example, one could write the program:

void f() {
    int x0 = compute(0);
    int x1 = compute(1);
    // ...
    int x99 = compute(99);
    
    // ---
    
    consume(x0);
    consume(x1);
    // ...
    consume(x99);
}

At the midpoint of the program (the --- mark), there are 100 int-sized values that have been computed and are later used. When the compiler produces machine code for this function, where are those values stored?

For small functions with only a few values, it is easy to place every value in a CPU register. But most CPUs do not have 100 general-purpose registers for storing integers¹; and in general, most languages either do not place limits on the number of local variables or else impose limits that are much, much higher than the typical number of CPU registers. So we need some approach that scales beyond, say, about 16 values (x86-64) or about 32 values (aarch64) in use at once.

A very simple answer is to allocate a memory location for each local variable. In fact this is exactly what the C programming model provides: all of the xN variables above semantically live in memory, and we can take the address &xN. If one does this, one will find that the addresses are part of the stack. When the function is called, it allocates a new area on the stack called the stack frame and uses it to store local variables.

This is far from the best we can do, though! Consider what this means when we actually perform some operation on the locals. If we read two locals, perform an addition, and store the result in a third, like so:

x0 = x1 + x2;

then in machine code, because most CPUs do not have instructions that can read two in-memory values and write back a third in-memory result, we would need to emit something like the following:

ld r0, [address of x1]
ld r1, [address of x2]
add r0, r0, r1  // r0 := r0 + r1
st r0, [address of x0]

Compiling code in this way is very fast because we need to make almost no decisions: a variable reference always becomes a memory load, for example. This is how a "baseline JIT compiler" typically works, actually: for example, in the SpiderMonkey JS and Wasm JIT compiler, the baseline JIT tier -- which is meant to produce passable code very, very quickly -- actually keeps a stack of values in memory that correspond one-to-one to the JS bytecode or Wasm bytecode's value stack. (You can read the code here: it actually keeps a few of the most recent values, at the top of operand stack, in fixed registers and the rest in memory.)

Unfortunately, accessing memory multiple times for every operation is very slow. What's more, it is often the case that values are reused soon after being produced: for example, we might have

x0 = x1 + x2;
x3 = x0 * 2;

When we compute x3 using x0, do we reload x0's value from memory immediately after storing it? A smarter compiler should be able to remember that it had just computed the value, and should keep it in a register, avoiding the round-trip through memory altogether.

This is register allocation: it is assigning a value in the program to a register for storage. What makes register allocation interesting is that (as noted above) there are fewer CPU registers than the number of allowable program values, so we have to choose some subset of values to keep in registers. This is often constrained in certain ways: for example, an add instruction on RISC-like CPUs can only read from, and write to, registers, so a value's storage location must be a register immediately before it is used by a + operator. Fortunately, the location assignments can change over time, so that at different points in the machine code, a register can be assigned to hold different values. The job of the register allocator is to decide how to shuffle values between memory and registers, and between registers, so that at any given time the values that need to be in registers are so.

In our design, the register allocator will accept as input a type of almost-machine-code called "virtual-register code", or VCode. This has a sequence of machine instructions, but registers named in the instructions are virtual registers: the compiler can use as many of them as it needs. The register allocator will (i) rewrite the register references in the instructions to be actual machine register names, and (ii) insert instructions to shuffle data as needed. These instructions are called spills when they move a value from a register to memory; reloads when the move a value from memory back to a register; and moves when they move values between registers. The memory locations where values are stored when not in registers are called spill slots.

An example of the register-allocation problem is shown below on a program with four instructions:

Figure: Register allocation

This allocation is performed onto a machine with two registers (r0 and r1). On the left, the original program is written in an assembly-like form with virtual registers. On the right, the program has been modified to use only real registers.

Between each instruction, we have written a mapping from virtual registers to real registers. The register allocator's task is just ("just"!) to compute these mappings and then edit the instructions, taking their register references through these mappings.

Note that the program, at one point, has three live values, or values that still must be preserved because they will be used later: between the first and second instructions, all of v0, v1 and v2 are live. The machine has only two registers, so it cannot hold all live values in them; it must spill at least one. This is the reason for the spill instruction, written as a store to the stack slot [sp+0].

How Hard is Register Allocation?

In general, the register allocator will first analyze the program to work out which values are live at which program points. This liveness information and related constraints specify a combinatorial optimization problem: certain values must be stored somewhere at each point, constraints limit which choices can be made and some choices will conflict with some others (e.g., two values cannot occupy a register at the same time), and a set of choices implies some cost (in data movement). The allocator will solve this optimization problem as well as it can using heuristics of some sort, depending on the register allocator.

Is this a hard problem? In fact, it is not only hard in a colloquial sense, but NP-complete: this means that it is as hard as any other NP problem, for which we know only exponential-time brute-force algorithms in the worst case.² ³ The reason is that the problem does not have optimal substructure: it cannot be decomposed into non-interacting parts that can each be solved separately and then built up into an overall solution; rather, decisions at one point affect decisions elsewhere, potentially anywhere else in the function body. Thus, in the worst case, we can't do better than a brute-force search if we want an optimal solution.

There are many good approximations to optimal register allocation. A common one is linear-scan register allocation, which can run in almost-linear time (with respect to the code size). Allocators that can afford to spend more time are more complex: for example, in regalloc.rs, in addition to the linear-scan implementation (written by my brilliant colleague Benjamin Bouvier), we have a "backtracking" algorithm (written by my other brilliant colleague Julian Seward) that can edit and improve its choices as it discovers higher-priority uses for registers.

The details of how these algorithms work do not really matter here, except to say that they are very complicated and hard to get right. An algorithm that appears relatively simple at the conceptual level or in pseudocode quickly runs into interesting and subtle considerations as real-world constraints creep in. The regalloc.rs codebase is about 25K lines of deeply-algorithmic Rust code; any reasonable engineer would expect this to include at least several bugs! Compounding the urgency here, a register-allocation bug can result in arbitrary incorrect results, because the register allocator is in charge of "wiring up" all of the dataflow in the program. If we exchange one arbitrary value with another arbitrary value in the program, anything could happen.

How to Verify Correctness

So we want to write a correct register allocator. How do we even start on a task like this?

It might help to break down what we mean by "correct". Note that the register allocation problem has a nice property: the programs both before and after allocation have a well-defined semantics. In particular, we can think of register allocation as a transformation that converts programs running on an infinite-register machine (where we can use as many virtual registers as we want) to a finite-register machine (where the CPU has a fixed set of registers). If the original program on the infinite-register machine yields the same result as the transformed (register-allocated) program on the finite-register machine, then we have achieved a correct register allocation.

How do we test this equivalence?

Single-Program, Single-Input Equivalence

The simplest way to test whether two programs are equivalent is to run them and compare the results! Let's say we do this: for a single program, choose some random inputs, and run the virtual-registerized program alongside its register-allocated version on the appropriate interpreters. Compare register and memory state at the end.

What does it mean if the final machine states match? It means that for this one program, our register allocator produces a transformed program that is correct for this one program input. Note the two qualifications here. First, we have not necessarily shown that the register allocation is correct given another program input. Perhaps a different input causes a branch to go down another program path, and the register allocator introduced an error on that path. Second, we have not shown anything for any other program; we have only tested a single program and its register-allocated output.

We can attempt to address the first limitation -- correctness only under one input -- by taking more sample points. For example, we could choose a thousand random program inputs, and even drive this random choice with some sort of feedback that tries to maximize control-flow coverage or other "interesting" behavior (as fuzzers do). We could probably achieve reasonable confidence that this single register allocation result is correct, given enough test cases.

Figure: Checking a program with concrete inputs

However, this is still very expensive: we are asking to run the whole program N times to get a sample size of N. Even a single execution may be expensive: the program on which we have performed register allocation might be a compiler, or a videogame, for example.

Single-Program, For-all-Input Equivalence

Can we avoid the need to run the program at all to test that its register-allocated version is correct?

The answer is surprisingly simple: yes, we can, by simply altering the domain that the program executes on. Ordinarily we think of CPU registers as containing concrete numbers -- say, 64-bit values. What if they contained symbols instead?

By generalizing over program values with symbols, we can often represent the state of the system in terms of inputs without caring what those inputs are. For example, given the program:

ld v0, [A]
ld v1, [B]
ld v2, [C]
add v3, v0, v1
add v4, v2, v3
return v4

ld r0, [A]
ld r1, [B]
ld r2, [C]
add r0, r0, r1
add r0, r2, r0
return r0

without symbolic reasoning, we could store arbitrary integers to memory locations A, B and C and simulate the program's execution before and after register allocation, never seeing a mismatch, but this would not prove anything unless we iterated through all possible values. However, if we suppose that after the three loads, r0 contains v0 (as a symbolic value, whatever it is), r1 contains v1, and r2 contains v2, and that r0 contains v3 after the first add and v4 after the second add, we can see the correspondence by matching up the symbols.

Figure: Checking a program with symbolic values

This is a very simple example, and perhaps under-sells the insight and power of this approach; we will come back to it later when we talk about Abstract Interpretation below.

In any case, what we have shown is that for a single instance of the register-allocation problem, we can prove that it transformed the program in a correct way. Concretely, this means that the machine code that we generate will execute just as if we were interpreting the virtual-register code; if we can correctly generate virtual-register code, then our compiler is correct. That's excellent! Can we go further?

For-all-Programs Equivalence

We could prove a-priori that the register allocator will always transform any program in a way that is correct. In other words, we could abstract not only over the input values to the program, but over the program itself.

If we can prove this, then we have no need to run any sort of check at runtime. Abstracting over program inputs lets us avoid the need to run the program; we know the register allocation is correct for all inputs. In an analogous way, abstracting over the program to be register-allocated would let us avoid the need to run the register allocator; we know the register allocator is correct for all programs and for all inputs to those programs.

One can imagine that this is much harder. In fact, it has been done, but is a significant proof-engineering effort, and is a realm of active research: this basically requires writing a machine-verifiable proof that one's compiler algorithms are correct. Such proven-correct compilers exist: e.g., CompCert has been proven to compile C correctly to machine code for several platforms. Unfortunately, such efforts are strongly limited by the proof-engineering effort that is required, and thus this approach is unlikely to be feasible for a compiler unless it is their primary goal.

Our Approach: Allocator with Checker

Given all of the above, we choose what we believe is the most reasonable tradeoff: we build a symbolic checker for the output of the register allocator. This does not let us make a static claim that our register allocator is correct, but it does let us prove that it is correct for any given compiler run; and if we use this as a fuzzing oracle, we can build statistical confidence that it is correct for all compiler runs.

Checking the Register Allocator

Our overall flow is pictured below:

Figure: Augmenting register allocation with a checker

There are two ways in which we can add a register-allocator checker into the system. The first, on the left, we call "runtime checking": in this mode, every register allocator execution is checked and the machine code using the allocations is not permitted to execute (i.e. the compiler does not return a result) until the checker verifies equivalence. This is the safest mode: it provides the same guarantees as a proven-correct allocator ("for-all-programs equivalence" above). However, it imposes some overhead on every compilation, which may not be desirable. For this reason, while running the register allocator with the checker is a supported option in Cranelift, it is not the default.

The second mode is one in which we apply the checker to a fuzzing workflow, and is the approach we have generally preferred (we have a fuzz target in regalloc.rs that generates arbitrary input programs and runs the checker on each one; and we are running this continuously as part of Wasmtime's membership in Google's oss-fuzz continuous-fuzzing initiative). In this mode, we use the checker as an application-specific oracle for a fuzzing engine: as the fuzzing engine generates random programs (test cases), we run the register allocator over these programs, run the checker on the result, and tell the engine whether the register allocator passed or failed. The fuzzer will flag any failing test cases for a human developer to debug. If the fuzzer runs for a long time without finding any issues, we can then have more confidence that the register allocator is correct, even without running the checker; and the longer the fuzzer runs, the greater our confidence becomes. The application-specific oracle sigificantly improves over more generic fuzzer feedback mechanisms, such as program crashes or incorrect output: a register-allocator bug may not immediately manifest in incorrect execution, or when it does, the resulting crash may have no obvious connection to the actual mis-allocated register. The checker is able to point to a specific register use at a specific instruction and say "this register is wrong". Such a result makes for much smoother debugging!

Let's now walk through how we build the "checker" whose goal is to verify a particular register allocation is correct. We will come at the solution in stages, first reasoning about the easiest case -- straight-line code -- and then introducing control flow. At the end, we'll have a simple algorithm that runs in linear time (relative to code size) and whose simplicity allows us to be reasonably confident in its guarantees.

Symbolic Equivalence and Abstract Interpretation

Recall that we described above a sort of symbolic interpretation of execution: one can reason about CPU registers containing "symbolic" values, where each symbol represents a virtual register in the original code. For example, we can take the code

mov v0, 1
mov v1, 2
add v2, v0, v1
return v2

and a register-allocated form of that code

mov r0, 1
mov r1, 2
add r0, r0, r1
return r0

and somehow find a set of substitutions that makes them equivalent:

mov r0, 1
      [ r0 = v0 ]
mov r1, 2
      [ r1 = v1 ]
add r0, r0, r1
      [ r0 = v2 ]
return r0

But how do we solve for these substitutions? Recall that above we hinted at a form of execution that operates on symbols rather than values. We can simply take the semantics of the original instruction set, and reformulate it to operate on symbolic values instead, and then step through the code to find a representation of all executions at once. This is called symbolic execution, and with some enhancements described below, is the basis of abstract interpretation⁴. It is a very powerful technique!

What are the semantics of the instruction set that are relevant here? It turns out, because the register allocator does not modify any of the program's original instructions⁵, we can understand each instruction as mostly an arbitrary, opaque operator. The only important pieces of information are which registers it reads (before its operation) and which it writes (after its operation).⁶

It turns out that to verify the output of the register allocator when it spills values, and when it moves values between registers, we need to have special knowledge of spills, reloads, and moves. Hence, we can reduce the input program to a sort of minimal ISA that captures only what is important for symbolic reasoning (the real definition is here; we simplify a bit for this post):

Spill <spillslot>, <CPU register>: copy data (symbol representing virtual register) from a register to a spill slot.
Reload <CPU register>, <spillslot>: copy data from a spill slot to a register.
Move <CPU register>, <CPU register>: move data from one CPU register to another (N.B.: only regalloc-inserted moves are recognized as a Move, not moves in the original input program.)
Op read:<CPU register list>, read_orig:<virtual register list> write:<CPU register list> write_orig:<virtual register list>: some arbitrary operation that reads some registers and writes some other registers.

The last instruction is the most interesting: notice that it carries the original virtual registers as well as the post-register-allocation CPU registers for the instruction. The need for this will become clearer below, but the intuition is that we need to see both in order to establish the correspondence between the two.

We can produce the above instructions while the register allocator is scanning over the code and editing it; that part is a straightforward translation. Once we have the abstracted program, we can "execute" it over the domain of symbols. How do we do this? With the following rules:

We maintain some state, just as a real CPU does: for each CPU register, and for each location in the stack frame, we track a symbol (rather than an integer value). This symbol can be a virtual-register name, if we know that the storage location currently contains that register's value. It can also be Unknown, if the checker doesn't know, or Conflicted, if the value could be one of several virtual registers. (The difference between the latter two will become clear when we discuss control-flow below. For now it's enough to see that we abstract the state to: either we know the slot contains a program value, symbolically, or we know nothing.)
When we see a Spill, Reload, or Move, we copy the symbolic state from the source location (register or spill slot) to the destination location. In other words, we know that these instructions always move the integer value of a register or memory word, whatever it may be; so if we have knowledge about the source location, symbolically for all possible executions, then we can extend that knowledge to the destination as well.
When we see an Op, we do some checks then some updates:
- For each read (input) register, we examine the symbolic value stored in the given CPU register (post-allocation location). If that symbol matches the virtual register that the original instruction used, then the allocator has properly conveyed the virtual register's value to its use here, and thus the allocation is correct (preserves program dataflow). If not, we can signal a checker error, and look for the bug in our register allocator. We know for sure it must be a bug (i.e., there are no false positives), because we only track a symbol for a storage location when we have proven (for all executions!) that that storage must contain that virtual register.
- For each write (output) register, we set the symbolic value stored in the given CPU register to be the given (pre-allocation) virtual register. In other words, each write produces a symbol. This symbol then flows through the program, moving via spills/reloads/moves, until it reaches consumers.

And that's it! We can prove in a fairly straightforward way that this is exactly correct -- produces no false positives or false negatives -- for straight-line code (code with no jumps). We can do this by induction: if the symbolic state is correct before an instruction, then the above rules just encode the data movement that the concrete program performs, and the symbolic state will be updated in the same way, so the symbolic state after the instruction is also correct.

Note that this is linear as well -- so it's very fast, with a single scan over straight-line code. This is possible because we have help from the register allocator: we know about spills, reloads, and register allocator-inserted moves, and we have pre- and post-allocation registers for all other instructions. Consider what we would have to do if we did not know about these, but only saw machine instructions. In that case, any load, store or move instruction could have come from the allocator or from the original program. We would have nothing but a graph of operators with connectivity between them, and we would have to solve a graph isomorphism problem. That is much harder, and much slower!

So are we done? Not quite: we have only considered straight-line code. What happens when we encounter a jump?

Control-Flow Joins, Lattices, and Iterative Dataflow Analysis

Control-flow makes analysis interesting because it allows for multiple possibilities. Consider a simple program with an if-then-else pattern (a "control-flow diamond", as it is sometimes called, due to its shape):

Figure: A control-flow diamond and symbolic analysis

Let's say that a symbolic analysis decides that on the left branch, r0 has symbolic state A, and on the right branch, it has symbolic state B. What state does it have in the lower block, after the two paths re-join?

We can give a precise answer if we are allowed to "predicate", or make the answer conditional on some other program state. For example, if we knew that the if-condition were represented by some symbol C that has a boolean type, we could invent an abstract expression language and then write if C { A } else { B }.

However, this quickly becomes untenable. We will find that programs with loops lead to unbounded symbolic expressions. (To see this, consider that a symbolic representation can have a size larger than its inputs. Any cyclic data dependency around a loop could thus generate an infinitely-large symbolic representation.) Even with only acyclic control flow, path-sensitive symbolic expressions can grow exponentially with program size: consider that a program with N basic blocks and no loops can have O(2^N) paths through those blocks, and fully precise symbolic expressions would need to capture the effects of each of those paths.

We thus need some way to approximate. Note that an abstract interpretation of a program need not precisely capture all of the program's behavior losslessly. For example, we might perform a simple abstract interpretation analysis that only tracks possible numeric signs (positive, negative, unknown) for integer variables. So it is always fine to "summarize" and drop detail to remain tractable. Let us thus consider how we might "merge" state when multiple possibilities exist.

It turns out that there is a very nice mathematical object that captures the notion of "merging" in a way that is very useful: the lattice. A lattice consists of a set of elements and a partial order between them, together with a least element "bottom" and a greatest element "top", an operator called "meet" that finds the "greatest lower bound" for any two elements (the largest element that is less than its two operands) and a "join" that finds the "least upper bound" (the dual of the above).

Figure: A lattice

(Figure credit: Wikimedia, CC BY-SA 3.0)

An extremely useful property of lattices is that their merging operations of meet and join are commutative, associative and reflexive. This is a formal way of saying that the result only depends on the set of elements "thrown into the mix", in any order and with any repetition. In other words, the meet of many elements is a function only of the set of elements, not of the order in which we process them.

How is this useful? If we define particular analysis states -- and as a reminder, in our specific case, these are maps from CPU registers and spillslots to symbolic virtual registers -- to be lattice elements, and define a "meet function" that somehow merges the states -- then we can use this merging behavior to implement a sort of program analysis over all programs, including those with loops, without unbounded analysis growth! This is called the "meet-over-all-paths" solution and is a standard way that compilers perform dataflow analysis today.⁷

To understand how a lattice describes "merging" in a program analysis in a useful way, one can see the lattice ordering relation (the arrows in the figure above) as denoting that one state is more or less refined (contains more or less knowledge) than another. One starts at the "greatest" or "top" element: anything could be true; we know nothing. We then move to progressively more refined states. One analysis state is ordered "less than" another if it captures all the constraints we have learned in the other state, plus some new ones. The "meet" operator, which computes the greatest lower bound, will thus give us an analysis state that captures all of the knowledge in both inputs, and no more.⁸

The general approach to performing an analysis on an arbitrary CFG is as follows:

We define our analysis state as a lattice.
We trace the current analysis state at each program point, or point between instructions. Initially, the state at every program point is the "top" lattice element; as values meet, they move "down" the lattice, toward the "bottom" element.
We process the effect of each instruction, computing the state at its post-program-point from its pre-program-point.
When analysis state reaches a control-flow edge, we propagate the state across the edge, and meet it with the incoming state from all other edges. This may then lead us to recompute states in the destination block.
We run a "fixpoint" loop, processing updates as analysis states at block entries change, until no more changes occur.

In this way, we find a solution to the dataflow problem that satisfies all of the instruction semantics for any path through the program. It may not be fully precise (i.e., it may not answer every question) -- because it is often impossible to capture a fully precise answer for executions that include loops, and impractical for programs with significant control-flow -- but it is sound, in the sense that any claims we make from the analysis result will be correct.

A Register Checker as a Dataflow Analysis Problem

We now have all of the pieces that we need in order to check the register-allocator output for any program. We saw above that we could model the machine state symbolically for any straight-line code, which allows us to detect register allocator errors exactly (no false negatives and no false positives) as long as there is no control flow. We then discussed the usual static analysis approach to control flow. How can we combine the two?

The answer is that we define a lattice of symbolic register state, and then walk through the same per-instruction semantics as above in a fixpoint dataflow analysis. Put simply, for each storage location (CPU register or spill slot), we have a lattice:

Figure: Abstract value lattice

The "unknown" state is the "top" lattice value. This means simply that we don't know what is in the register because the analysis hasn't converged yet (or no write has occurred).

The "conflicted" state is the "bottom" lattice value. This means that two or more symbolic definitions have merged. Rather than try to represent a superposition of both with some sort of predication or loop summary, we simply give up and move to a state that indicates "bad value". This is not a checker error as long as it is never used, and it can be overwritten with a good value at any time; but if the value is used as an instruction source, then we flag an error.

The meet function, then, is very simple: two registers meet to "conflicted" unless they are the same register; "unknown" meets with anything to produce that anything; and "conflicted" is contagious, in the sense that meeting any other state with "conflicted" remains "conflicted".

Note that we said above that our analysis state is a map from registers and spill slots to symbolic states; not just a single symbolic state. So our lattice is actually a product of each individual storage location's state, and we meet symbols piecewise. (The resulting map contains entries only for keys that appear in all meet-inputs; i.e. we take the intersection of the domains.)

With the analysis state and its meet-function defined, we run a dataflow analysis loop, allow it to converge, and look for errors; and we're done!

And that's it!⁹

Effectiveness: Can it Find Bugs?

The short answer is that yes, it can find some pretty subtle bugs!

The benefit of the regalloc.rs checker is twofold:

It has found real bugs. In the above example, there was a conceptual error in the reference-types (precise GC rooting) support: in certain cases where a spillslot was allocated for a pointer-typed value but never used, it could be added to the stackmap (list of pointer-typed spillslots) provided to the GC. This bug needs a specific set of circumstances to happen: we have to have enough register pressure that we decide to allocate a spillslot for a virtual register, but then hit the (rare) code-path in which we don't actually need to do the spill because a register became available. We never hit this in our other, hand-written tests of GC (Wasm reference types), despite some pretty extensive tests at least in SpiderMonkey's WebAssembly test suite driving the Cranelift backend. The fuzzer was able to drive toward full coverage, hit this rare code-path, and then allow the checker to discover the error.
It serves as a gold-standard test while developing new register allocators. Feedback while developing the linear-scan allocator (whose reference-type / precise GC rooting support came a bit later than the backtracking allocator's) indicated that the checker found many real issues and allowed for faster and more confident progress.

It's surprisingly difficult to find prior work on checkers that validate individual runs of a register allocator. There are several fully-verified compilers in existence; CompCert and CakeML are two that can compile realistic languages (C and ML, respectively). These compilers have fully verified register allocators in the sense that the algorithm itself is proven correct; there is no need to run a checker on an individual compilation result. The engineering effort to achieve this is much higher than to write a checker, however (in the latter case, ~700 lines of Rust).

CakeML's approach to proving the register allocator correct is described by Tan et al. in "The Verified CakeML Compiler Backend" (J. Func Prog 29, 2019). They appear to have nicely factored the problem so that the compilation is correct as long as a valid graph coloring or "permutation" (mapping of program values to storage slots) is provided. This allows reasoning about the core issue (dataflow equivalence before and after allocation) separately from the details of the allocator (graph coloring algorithm).

Proof-producing compilers exist as well: for example, Crellvm is a recent extension of several LLVM passes that generates a (machine-checkable) correctness proof alongside the transformed program. This approach is conceptually at the same level as our register-allocator checker: it results in the validation of a single compiler run, but is much easier to build than a full a-priori correctness proof. This effort does not yet appear to address register allocation, however.

Rideau and Leroy in "Validating Register Allocation and Spilling" (CC 2010) describe a similar taxonomy to ours, separating "once and for all" correctness proofs from "translation validation checks" and providing the latter. Their validator, however, defines a fairly complex transfer function that builds a set of equality constraints that must be solved. It appears that the validator does not leverage hints from the allocator, specifically w.r.t. spills, reloads and inserted moves as distinguished from stores, loads and moves in the original program; without these hints, a much more general and complex dataflow-equivalence scheme is needed.

Nandivada et al. in "A Framework for End-to-End Verification andEvaluation of Register Allocators" (SAS 2007) describe a system very similar to our checker in which physical register contents (as virtual-register or "pseudo" symbols) are encoded into a post-regalloc IR that is then typechecked. Their typechecker can uncover the same sorts of regalloc errors that our checker can. Thus, their approach is largely equivalent to ours; the main difference is that we do not encode the problem as typechecking on a dedicated IR but rather a standalone static analysis.

Conclusion

This post concludes the three-post series (one, two) describing the work we've done to develop all the pieces of Cranelift's new backend over the past year! It has been a very interesting and educational ride for me personally; I discovered an entirely new world of interesting problems to solve in the compiler backend, as distinct from the "middle end" (IR-level optimizations) that is more commonly taught and studied. Additionally, the focus on fast compilation is an interesting twist, and one that I believe is not studied enough. It is easy to justify higher analysis precision and better generated code through ever-more-complex techniques; the benefit to be found in design tradeoffs for fast compilation is more subjective and more dependent on workload.

It is my hope that these writeups have illuminated some of the thinking that went into our design decisions. Our work is by no means done, however! The roadmap for Cranelift work in 2021 lists a number of ideas that we've discussed to achieve higher compiler performance and better code quality. I am excited to explore these more in the coming year; they may even result in more blog posts. Until then, happy compiling!

For discussions about this post, please feel free to join us on our Zulip instance in this thread.

Thanks to /u/po8 on Reddit for several suggestions which I have incorporated. Thanks also to bjorn3 for several suggestions. Finally, thanks to Fernando M Q Pereira for bringing my attention to his paper in SAS 2007 that proposes a very similar idea, which I've added to the related work section. Any and all feedback is welcome!

Why do CPUs have a limited number of registers? The bound is mostly due to ISA encoding limitations: there are only so many bits in an instruction to name a particular register source or destination. When the CPU designer chooses how many registers to define, providing more will improve performance (up to a point) because the CPU can hold more state at one time, but will also impose an increasing cost in code size and CPU complexity.

Computer architect's tangent: due to register renaming, a modern high-performance out-of-order CPU will have many more physical registers, with architectural register names mapped to physical registers at any given program point by the register-renaming hardware (in common parlance, the register allocation table or RAT), but the ISA encoding restrictions limit the number that have architectural names at any time. The existence of register renaming sometimes causes confusion in discussions of register allocation -- why rename onto so few registers when we have so many? -- well, we could do much better if we had more bits to refer to them all! Architectural standardization is another reason for this: we would not want to recompile code every time the PRF (physical register file) became larger. Simpler to say "x86-64 has 16 integer registers" and be done with it.¹⁰

We don't know if exponential time is the best we can do in the worst case, though most computer scientists suspect so. This is the famous P=NP problem, and if you can solve it, you win a million dollars.

A slight correction from /u/po8's comment: register allocation on structured programs can be done in polynomial time, i.e., better than an exponential brute-force search. However, the problem remains quite complex!

⁴

Abstract interpretation was introduced by Radhia and Patrick Cousot in their seminal 1977 POPL paper "Abstract interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints" (pdf).

⁵

Except for move elimination, but we can ignore that for now -- it is possible to adapt the abstract interpretation rules to account for it later.

⁶

In regalloc.rs we also have a notion of an instruction that "modifies" a register, which is like a combined read and write except that the value must be mapped to the same register for both. This isn't fundamental to the point we're illustrating so we'll skip over it for now.

⁷

This dataflow analysis approach was proposed by Gary Kildall in the POPL 1973 paper "A unified approach to global program optimization". (He is perhaps better-known for writing the microcomputer OS CP/M, a predecessor to DOS.) Kildall's Dataflow analysis builds on the control-flow graph ideas invented several years prior by Fran Allen; in her 1970 paper Control Flow Analysis, she proposes interval-based dataflow analysis, which is the other main approach known and used today.

⁸

Note that we have been somewhat vague here about directionality. What does "more constrained" or "more refined" mean? There are actually two directions an analysis may work, and these have to do with how it handles imprecision. A "may-analysis", or "widening analysis", computes what the program may do. It generally begins with an "empty set" of sorts -- a variable has no possible values, a statement has no side-effects, a register contains nothing -- and then uses a union-like meet operator to aggregate all possibilities. The real program behavior will be some subset of these possibilities. In contrast, a "must-analysis", or "narrowing analysis", computes only what we know the program must do. It generally begins with the "universe set" and then uses intersection-like meet operators. The real program's behavior is a superset of this analysis's description. We can't have both, usually, because an analysis cannot generally be fully precise.

By convention, we always start analysis values at "top" and use "meet" to move down the lattice as the analysis converges, though we could just as well start at "bottom" and move up with "join", since flipping the lattice's order relation and swapping meet and join produces another lattice.

⁹

Well, not quite, as you might have guessed. One significant detail I've omitted is how we handle reference types and precise garbage collection. Precise GC rooting entails tracking a specific kind of type information for each register and spillslot: specifically, whether each storage location contains a pointer that the GC should observe when it performs a garbage collection. It is important in many applications for this to be "precise", which means that we can only say that a register contains a pointer if it actually does, and we must include all registers that contain pointers. Precision is important because the GC will assume any root pointer it traces points to a valid object (so false positives are bad); and must know about every pointer in case it is a moving GC and relocates an object (so false negatives are bad).

In our particular variant of the problem, we need this information at safepoints: these are points at which the GC could be invoked. (It would be too expensive to plan for a GC invocation at every point in the program.) Furthermore, we needed to support GCs that could only trace pointers on the stack (hence, spillslots), not in registers. So we needed to induce additional spills around safepoints to ensure pointers were only live on the stack, not in registers.

To check this, we extended the abstract value lattice to note whether each virtual register is a pointer-typed value or not. Then, at every safepoint, we (i) ensure that every actual pointer-typed value in a spillslot is listed in the stackmap provided to the GC, and (ii) clear any other pointer-typed stack location not listed in the stack map to an Unknown state. Why the latter? Because an actual pointer-typed value in a stack slot might be "dead" (not used again), and so is legal to omit from the stackmap; instead of immediately flagging an error when one is excluded, we simply ensure that a later use of it is invalid.

¹⁰

Note that some computer architectures do task the compiler with some form of register renaming. For example, the Intel Itanium (IA-64) had a novel sort of "rotating register reference" feature for loops, and trusted the compiler with managing a full 128 integer and 128 floating-point registers. Modern GPUs also have thousands of "registers" managed by the compiler.