This post is the fourth part of a three-part series1 describing work that I have been doing to improve the Cranelift compiler. In this post, I’ll describe the work that went into regalloc2, a new register allocator I developed over the past year. The allocator started as an effort to port IonMonkey’s register allocator to Rust and a standalone form usable by Cranelift (“how hard could it be?”), but quickly evolved during a focused optimization effort to have its own unique design and implementation aspects.

Register allocation is a classically hard (NP-hard!) problem, and a good solution is mainly a question of concocting a suitable combination of heuristics and engineering high-performance data structures and algorithms such that most cases are good enough with few enough exceptions. As I’ve found, this rabbithole goes infinitely deep and there is always more to improve, but for now we’re in a fairly good place.

We recently switched over to regalloc2, and Cranelift 0.84 and the concurrently released Wasmtime 0.37 use it by default. Some measurements show that it generally improves overall compiler speed (which was and is dominated by regalloc time) by 20%-ish, and generated code performance improves on register pressure-impacted benchmarks up to 10-20% in Wasmtime. In Cranelift's use as a backend for rustc via rustc_codegen_cranelift, runtime performance improved by up to 7%. The allocator seems to have generally fewer compile-time outliers than our previous allocator, which in many cases is a more important property than 10-20% improvements. Overall, it seems to be a reasonable performance win with few downsides. Of course, this work benefits hugely from the lessons learned in developing that prior allocator, regalloc.rs, which was work primarily done by Julian Seward and Benjamin Bouvier; I learned enormous amounts talking to them and watching their work on regalloc.rs in 2020, and this work stands on their shoulders as well as IonMonkey's.

This post will make a whirlwind tour through several topics. After reviewing the register allocation problem and why it is important, we will learn about regalloc2’s approach: its abstractions, its key features, and how its passes work. We’ll then spend a good amount of time on “lessons learned”: how we attained reasonable performance; how we managed to make anything work at all in reasonable development time; how we migrated a large existing compiler codebase to new foundational types and invariants; and some perspective on ongoing tuning and refinements.

A design document for the allocator exists as well, and this blogpost is meant to be complementary: we’ll walk through some interesting bits of the architecture, but anyone hoping to actually grok the thing in its entirety (and please talk to me if this is you!) is advised to dig into the design doc and the source for the full story.

Finally, a fair warning: this post has become a bit of a book chapter; if you’re looking for a tl;dr, you can skip to the Lessons section or the Conclusions.

Register Allocation: Recap

First, let’s recap what register allocation is and why it’s important.2

The basic problem at hand is to assign storage locations to program dataflow. We can imagine our compiler input as a graph of operators, each of which consumes some values and produces others (let’s ignore control flow for the moment):

Figure: A dataflow graph

Some compilers directly represent the program in this way (called a “sea of nodes” IR) but most, including Cranelift, linearize the operators into a particular program order. And in fact, by the time that the register allocator does its work, the “operators” are really machine instructions, or something very close to them, so we will call them that: we have a sequence of instructions, and program points before and after each one. Even in this new view, we still have the dataflow connectivity that we did above; now, each edge corresponds to a particular range of program points:

Figure: Dataflow graph with liveranges

We call each of these dataflow edges, representing a value that must flow from one instruction to another, a liverange. We say that virtual registers – the names we give the values before regalloc – have a set of liveranges.

With control flow, liveranges might be discontiguous from a linear-instruction-order point of view, because of jumps; for example:

Figure: Control flow with discontiguous liveranges

Each instruction requires its inputs to be in registers and produces its outputs in registers, usually.3 So, the job of the register allocator is to choose one of a finite set of machine registers to convey each of the liveranges from its definition(s) to all of its use(s).

Why is this hard? In brief, because we may not have enough registers. We thus enter a world of compromise: if more values are “alive” (need to be kept around for later use) than the number of registers that the CPU has, then we have to put some of them somewhere else, and bring them back into registers only when we actually need to use them. That “somewhere else” is usually memory in the function’s stack frame that the compiler reserves (a “stackslot”), and the process of saving values away to free up registers is called “spilling”.

One more concept before we go further: we may want to choose to place a liverange in different places throughout its lifetime, depending on the needs of the instruction sequence at certain points. For example, if a value is produced at the top of a function, then dormant (but live) for a while, and then used frequently in a tight loop at the bottom of the function, we don't really want to spill it and then reload it from memory on every loop iteration. In other words, given this program:

    v0 := ...
    
    v1 := ...    // lots of intermediate defs that use all regs
    v2 := ...
    ...
    vN := ...
    
    
loop:
    v100 := ...
    v101 := add v0, v100
    store v101, ...
    jmp loop

we ideally do not want to assign a stack slot to the value v0 and then produce machine code like

    add rax, rbx      ;; `v0` stored in `rax`
    mov [rsp+16], rax ;; spill `v0` to a stack slot
    ...
    
loop:
    ...
    mov rax, [rsp+16] ;; load `v0` from stack on every iteration -- expensive!
    add rcx, rax
    mov [...], rcx
    jmp loop

but if we only choose a location per liverange, we either choose a register, or a stackslot – no middle ground. Intuitively, it seems like we should be able to put the value in a different place while it is “dormant” (spill it to the stack, most likely), then pick an optimal location during the tight loop. To do so, we need to refer to the two parts of the liverange separately, and assign each one a separate location. This is called liverange splitting.

If we split liveranges, we can then do something like:

    add rax, rbx      ;; `v0` stored in `rax`
    mov [rsp+16], rax ;; spill `v0` to a stack slot
    ...
    
    mov rdx, [rsp+16] ;; move `v0` from stackslot to a new liverange in `rdx`
loop:
    ...
    add rcx, rdx      ;; no load within loop
    mov [...], rcx
    jmp loop

This seems quite powerful and useful – so why doesn’t every register allocator do this? In brief, because it makes the problem much much harder. When we have a fixed number of liveranges, we have a fixed amount of work, and we assign a register per liverange, probably in some priority order. And then we are done.

But as soon as we allow for splitting, we can increase the amount of work we have almost arbitrarily: we could split every liverange into many tiny pieces, greatly multiplying the cost of register allocation. A well-placed split reduces the constraints in the problem we're solving, making it easier, but too many splits just increase work and the likelihood that we will unnecessarily move values around.

Splitting is thus the kind of problem that requires finely-tuned heuristics. To be concrete, consider the example above: we showed a split outside of the tight inner loop. But a naive splitting implementation might just split before the use, putting a move from stack to register inside the inner loop. Some sort of cost model is necessary to put splits in “cheap” places.

With all of that, hopefully you have some feel for the problem: we compute liveranges, we might split them, and then we choose where to put them. That’s (almost) it – modulo many tiny details.

regalloc2’s Design

At a high level, regalloc2 is a backtracking register allocator that computes precise liveranges, merges them into "bundles" according to some heuristics, and then runs a main loop that assigns locations to bundles, sometimes splitting them to make the allocation problem easier (or possible at all). Once every part of every liverange has a location, it inserts move instructions to connect all of the pieces.

Let’s break that down a bit:

  • regalloc2 starts with precise liveranges. These are computed according to the input to the allocator, which is a program that refers to virtual registers and may be in SSA form (one definition per register) or non-SSA (multiple definitions per register).

  • It then merges these liveranges into larger-than-liverange “bundles”. If done correctly, this reduces work (fewer liverange bundles to process) and also gives better code (when merged, it is guaranteed that the related pieces will not need a move instruction to join them).

  • It then builds a priority queue of bundles, and processes them until every bundle has a location. (In simplified terms, regalloc2’s entire job is to “assign locations to bundles”.) This processing may involve undoing, or backtracking, earlier assignments, and may also involve splitting bundles into smaller bundles when separate pieces could attain better allocations.

We’ll explain each of these design aspects in turn below.

Input: Instructions with Operands

First, let’s talk about the input to the register allocator. To understand how regalloc2 works, we first need to understand how it sees the world. (Said another way, before we solve the problem, let’s define it fully!)

regalloc2 processes a “program” that consists of instructions that refer to virtual registers rather than real machine registers. These instructions are arranged in a control-flow graph of basic blocks.4

The most important principle regarding the allocator's view of the program is: the meaning of instructions is mostly irrelevant. Instead, the allocator cares mainly about how a particular instruction accesses program values as registers: which values and how (read or written), and with what constraints on location. Let's look more at the implications of this principle.

Constraints

The allocator views the input program as a sequence of instructions that use and define virtual registers. Every access to a register managed by the regalloc (an “allocatable register”) must be via a virtual register.

Already we see a divergence from common ISAs like x86: there are instructions that implicitly access certain registers. (One form of the x86 integer multiply instruction always places its output in rax and rdx, for example.) Since these registers are not mentioned by the instruction explicitly, one might initially think that there is no need to create regalloc operands or use virtual registers for them. But these registers (e.g. rax and rdx) can also be used by explicit inputs and outputs to instructions; so the regalloc at least needs to know that the registers will be clobbered, and at some later point presumably the results will be read and the registers become free again.

We solve this problem by allowing constraints on operands. An instruction that always reads or writes a specific physical register still names a virtual-register operand. The only difference from an ordinary instruction that can use any register is that this operand is constrained to a particular register. This lets the allocator uniformly reason about virtual registers allocated to physical registers as the basic way that space is reserved; the constraint becomes only a detail of the allocation process.

Let’s take an x86 instruction mul (integer multiply) as an example to see how this works. Ordinarily, one would write the following in assembly:

    ;; Multiplicand is implicitly in rax.
    mul rcx  ; multiply by rcx
    ;; 128-bit wide result is implicitly placed in rdx (high 64 bits)
    ;; and rax (low 64 bits).

The instruction mul rcx does not tell the whole story from regalloc2’s point of view, so we would instead present an instruction like so to the register allocator, with constraints annotating uses/definitions of virtual registers:

    ;; Put inputs in v0 and v1.
    mul v0 [use, fixed rax], v1 [use, any reg], v2 [def, fixed rax], v3 [def, fixed rdx]
    ;; Use results in v2 and v3.

The allocator will “do the right thing” by either inserting moves or generating inputs directly into, and using outputs directly from, the appropriate registers. The advantage of this scheme is that aside from the constraints, it makes mul behave like any other instruction: it isolates complexity in one place and presents a more uniform, easier-to-use abstraction for the rest of the compiler.

“Modify” Operands and Reused-Input Constraints

The next difference we might observe between a real ISA like x86 and a compiler’s view of the world is: an operator in a compiler IR usually produces its result as a completely new value, but real machine instructions often modify existing values.

For example, on x86, arithmetic operations are written in two-operand form. They look like add rax, rbx, which means: compute the sum of rax and rbx, and store the result in rax, overwriting that input value.

The register allocator reasons about segments of value dataflow from definitions (defs) to uses; but the use of rax in this example seems to be neither. Or rather, it is both: it consumes a value in rax, and it produces a value in rax.

But we can't decompose it into a separate use and def either, because then the allocator might choose different locations for each. The encoding of add rax, rbx only has slots for two register names: the first input and the output necessarily share the same register!

We solve this by introducing a new kind of constraint: the “reused input register” constraint.5 At the regalloc level, we say that the add above has three operands: two inputs (uses) and an output (a def). It exactly corresponds to the compiler IR-level operator, with nicely separated values in different virtual registers. But, we constrain the output by indicating that it must be placed in the same register as the input. We can assert that this is the case when we get final assignments from the regalloc, then emit that register number into the “first source and also destination” slot of the instruction.6

So, instead of add rax, rbx (or add v0, v1 with v0 a “modify” operand), we can present the following 3-operand instruction to the register allocator:

    add v0 [use, any reg], v1 [use, any reg], v2 [def, reuse-input(0)]

This corresponds more closely to what the compiler IR describes, namely a new value for the result of the add and non-destructive uses of both operands. All of the complexity of saving the destructive source if needed is pushed to the allocator itself.
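
To make the operand model concrete, here is a rough sketch in Rust of how such constrained operands might be described. The types and names here are simplified stand-ins for illustration only, not regalloc2's actual API; it covers both the mul example above and the reused-input add:

    // Hypothetical, simplified operand model for illustration only.
    #[derive(Clone, Copy, Debug)]
    enum Kind { Use, Def }

    #[derive(Clone, Copy, Debug)]
    enum Constraint {
        AnyReg,            // any register of the appropriate class
        FixedReg(u8),      // a specific physical register number
        ReuseInput(usize), // must share the register chosen for operand `i`
    }

    #[derive(Clone, Copy, Debug)]
    struct Operand {
        vreg: u32,
        kind: Kind,
        constraint: Constraint,
    }

    fn main() {
        const RAX: u8 = 0;
        const RDX: u8 = 2;

        // x86 `mul`: multiplicand implicitly in rax, multiplier anywhere,
        // low half of the result in rax, high half in rdx.
        let mul = [
            Operand { vreg: 0, kind: Kind::Use, constraint: Constraint::FixedReg(RAX) },
            Operand { vreg: 1, kind: Kind::Use, constraint: Constraint::AnyReg },
            Operand { vreg: 2, kind: Kind::Def, constraint: Constraint::FixedReg(RAX) },
            Operand { vreg: 3, kind: Kind::Def, constraint: Constraint::FixedReg(RDX) },
        ];

        // Two-operand x86 `add`, presented as a three-operand instruction
        // whose output must be placed in the same register as input 0.
        let add = [
            Operand { vreg: 0, kind: Kind::Use, constraint: Constraint::AnyReg },
            Operand { vreg: 1, kind: Kind::Use, constraint: Constraint::AnyReg },
            Operand { vreg: 2, kind: Kind::Def, constraint: Constraint::ReuseInput(0) },
        ];

        println!("{:?}\n{:?}", mul, add);
    }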

Program Points, “Early” and “Late” Operands

Finally, we need to go a bit deeper on what exactly it means to allocate a register “at” an instruction. To see why there may be some subtlety, let’s consider an example. Take the instruction movzx on x86: this instruction does a 16-to-64-bit zero-extend, with one input and one output. In pre-regalloc form with virtual registers, we could write movzx v1, v0, reading an input in v0 and putting the output in v1.

An intuitive understanding of liveranges and the allocation problem might lead us to reason: both v0 and v1 are “live” at this instruction, so they overlap, and have to be placed in different registers.

                          v0    v1
                           :
    v1 := movzx v0         |     |
                                 |
                                 :

But an experienced assembly programmer, knowing that v0 is not used again after this instruction, might reuse its register for the output. So for example, if it were initially in r13, one might write movzx r13, r13w (r13w is x86's archaic way of writing "the 16-bit version of r13").

But isn’t this an invalid assignment, because we have put two liveranges in the same register r13 when they are both “live” at this particular instruction?

It turns out that this will work fine, for a subtle reason: generally instructions read all of their inputs, then write all of their outputs. In other words, there is a sort of two-phase semantics to most instructions. So we could say that the input v0 is live up to, and including, the “read” or “early” phase of this instruction, and the output v1 is live starting at the “write” or “late” phase of this instruction. These two liveranges don’t conflict at all! So the above figure showing liveranges overlapping becomes:

                          v0    v1
                           :
                   EARLY   |
    v1 := movzx v0               
                   LATE         |
                                :

regalloc2 (along with most other register allocators) thus has a notion of “when” an operand occurs – the “operand position” – and it calls these two points in an instruction Early and Late. Along with this, throughout the allocator, we name program points (distinct points at which allocations are made) as Before or After a given instruction.

One final bit of subtlety: when a single instruction from a regalloc point of view actually emits multiple instructions at the machine level, sometimes the usual phasing of reads and writes breaks down. For example, maybe a pseudoinstruction becomes a sequence that starts to write outputs before it has read all of its inputs. In such a case, reusing one of the inputs (which is no longer live at Late) as an output register could be catastrophic. For this reason, regalloc2 decouples an operand’s position from its kind (use or def). One could have an “early def” or a “late use”. Temporary registers are also possible: these are live during both early and late points on an instruction so they do not conflict with any input or output, and can be used in sequences emitted from one pseudoinstruction.
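
One convenient way to represent these per-instruction points, sketched here with a hypothetical encoding rather than regalloc2's exact one, is to give each instruction two consecutive program points, Before and After, packed into a single integer so that all points are totally ordered and cheap to compare:

    // Sketch: each instruction index gets two program points, Before and
    // After, packed as (inst << 1) | bit so that points compare in order.
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
    struct ProgPoint(u32);

    impl ProgPoint {
        fn before(inst: u32) -> Self { ProgPoint(inst << 1) }
        fn after(inst: u32) -> Self { ProgPoint((inst << 1) | 1) }
        fn inst(self) -> u32 { self.0 >> 1 }
    }

    fn main() {
        let i = 10;
        // An input that dies at this instruction is live up to its Before
        // point; an output defined here is live from its After point. The
        // two spans do not overlap, so they may share a register.
        assert!(ProgPoint::before(i) < ProgPoint::after(i));
        assert!(ProgPoint::after(i) < ProgPoint::before(i + 1));
        assert_eq!(ProgPoint::after(i).inst(), i);
    }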

regalloc2’s View of Operands

To summarize, each instruction can have a sequence of “operands”, each of which:

  • Names a virtual register that corresponds to a value in the original program;
  • Indicates whether this value is read ("used") or written ("defined");
  • Indicates when during the instruction execution the value is accessed (“early”, before the instruction executes; or “late”, after it does);
  • Indicates where the value should be placed: in a machine register of a certain kind, or a specific machine register, or in the same register that another operand took, or in a slot in the stack frame.

Step 1: Live Ranges

We’ve described what the register allocator expects as its input. Now let’s talk about how the input is processed into an “allocation problem” that can be solved by the main algorithm.

The input is described as a graph of blocks of instructions, with operands; but most of the allocator works in terms of liveranges and bundles of liveranges instead.

In brief, a liverange (originally “live range”, but we say it so often it has become one word!) is a span of program points – that is, a range of “before” and “after” points on instructions – that connects a program value in a virtual register from a definition to one or more uses. A liverange represents one unit of needed storage, either as a register or a slot in the stackframe.

The basic strategy of regalloc2 is to reduce the input into liveranges as soon as possible, and then operate mostly on liveranges, translating back to program terms (inserted moves and assigned registers per instruction) only at the very end of the process. This lets us reason about a simpler “core problem” that is actually quite concisely specified:

  • A liverange is a span of program points, which can be numbered consecutively;
  • A liverange has constraints at certain points that arise from program uses/defs;
  • We must assign locations to liveranges, such that:
    • At any point, at most one liverange lives in a given location;
    • At all points, a liverange’s constraints are satisfied;
  • We are allowed to split a liverange into pieces and assign different locations to each piece. However, moves between pieces have a cost, and we must minimize cost.

And that’s it! No need to reason about ISA specifics, or the way that regalloc2 generates moves, or anything else. We’ll worry about generating moves to “reify” (make real) the assignments later. For now, we just need to slot the liveranges into locations and avoid conflicts.
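
In code, the vocabulary of this core problem is small enough to sketch in a few lines (illustrative types only; the real structures also carry constraints, use lists, and plenty of other bookkeeping):

    // Illustrative-only types for the core problem: each liverange is a
    // span of program points that must receive a location, and two
    // liveranges may share a location only if their spans do not overlap.
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
    struct ProgPoint(u32);

    #[derive(Clone, Copy, Debug)]
    enum Allocation {
        Reg(u8),        // a physical register
        StackSlot(u32), // a slot in the stack frame
    }

    #[derive(Clone, Debug)]
    struct LiveRange {
        from: ProgPoint, // inclusive start
        to: ProgPoint,   // exclusive end
        assignment: Option<Allocation>,
    }

    impl LiveRange {
        fn overlaps(&self, other: &LiveRange) -> bool {
            self.from < other.to && other.from < self.to
        }
    }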

Computing Liveness

To build up our set of liveranges, we first need to compute liveness. This is a property of any particular virtual register at a program point indicating that it has a value that will eventually be used.

Liveness analysis is an iterative dataflow analysis that is computed in the backward direction: any use of a virtual register propagates liveness backward (“upward” in the program), and a definition of that virtual register’s value ends the liveness (when scanning upward), because the old value (from above) is no longer relevant.

Thus the first thing that regalloc2 does with the input program is to run a worklist algorithm to compute precise liveness. This produces a bitset7 that, at each basic block entry and exit, gives us the set of live virtual registers.

Once we know which registers are live into and out of each basic block, we can perform block-local processing to compute actual liveranges with each use of the register properly noted. This is another backward scan, but this time we build the data structures we'll use for the rest of the allocation process.
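
For the curious, a minimal sketch of such an iterative backward liveness analysis looks something like the following (plain Vec<bool> sets and a simple worklist here; the real implementation uses compact bitsets and visits blocks in a more careful order):

    use std::collections::VecDeque;

    // A minimal sketch of iterative backward liveness over a CFG with
    // block-level use/def summaries.
    struct Block {
        succs: Vec<usize>,
        preds: Vec<usize>,
        uses: Vec<u32>, // vregs read in this block before any redefinition
        defs: Vec<u32>, // vregs written in this block
    }

    fn liveness(blocks: &[Block], num_vregs: usize) -> Vec<Vec<bool>> {
        // live_in[b][v] == true iff vreg v is live at the entry of block b.
        let mut live_in = vec![vec![false; num_vregs]; blocks.len()];
        let mut worklist: VecDeque<usize> = (0..blocks.len()).collect();
        while let Some(b) = worklist.pop_front() {
            // Live-out is the union of successors' live-ins.
            let mut live = vec![false; num_vregs];
            for &s in &blocks[b].succs {
                for v in 0..num_vregs {
                    live[v] |= live_in[s][v];
                }
            }
            // Transfer function: live_in = uses U (live_out - defs).
            for &d in &blocks[b].defs {
                live[d as usize] = false;
            }
            for &u in &blocks[b].uses {
                live[u as usize] = true;
            }
            if live != live_in[b] {
                live_in[b] = live;
                // Changed: predecessors may need to be recomputed.
                for &p in &blocks[b].preds {
                    worklist.push_back(p);
                }
            }
        }
        live_in
    }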

Normalization, and Saving Fixups for Later

We mentioned above that one way to see the liverange-building step is as a simplification of the problem to its core essence, in order to more easily solve it. “Ranges that may overlap” is certainly simpler than “instructions that access registers with certain semantics”. However, even the constraints on the liveranges can be made simpler in several ways.

A good example of a complex set of constraints is the following:

    inst v0 [use, fixed r0], v0 [use, fixed r1], v1 [def, any reg]

This is an instruction that has two inputs, and takes the inputs in fixed physical registers r0 and r1. This is completely reasonable and such instructions exist in real ISAs (see, e.g., x86’s integer divide instruction, with inputs in rdx and rax, or a call with an ABI that passes arguments in fixed registers). If the two inputs happen to be given the same program value, here virtual register v0, then we have created an impossible constraint: we require v0 to be in both r0 and r1 at the same time.

As we have formulated the problem, a liverange is in only one place at a time; and in fact this is a very useful simplifying invariant, and a simpler model than “there are N copies of the virtual register at once” (which one(s) are up-to-date, if we allow multiple defs?).

We can “simplify to a previously solved problem” in this case with a neat trick: we keep a side-list of “fixup moves” to add back in, after we complete allocation, and we insert such a fixup move from r0 to r1 just before this instruction. Then we delete the constraint on the second operand that uses v0. The rest of the allocation will proceed as if v0 were only required in r0; it will end up in that location; and the fixup move will copy it to r1 as well.

We perform a similar rewrite for reused-input constraints. These seem as if they would be fairly fundamental to the core allocation loop, because they tie one decision to another; now we have to deal with dependent allocation decisions. But we can do a simpler thing: we edit the liveranges so that (i) the output that reuses the input has a liverange that starts at the early (input) phase, and (ii) the input has a liverange that ends just before the instruction, not overlapping. (In other words, we shift both back by one program point.) Then we insert a fixup move from input to output. The figure below illustrates this rewrite.


      INITIAL
                         v0   v1    v2
                         :    :
                         |    |
                  EARLY use  use
      add v2, v0, v1
                  LATE             def reuse(0)
                                     |
                                     :
                                     
      REWRITTEN
                         v0   v1    v2
                         :    :
                         |    |
                         |    |
                  LATE   |    |
                              |    (implicit copy: v2 := v0)
                  EARLY      use   def
      add v2, v0, v1                 |
                  LATE               |
                                     |
                                     :

One may object that this pessimizes all reused-input allocations – haven’t we removed all knowledge of the constraint, so we will almost always get different registers at input and output, and cause many new moves to be inserted? The answer to this issue comes in the bundle merging, which we discuss below (basically, we try to rejoin the two parts if no overlap would result).

In general, this is a powerful technique: whenever some complexity arises from a constraint or feature, it is best if the complexity can be kept as close to the outer boundary of the system as possible. Rewrites or lowerings into a simpler “core form” are common in compilers, and it so happens that considering regalloc constraints in this light is useful too.8

Step 2: Bundles and Merging

Once we have created a list of liveranges with constraints, we could in theory begin to assign locations right away, finding available locations that fulfill constraints and splitting where necessary to do so. However, such an approach would almost certainly run more slowly, and produce worse code, than most state-of-the-art allocators today. Why is that?

A key observation about liveranges in real programs is that there are clusters of related liveranges connected by moves. Several examples are the liveranges on either side of an SSA block parameter (or phi-node), or on either side of a move instruction, or the input and reused-register-constrained output of an instruction.9 These liveranges often would benefit if they were in the same register: in all three cases, it would mean one fewer move instruction in the final program.

Processing such related liveranges together, as one unit of allocation, would guarantee that they would be assigned the same location. (If impossible, the merged liveranges could always be split again.) Attaining this result some other way would require reasoning about “affinity” for locations between related liveranges, which is a much more complex question.

Furthermore, processing multiple liveranges together brings all the usual efficiency benefits of batching: the more progress we can make with a single decision, the faster the register allocator runs.

We thus define a “bundle” of liveranges as the unit of allocation. After computing liveranges in the initial input program scan, we merge liveranges into bundles according to a few simple heuristics: across SSA block parameters, across move instructions, and from inputs to outputs of instructions with “reused-input” constraints.

The one key invariant is: all liveranges in a bundle must not overlap. We greedily grow a bundle with the above heuristics, testing at each step whether another liverange can join.
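
That test is cheap because the liveranges within each bundle are kept sorted by start point; a sketch of the overlap check (with simplified range types) is a single merge-scan over the two sorted lists:

    // Sketch: liveranges within a bundle are sorted by start point, so
    // testing whether two bundles may merge (no overlapping ranges) is a
    // single linear merge-scan over both sorted lists.
    #[derive(Clone, Copy)]
    struct Range {
        from: u32, // inclusive
        to: u32,   // exclusive
    }

    fn can_merge(a: &[Range], b: &[Range]) -> bool {
        let (mut i, mut j) = (0, 0);
        while i < a.len() && j < b.len() {
            if a[i].to <= b[j].from {
                i += 1; // a[i] ends before b[j] starts
            } else if b[j].to <= a[i].from {
                j += 1; // b[j] ends before a[i] starts
            } else {
                return false; // the ranges overlap: cannot merge
            }
        }
        true
    }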

Beyond this point in the allocation process, we will reason about bundles: we enqueue them in the priority workqueue, we process them one at a time and assign locations or split. At the end of the process, we’ll scan the liveranges in the bundle and assign each the location that the bundle received.

    CORE ALLOCATION PROBLEM:

         bundle0              bundle1               bundle2          bundle3
           |
           |                    |                                       |
           |                    |
                                |
                                |                     |
                                |                     |
           |
           |                                                            |
           
    ==>
    
           bundle0: r0
           bundle1: r1
           bundle2: r0
           bundle3: r2

Step 3: Assignment Loop and Splitting Heuristics

The heart of the allocator is the main loop that allocates locations to bundles. This is at least conceptually simple: pull a bundle off of a queue, "probe" potential locations one at a time to see whether the bundle fits (i.e., has no overlap with program points for which that location is already reserved), and assign it to the first place where it fits. But there is significant complexity in the details, as always.

The key data structures are: (i) an “allocation map” for each physical register, kept as a BTree for fast lookups, that indicates whether the register is free or occupied at any program point and the liverange that occupies it; and (ii) a queue of bundles to process. (The design document describes several others, such as the second-chance allocation queue and the structures used for stackslots, which we skip here for simplicity.)

The core part of the allocator’s processing occurs here: we pull one bundle at a time from the queue and attempt to place it in one of the registers (again we’re ignoring stackslot constraints for simplicity).

For each bundle, we can perform one of the following actions:

  • If we find a register with no overlapping allocations already in place, we can allocate the bundle to the register; then we’re done! This is the best case.

  • Otherwise, we can pick a register occupied by bundles with a lower "spill cost" (determined as a sum of some heuristic values for each use of a liverange in a bundle), evict those already-allocated bundles, punting them back to the queue, and then put our present bundle in this register instead. We do this only if the present bundle has a higher spill cost.

  • If this is also not an option, we can split our present bundle into pieces and try again. Heuristically, we find it works well to split at the first conflict point; in other words, allocate as much as would have fit in any register, and then put the remainder back in the queue.

   TO ALLOCATE:                  GIVEN:

       bundle0                     r0     r1    r2       r3
         |                          |b1          |b4
         |                          |            |
         |                                       |
                                    |b2          |
                                    |            |
                                                 |
         |                               |b3     |
         |                               |       |
         
         
    OPTION 1: Take a free register (r3)
    
        - Possible if no overlap. Easiest option!
        
    OPTION 2: Evict, if bundle0's spill cost is higher than evicted bundles
              and if no completely free register exists:
    

       bundle1  bundle2            r0     r1    r2
         |                          |b0          |b4
         |                          |            |
                                    |            |
                 |                               |
                 |                               |
                                                 |
                                    |b0  |b3     |
                                    |    |       |
                 
        (b1 and b2 are re-enqueued)
         
    OPTION 3: Split!
                                      
                                       
                                   r0     r1    r2 
                              -->   |b1   |b0    |b4
                                    |     |      |
                                          |      |
                                    |b2          |
                                    |            |
                                                 |
                              -->   |b0  |b3     |
                                    |    |       |

The presence of eviction as an option is what makes regalloc2 a backtracking allocator. It's not immediately obvious that the allocator will always finish its job if it is allowed to undo work. In fact, many bundles may be evicted in order to place just one bundle – isn't this backward progress?

The key to maintaining forward progress is that we only evict bundles of lower spill weight, together with the fact that spill weight monotonically decreases when splitting. Eventually, if bad luck continues far enough, a bundle will be split into individual pieces around each use, and these can always be allocated because (if the input program does not have fundamentally conflicting constraints on one instruction) these single-use bundles have the lowest possible spill weight.
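
Putting the pieces of this section together, a greatly simplified sketch of the main loop might look like the following. The types are hypothetical and only the "free register" case is implemented; eviction and splitting are indicated by comments, and the real allocator layers constraints, hints, spill bundles, and a second-chance queue on top of this skeleton:

    use std::collections::{BTreeMap, BinaryHeap};

    #[derive(Clone, Copy)]
    struct Range { from: u32, to: u32 } // half-open [from, to)

    struct Bundle { ranges: Vec<Range>, spill_cost: u64 }

    // Per-register map from range start to (range end, occupying bundle).
    type RegMap = BTreeMap<u32, (u32, usize)>;

    fn overlaps(map: &RegMap, r: Range) -> bool {
        // Any occupied range that starts before r ends and ends after r
        // starts conflicts with r.
        map.range(..r.to).any(|(_, &(to, _))| to > r.from)
    }

    fn allocate(bundles: &[Bundle], num_regs: usize) -> Vec<Option<usize>> {
        let mut maps = vec![RegMap::new(); num_regs];
        let mut assignment = vec![None; bundles.len()];
        // Process costlier bundles first (a crude stand-in for the real priority).
        let mut queue: BinaryHeap<(u64, usize)> = bundles
            .iter()
            .enumerate()
            .map(|(i, b)| (b.spill_cost, i))
            .collect();
        while let Some((_, b)) = queue.pop() {
            for (reg, map) in maps.iter_mut().enumerate() {
                if bundles[b].ranges.iter().all(|&r| !overlaps(map, r)) {
                    // A register with no conflicts: take it.
                    for &r in &bundles[b].ranges {
                        map.insert(r.from, (r.to, b));
                    }
                    assignment[b] = Some(reg);
                    break;
                }
                // Otherwise, the real allocator would consider evicting the
                // conflicting bundles here (if they all have lower spill
                // cost), or splitting this bundle at the first conflict
                // point and re-enqueueing the pieces.
            }
        }
        assignment
    }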

Step 4: Move Handling

Finally, once we have a series of locations assigned to each bundle, we have “solved the problem”, but… we still need to convey our solution back to the real world, where a compiler is waiting for us to provide a series of move, load, and store instructions to place values into the right spots.

We split the overall problem into two pieces for the usual simplicity reasons: first, we allow ourselves to cut liveranges into as many pieces as needed, and put each piece in a different place, at a single-instruction granularity. We assume that we can edit the program somehow to connect these pieces back up. That allowed the above liverange/bundle processing to become a tractable problem for a solver core to handle. Now, we need to connect those liverange fragments. This is the second half of the problem: generating moves.

All-in-One: Liverange Connectors, Program Moves, and Edge Moves

The abstract model for the input to this stage of the allocator is that between each pair of instructions, we perform some arbitrary permutation of liveranges in locations. One way to see this permutation is as a parallel move: a data-movement action that reads values in all of their old locations (inputs of the permutation), then in parallel, writes the values to all of their new locations (outputs of the permutation).

        EARLY
    inst1      r2, r0, r1
        LATE
        
          { r4 := r0 }              <--- regalloc-inserted moves
        
        EARLY
    inst2      r0, r2, r3
        LATE
        
          { r6 := r5, r5 := r6 }    <--- multiple moves in parallel!
                                         (arbitrary permutations)
          
        EARLY
    inst3      r5, r4, r2
        LATE

This is why we make a distinction between the “After” point of instruction i and the “Before” point of instruction i+1, though a traditional compiler textbook would tell you that there is only one program point between a pair of instructions. We have two, and between these two program points lies the parallel move.10

The process for generating these moves is: we scan liveranges, finding points at which they have been split into pieces where the value must flow from one piece to the next. We also account for CFG edges and block parameters at this point, as well as for move instructions in the input program. Once we have accumulated the set of moves that must happen, in parallel, at a given priority at a given location, we resolve these into a sequence of individual move/load/store instructions using the algorithm we describe in the next section.

One thing to note about this design is that we are handling all value movement in the program with a single resolution mechanism: regalloc-induced movement but also movement that was present in the original program. This is valuable because it allows the moves to be handled more efficiently. In contrast, we have observed issues in the past in allocators that lower moves in stages – e.g., SSA block parameters to moves prior to regalloc, then regalloc-induced moves during regalloc – where chains of moves occur because each level of abstraction is not aware of what other levels below or above it are doing.

Parallel Move Resolution

The actual problem of resolving a permutation such as:

    { r0 := r1 ; r1 := r2 ; r2 := r0 }

into a sequence of moves

    scratch := r0
    r0 := r1
    r1 := r2
    r2 := scratch

is a well-studied one, and is known as the “parallel moves problem”. The crux of the solution is to understand the permutation as a kind of dependency graph, and sort its moves so that we pull an old value out of a given register before overwriting it. When we encounter a cycle, we can use a scratch register as above.

One might think that something like Tarjan’s algorithm for finding strongly-connected components is needed, but in fact there is a nice property of the problem that greatly simplifies it. Because any valid permutation has at most one writer for any given register, we can only have simple cycles of moves, with other uses of old values in the cycle handled before realizing the cyclic move. Some more description is available in our implementation. In fact, this is such a nice observation that we later discovered a paper by Rideau et al. that names the resulting dependency graphs “windmills” for their shape (see figure below – there can be a simple cycle in the middle, and only acyclic outward moves from cycle elements in a tree of outward shifts) and, delightfully, describes more or less the same algorithm to “tilt at windmills” and resolve the moves.

Figure: "Windmills" in a register movement graph
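
A minimal sketch of this sequentialization (assuming locations are plain integers, a scratch location is available, and self-moves have already been filtered out) might look like:

    // Sequentialize a parallel move (a set of (dst, src) pairs with
    // distinct dsts): emit any move whose destination is not read by a
    // remaining move; when only cycles remain, break one with a scratch.
    fn resolve_parallel_moves(mut pending: Vec<(u32, u32)>, scratch: u32) -> Vec<(u32, u32)> {
        let mut out = vec![]; // sequential (dst, src) moves, in order
        while !pending.is_empty() {
            // Find a move whose destination no other pending move still reads.
            let ready = (0..pending.len())
                .find(|&i| !pending.iter().any(|&(_, src)| src == pending[i].0));
            if let Some(i) = ready {
                out.push(pending.remove(i));
                continue;
            }
            // Only cycles remain. Save one element to the scratch, redirect
            // its readers to the scratch, then perform its move.
            let (dst, src) = pending[0];
            out.push((scratch, dst));
            for m in pending.iter_mut() {
                if m.1 == dst {
                    m.1 = scratch;
                }
            }
            out.push((dst, src));
            pending.remove(0);
        }
        out
    }

    fn main() {
        // { r0 := r1; r1 := r2; r2 := r0 }, with r15 as the scratch.
        let seq = resolve_parallel_moves(vec![(0, 1), (1, 2), (2, 0)], 15);
        println!("{:?}", seq); // [(15, 0), (0, 1), (1, 2), (2, 15)]
    }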

Scratch Registers and Cycles

The above algorithm works, but has one serious drawback: it requires a scratch register whenever we have a cyclic move. The simplest approach to this requirement is to set aside one register permanently (or actually, one per “register class”: e.g., an integer register and a float/vector register). Especially on ISAs with relatively few registers, like x86-64 with 16 each of integer and float registers, this can impact performance by increasing register pressure and forcing more spills.

We thus came up with a scheme to allow use of all registers but still find a scratch when needed for a cyclic move. The approach begins with an idea borrowed from IonMonkey, namely to look for a free register to use as a scratch by actually probing the allocation maps. This often works: the need for a cyclic move doesn’t necessarily imply that we will have high register pressure, and so there are often plenty of free registers available.

What if that doesn't work, though? In that case, we take another seemingly-simplistic approach: we use a stackslot as the scratch instead! This means that we will resolve the cyclic move into a sequence including stores and loads, but this is fine, because we're already in a situation where all registers are full and we need to spill something.

We’re not quite done, though: there is another very important use of the scratch register in a simplistic design, namely to resolve memory-to-memory moves! This arises because our move resolution handles both registers and stackslots in a uniform way, so some cycle elements may be stackslots (memory locations). Using a stackslot as a scratch above just compounds the problem. So we translate, in a separate second phase, memory-to-memory moves into a pair of a load (from memory into scratch) and a store (from scratch into memory).

So to recap, we may find a cyclic move permutation to be necessary, and no registers free to use as scratch; so we use a stackslot instead. But some of the original move cycle may have been between stackslots, so we need another scratch to make these stackslot-to-stackslot moves possible. But we're already out of scratch registers!

The solution to this last issue is that we can do a last-ditch emergency spill of any register, just for the duration of one move. So we pick a “victim” register of the right kind (integer or float), spill it to a second stackslot, use this victim register for a memory-to-memory move (a load and store pair), then reload the victim.

This cascading series of solutions, each a little more complex but a little rarer, is an example of a complexity-for-performance tradeoff. Overall, it is far better to allow the program to use all registers; this will reduce spills. And most parallel moves are not cyclic, so scratch registers are rarely needed anyway. And when a cyclic move is needed, we often have a free register, because this condition is mostly orthogonal to high register pressure. It is only when all of the bad cases line up – cycle, no free registers, and memory-to-memory moves – that we reach for the highest-cost approach (decomposing one move into four), and so the most important aspect of this fallback is not that it is fast but that it is correct and can handle all cases.
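
To make the cascade concrete, a sketch of the decision (with hypothetical names and types, not the actual implementation) might look like this; note that the last-ditch case turns one move into exactly the four edits mentioned above:

    // Sketch (names hypothetical) of the cascading scratch strategy for a
    // cyclic parallel move: prefer a truly free register; otherwise fall
    // back to a stackslot. Any memory-to-memory moves that remain are
    // lowered, in a second phase, to a load/store pair through a "victim"
    // register that is spilled and reloaded around the pair.
    #[derive(Clone, Copy, Debug)]
    enum Loc { Reg(u8), Slot(u32) }

    fn pick_scratch(free_reg_here: Option<u8>, extra_slot: u32) -> Loc {
        match free_reg_here {
            Some(r) => Loc::Reg(r), // cheapest: a register unused at this point
            None => Loc::Slot(extra_slot),
        }
    }

    // The last-ditch case: one memory-to-memory move becomes four edits.
    fn lower_mem_to_mem(dst: u32, src: u32, victim: u8, save_slot: u32) -> [(Loc, Loc); 4] {
        [
            (Loc::Slot(save_slot), Loc::Reg(victim)), // spill the victim
            (Loc::Reg(victim), Loc::Slot(src)),       // load
            (Loc::Slot(dst), Loc::Reg(victim)),       // store
            (Loc::Reg(victim), Loc::Slot(save_slot)), // reload the victim
        ]
    }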

Everything Else

This has been a not-so-whirlwind tour of the allocator pipeline in regalloc2, but despite my longwindedness, we had to skip many details! For example, the way in which stackslots are allocated for spilled values, the way in which split pieces of a single original bundle share a single spill location (“spill bundles”), the way in which we clean up after move insertion with Redundant Move Elimination (a sort of abstract interpretation that tracks symbolic locations of values), and more, are skipped here but are all described in the design document. One could truly write a book on the engineering of a register allocator, but the above will have to suffice; now, we must move on and draw some lessons!

Four Lessons

Performance

Cache Locality and Scans

One enduring theme in the regalloc2 architecture is data structure design for performance. As I began the project by transliterating IonMonkey code, building Rust equivalents to the data structures in the original C++, I found several things:

  • The original data structures were heavily pointer-linked. For example, liveranges within bundles and uses within liveranges were kept as linked lists, to allow for fast insertion and removal in the middle, and splicing. A linked list is the classical CS answer to these requirements.

  • There were quite a few linear-time queries of these data structures. For example, when generating moves between liveranges of a virtual register, a scan would traverse the linked list of these liveranges, observe the range covering one end of a control-flow transition, and do a linear-time scan (through the linked list) for the liverange at the other end!

These two design trends combine to make CPU caches exceptionally unhappy. First there is the algorithmic inefficiency, then there is the cache-unfriendly demand access to random liveranges, each of which is a pointer-chasing scan.

regalloc2 adopts two general themes that work against these problems:

  • The overall data structure design consists of contiguous-in-memory inline structs rather than linked lists. For example, the list of liveranges in a bundle is a SmallVec<[LiveRangeListEntry; 4]>, i.e. a list with up to four entries inline and otherwise heap-allocated, and the entry struct contains the program-point range inline. Combining this more compact layout with certain invariants – usually, some sort of sorted-order invariant – allows for efficient lookups and list merges even without linked-list splicing.

  • At a higher level, regalloc2 tries to avoid random lookups as much as possible. Sometimes this is unavoidable, but where it is not, a linear scan that produces some output as it goes is much more cache-friendly.

It is worth examining the particular technique we use to resolve moves across control-flow edges. This requires looking up where a virtual register is allocated at either end of the edge – two arbitrary points in the linear sequence of instructions. The problem is solved in IonMonkey (as described above) by scanning over ranges to find basic block ends and then doing a linear-time linked-list traversal to find the "other end", for overall quadratic time.

Instead we scan the liveranges for a virtual register once and produce “half-moves” into a Vec. These “half-moves” are records of either the “source” side of a move, at the origin point of a CFG edge, or the “destination” side of a move, at the destination point of a CFG edge. After our single scan, we sort the list of half-moves by a key (the vreg and destination block) so that the source and destination(s) appear together. We can then scan this list once and generate all moves in bulk.

If that sounds something like MapReduce, that is not an accident: the technique of leveraging a sort with a well-chosen key was invented to allow for efficient parallel computation, and here allows the two “ends” of the move to be processed independently.

This technique provides better algorithmic efficiency, much better cache residency (we have two steps that boil down to “scan input list linearly and produce output list linearly”), and leans on the standard-library implementation of sort(), which is likely to be faster than anything we can come up with. Profiling of regalloc2 runs shows sometimes up to 10% or so of runtime spent in sort(), but this is far better than the alternative, in which we do a random pointer-chasing lookup at every step.
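
As a rough sketch of the idea (with simplified record types; the real half-moves and sort key carry more detail), the whole edge-move resolution looks something like this:

    // Sketch of the "half-move" technique: one linear scan emits Source
    // and Dest records for CFG-edge moves; a sort by (vreg, dest block)
    // brings each source together with its destination(s); a second
    // linear scan then emits the actual moves.
    #[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
    enum Side { Source, Dest } // Source sorts before Dest for the same key

    #[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
    struct HalfMove {
        vreg: u32,
        dest_block: u32,
        side: Side,
        alloc: u32, // the location at this end of the edge
    }

    fn resolve_edge_moves(mut half_moves: Vec<HalfMove>) -> Vec<(u32, u32)> {
        half_moves.sort(); // key: (vreg, dest_block, side)
        let mut moves = vec![]; // (to_alloc, from_alloc) pairs
        let mut i = 0;
        while i < half_moves.len() {
            let src = half_moves[i];
            i += 1;
            if src.side != Side::Source {
                continue; // orphan Dest record: nothing to pair it with
            }
            // All following Dest records with the same key pair with this source.
            while i < half_moves.len()
                && half_moves[i].side == Side::Dest
                && half_moves[i].vreg == src.vreg
                && half_moves[i].dest_block == src.dest_block
            {
                if half_moves[i].alloc != src.alloc {
                    moves.push((half_moves[i].alloc, src.alloc));
                }
                i += 1;
            }
        }
        moves
    }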

Compact Data

Another lesson learned over and over during regalloc2 optimization is this: data compactness matters! A single struct growing from 16 to 24 bytes could lead to significant slowdowns if a large input leads to allocation and traversals over an array of 10,000 such structs. Every improvement in memory footprint is a reduction in cache misses.

We play many games with bitpacking to achieve this. For example, regalloc2 puts its Operand in 32 bits, and this includes a virtual register number, a constraint, a physical register number to possibly go with that constraint, the position (early/late), kind (def/use), and register class of the operand. Some of this optimization requires compromise: as a result of our encoding scheme, for example, we can allow only 2M (2^21) virtual registers per function body. But in practice most applications will have other limits that take effect before this matters. (And in any case, many compilers play these same sorts of tricks, so megabytes-large function bodies are problematic in all sorts of ways.) And we sometimes find ways to pack a few more bits (more such PRs are always welcome!).

We play similar tricks with program points, spill weights (we store them as bfloat16 because spill weights need not be too precise, only relatively comparable, and using only 16 bits lets us pack some flags in the upper 16 and save a u32), and more.

Finally, trading off indirection and data-inlining is important: e.g., a LiveRangeList keeps the program-point range (32 + 32 bits) inline, then a 32-bit index to indirect to everything else about the liverange, because checking for bundle overlap is the most common reason for traversing this list and reducing cache misses in this inner loop is paramount.
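
To illustrate the flavor of this bitpacking (with a hypothetical layout, not regalloc2's exact encoding), an operand might be packed into a u32 like so; the 21-bit vreg field is where a limit like "2M virtual registers per function" comes from:

    // Hypothetical 32-bit operand layout:
    //   bits 0..21   vreg index        (21 bits -> at most ~2M vregs)
    //   bits 21..27  physical register (used by fixed-register constraints)
    //   bits 27..29  constraint kind   (any / fixed / reuse / stack)
    //   bit  29      kind (use/def)
    //   bit  30      position (early/late)
    //   bit  31      register class (int/float)
    #[derive(Clone, Copy)]
    struct PackedOperand(u32);

    impl PackedOperand {
        fn new(vreg: u32, preg: u32, constraint: u32, kind: u32, pos: u32, class: u32) -> Self {
            debug_assert!(vreg < (1 << 21) && preg < (1 << 6) && constraint < (1 << 2));
            debug_assert!(kind < 2 && pos < 2 && class < 2);
            PackedOperand(
                vreg | (preg << 21) | (constraint << 27) | (kind << 29) | (pos << 30) | (class << 31),
            )
        }
        fn vreg(self) -> u32 { self.0 & ((1 << 21) - 1) }
        fn preg(self) -> u32 { (self.0 >> 21) & ((1 << 6) - 1) }
    }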

Reducing Work

One final performance technique that at once both sounds completely obvious and superficial, yet is quite powerful, is: “simply do less work!”

One can often get lost in profiler results, wondering how to shave off some hotspots by compacting some data or reworking some inner-loop logic, only to miss that one is implicitly assuming that the actual computation to be done is invariant. In other words, one might look for the fastest way to compute a particular subproblem or framing of the problem, rather than the ultimate problem at hand (in this case, the register allocation).

In the case of regalloc2, this primarily means that we can improve performance by reducing the number of bundles and liveranges. In turn, this means that we can get outsized wins by improving our merging and splitting heuristics.

Early in the optimization push, I realized that regalloc2 was often finding an abnormally large number of conflicts between bundles, and splitting far too aggressively. It turned out that the liveness analysis was initially approximate, in an intentional, if premature, efficiency tradeoff to avoid a fixpoint loop in favor of a single-pass loop-backedge-based algorithm that overapproximated liveness (which is fine for correctness). The time that this saved was more than offset by the large increase in iterations of the bundle processing loop. So I reworked this into a precise analysis that iterates until fixpoint. It is worthwhile to pay that extra analysis cost upfront to get exact liveness in order to make our lives (and our runtime) better later.

The way in which we compute that precise liveness itself also raises an interesting way of reducing work: by carefully choosing invariants. We perform the liverange-building scan in such a way that we always observe liveranges in (reverse) program order. This lets us build the liverange data structures, which are normally sorted, with simple appends, merging with contiguous sections from adjacent blocks. This is in contrast to the original IonMonkey allocator’s equivalent function to add liveranges during analysis, which essentially does an insertion sort and merge, leading to O(n²) behavior. Note that the IonMonkey code has a CoalesceLimit constant that caps the O(n²) behavior at some fixed limit. In contrast our liverange build in regalloc2 is always linear-time.

The final way in which one can reduce work, related to data-structure and invariant choice, is by designing the input (API or data format) correctly in order to efficiently encode the problem. The register allocator that preceded regalloc2, regalloc.rs, did not have a notion of register constraints in instructions’ use of virtual registers. Instead, it required the user to use move instructions: reused-input constraints become a move prior to the instruction, and fixed-register constraints become moves to/from physical registers. It then relied on a separate move-elision analysis to try to eliminate these moves. regalloc2 has a smaller input because constraints are carried on every operand. It can still generate these moves when needed, but they often are not. This results in faster allocation as well as often better generated code.

Correctness: “Design for Test” and Fuzzing-First Development

The next set of lessons to come from regalloc2 have to do with how to attain correctness in complex programs.

I believe that regalloc2 is maybe the most intrinsically complex program I have written: its operation relies on many interlocking invariants across the allocation process, and there are many, many edge cases to get right. It is >10K lines of very dense Rust code. There should be approximately zero chance for any human to get this correct, working on real inputs, in any reasonable timeframe. And relying on something this complex to uphold security guarantees that rely on correct compilation should be terrifying.

And yet somehow it seems to work, and we haven’t found any miscompiles caused by RA2 itself since we switched Cranelift to use regalloc2 in April. More broadly, there was one issue where constraints generated by Cranelift could not be handled in some cases, resulting in a panic11; and another where spillslots were not reused as they should be, resulting in worse performance; neither could result in incorrect generated code. In the integration of RA2 into Cranelift, there were two bugs that could, but both were found within 24 hours by the fuzzers. (That doesn’t mean there won’t be any more of course – but things have been surprisingly boring and quiet!)

The main superpower, if one can call it that, that enabled this to work out is fuzzing. And in particular, a step-by-step approach to fuzzing in which I built fuzzing oracles, test harnesses, and fuzz targets as I built the allocator itself, and drove development with it. Until about 4 months in when I wired up the first version of the Cranelift integration, regalloc2 had only ever performed register allocation for fuzz-target-generated inputs. It still doesn’t have a test harness for manually-written tests; there seems to be no need, as the fuzzer is remarkably prescient at finding bugs.

I find it helpful to think of this philosophy in terms of the design-for-test idea from digital hardware design. In brief, the idea is that one builds additional features or interfaces into the hardware specifically so its internal state is visible and it can be tested in controlled, systematic ways.

The first thing that I built in the regalloc2 tree was a function body generator that produces arbitrary control flow, either reducible or irreducible, and arbitrary uses and defs according to what SSA allows. I then built an SSA validator, and finally, fuzzed one against the other. This way I built confidence that I had fuzzing input that included interesting edge cases. This would become an important tool for testing the whole allocator, but it was important to “test the tester” first and cross-check it against SSA’s requirements. Of course, checking SSA requires one to compute flowgraph dominance on the CFG, and that can be fuzzed too, using a from-first-principles definition of graph dominance. So the test-tester has itself been tested in this additional way.

Once I had built up these lower-level tools and sharpened them all against each other, it was time to write the register allocator itself. As each major piece was implemented, I first fuzzed it with the SSA function generator to check for panics (assertion failures, mostly). Getting a clean run, given the relatively generous spread of asserts throughout the codebase, gave some confidence that the allocator was doing something reasonable. But to truly be confident that the results were semantically correct answers, we needed to lean more heavily on some program analysis techniques.

In another blog post I detailed our “register allocator checker”. In brief, this is a symbolic verification engine that checks that the resulting register allocations produce the same dataflow connectivity as the original, pre-regalloc program. To fully verify regalloc2, I ported the checker over, and drove the whole pipeline – SSA function generator, allocator, and checker – with a fuzz target.

This workflow was remarkably (sometimes maddeningly!) effective. I started with a supposedly complete allocator, and ran the fuzzer. Within a few seconds it found a "counterexample" where, according to the checker, regalloc2 produced an incorrect allocation. I built annotation tooling to produce views of the allocator's liveranges and other metadata over the original program. I pored over this and debug-log output of the allocator's various stages, eventually worked out the bug (often some corner-case I had not considered, or sometimes an unexpected interaction between two different parts of the system) and came up with a fix. With the particular fuzz-bug fixed, I started up the main fuzzer again. libFuzzer's startup seems to run over the entire corpus before generating new inputs, so if a bugfix regressed a case I had already handled, the fuzzer would usually surface that almost immediately. After juggling solutions and finding some way to maintain correctness in all cases, I would let the fuzzer run again, usually finding my next novel fuzzbug within a few minutes.

This was my life for a month or so. Fuzzers, especially over complex programs with strict oracles, are relentless: they leave no rock unturned, they find every bug you could imagine and some you can’t, and they accept no excuses. But one day… you run the fuzzer and you find that it keeps running. And running. Three hours later, it’s still running. There is no better feeling in the software-engineering universe, and frankly fuzzing with a strong oracle (like symbolic checking or differential execution fuzzing) is probably the second-strongest assurance one will get that one’s code is correct (with respect to the “spec” implied by the testcase generator and oracles, mind!) short of actual formal verification. This was the project that changed my opinion on fuzzing from “nice to have supplemental correctness technique” to “the only way to develop complex software”.

Compatibility and Migration Path

The last lesson I want to draw from my regalloc2 experience is how one might think about compatibility and migrations, in the context of large “replace a whole unit” updates to software projects.

The regalloc2 effort occurred within the context of the Cranelift project, and was designed primarily for use in Cranelift (though it can be used, and apparently is being used, as a standalone library elsewhere as well). As such, a primary design directive for regalloc2 could be “do whatever is needed to fit into Cranelift’s assumptions about the register allocator”.

On the other hand, conforming to the imprint left by the previous register allocator is a good way to sacrifice a rare chance to explore different corners of the design space. The regalloc.rs API, as designed in 2020, was quite good for its time – simple, easy to use, and purpose-built for Cranelift – but we subsequently learned several lessons. For example, regalloc.rs required the program to already be lowered out of SSA, resulting in somewhat inefficient interactions between blockparam-generated moves and regalloc-generated moves. Ideally we wanted to do better here.

A timeline for context: regalloc2 proper was working, with its fuzzer as its only client, after about 6 weeks of initial implementation (late March to early May 2021). I cheerfully dove into a refactoring of Cranelift at that point to adapt to the new abstractions.

Less cheerfully, after a few weeks of effort, I stopped this direct-port attempt with around 547 type errors remaining (having never gotten through a full typecheck). There was simply too much changing all at once, and it was clearly not going to be a reasonable single diff to review or to trust for correctness. I had underestimated how much would have to change; pulling on one thread loosened three others.

It was clear that some sort of transition would need to happen in multiple stages, so I next built a compatibility shim as a new “algorithm” in regalloc.rs that was a thin wrapper around regalloc2. This involved significant work in regalloc2 to expand its range of accepted inputs: support for non-SSA code, support for “modify” operands as well as uses and defs, and explicit handling of program-level moves, integrated into the move-generation logic. This was working by August of 2021. Performance via the shim was not as good as we expected direct (“native”) use of the regalloc2 API to deliver, but it was a promising intermediate step nonetheless.
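
To give a flavor of what “expanding its range of accepted inputs” meant at the API level, the operand model can be pictured roughly like this (a simplified illustration, not the exact regalloc2 types):

    // Simplified sketch of operand kinds. A pure-SSA client needs only uses
    // and defs; the compatibility work added a "modify" kind for pre-existing
    // non-SSA code such as two-address x86 instructions.
    enum OperandKind {
        /// The operand reads a value: the vreg must be live into the instruction.
        Use,
        /// The operand writes a value: this begins a new liverange for the vreg.
        Def,
        /// The operand reads and then writes the same vreg in one location,
        /// e.g. the destination of x86 `add dst, src`. In SSA-native usage this
        /// is instead expressed as a def constrained to reuse an input's register.
        Mod,
    }

Supporting the “modify” kind, along with non-SSA inputs and program-level moves, is what let the shim present regalloc.rs’s old contract on top of regalloc2, and later let Cranelift drop those legacy features one call-site at a time.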

However, for somewhat complicated reasons, review of that PR stalled, and I spent time in other parts of Cranelift (the ISLE DSL and instruction-selector backends using it). When I eventually came back to RA2, in February 2022, several things had changed: some refactoring (as a result of ISLE) made adaptation to “SSA-like” form in x86 instructions easier, and the enhancements to regalloc2 as part of the regalloc.rs compatibility shim also let us use RA2 directly and migrate away from “modify” operands, moves, etc., in an incremental way.

So I made a second attempt at porting Cranelift to use regalloc2 directly, this time succeeding, with fairly good results. We’ve been using RA2 since that PR merged in mid-April 2022, about a year after work on RA2 began.

I learned a few valuable lessons from this saga, but the main one is: incremental migration paths are everything. The above PR may look horribly scary, but much of the churn was “semantically boring”: RA2 supported, in the end, most of the same abstractions as regalloc.rs, with only blockparam handling changing fundamentally. This is a sort of hybrid of the “compatibility shim” and “direct use of new API” approaches: a new API, but one supporting a superset of the semantic demands of the old API. One can then migrate API use-sites away from “legacy semantics” one at a time, and eventually delete the warts (e.g., “modify” operands in addition to pure uses/defs) if one desires, but that work is decoupled from the main atomic switchover. I do indeed hope to do such cleanup in Cranelift, in due time.

Along with that, it is useful to think of a finite budget for semantic/design-level cleanup per change. Rewrites are opportune times to push a project into a better part of the design space and benefit from lessons learned, sometimes in ways that would be hard or impossible with a truly incremental approach. However, at the margins where the rewrite connects to the outside world, this shift causes tension, and so it is fundamentally constrained or else has to pull the whole world along with it. I am happy that regalloc2 pulls responsibility for SSA lowering into the allocator, where it can be handled more efficiently. Likewise, I am happy that the compatibility-shim effort filled in support for regalloc.rs features in a way that made the rest of the transition easier.

The Unending and Unwinnable Nature of Heuristic Tuning

The final lesson I wish to pull out of this experience is one that has become apparent in the time since the initial transition to RA2: any program that solves an NP-hard problem in a complex way, with a hybridized ball of hundreds of individual heuristics and techniques that somehow works most of the time, is always going to make someone unhappy in some case, and at some point unambiguous wins become very hard to find. That is not at all to say that continued attempts at optimization aren’t worthwhile; sometimes improvements do become apparent. But they become much rarer after the initial hillclimb to the top of a “competent implementation of one point in design-space” local maximum.

While looking for more performance, I experimented with many different split heuristics. Especially difficult is the relationship of splits to loops: when one has a hot inner loop, one really wants to place any split-point that implies an expensive move (a load or store) outside that loop. But getting this right in all cases is subtle, because the winning tradeoff depends on register pressure inside the loop, how many values are live across the loop and into the following code, how many uses occur in the loop and how frequently (rare path vs. common path), and so on. In the end, I abandoned a number of more complex cost heuristics (an example is in this never-merged commit) and went with two simple ones: minimize the cost of the implied move at a split, and explicitly hoist split-points out of loops. This worked best overall, but did leave a little performance unclaimed in some microbenchmarks.
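
As a rough illustration of the final approach (a sketch with invented types; the real implementation works over liveranges, bundles, and block boundaries), split-point selection can be thought of as: among the legal candidate points, first minimize loop depth, so that the split is hoisted out of hot loops, and then minimize the estimated cost of the move the split implies:

    // Sketch of the simple split-point heuristic described above; the
    // CandidatePoint fields are invented for illustration.
    struct CandidatePoint {
        /// Nesting depth of the innermost loop containing this program point.
        loop_depth: u32,
        /// Estimated cost of the move (reg-reg, load, or store) implied by
        /// splitting here, weighted by expected execution frequency.
        move_cost: u64,
    }

    /// Pick the index of the best candidate: lowest loop depth first, then
    /// cheapest implied move.
    fn choose_split_point(candidates: &[CandidatePoint]) -> Option<usize> {
        candidates
            .iter()
            .enumerate()
            .min_by_key(|(_, c)| (c.loop_depth, c.move_cost))
            .map(|(i, _)| i)
    }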

Sometimes clearer improvements are still possible. One example from a recent investigation: in #3785, we noticed that switching to RA2 had caused an extra move instruction to appear in a particular sequence. This seems minor, but it is always worth understanding why such a thing occurred and whether it points to some deeper issue. After some investigation, it became apparent that the splitting heuristics were suboptimal in the particular case of a liverange that spans from a register-constrained use to a stack-constrained use. The details are beyond the scope of this post (thank goodness, it’s long enough already!), but empirically I found that trimming liveranges around a split-site in a slightly different way tended to improve results.

So, some changes are unmitigated wins, but not every change is; many come down to tradeoffs. At the very least, the nature of a register allocator is that one will likely face an unending stream of “could work better in this case” sorts of issues. Can’t win ’em all (but keep trying nonetheless!).

Conclusions

We’re finally at the conclusions – thanks to all who have persisted in reading this far!

regalloc2 has been an immensely rewarding project for me, despite (or perhaps because of) the ups-and-downs inherent in building an honest-to-goodness, actually-works, somewhat-competitive-with-peer-compilers register allocator. It was a far larger project than I had anticipated: when I began, I told my manager it would probably be a few weeks to evaluate scope, maybe a month of work total. Witness Hofstadter’s Law in action: that is, it will always take longer than you think it will, even when accounting for Hofstadter’s Law.

I hope some of the above lessons have been illuminating, and that this post has given some sense of how many interesting problems the register-allocation space contains. It’s an area that has been well studied for at least 40 years, with countless approaches and clever tricks to learn and to combine in new ways; the work is far from over!

Acknowledgments

Many, many thanks to: Julian Seward and Benjamin Bouvier for numerous discussions about register allocation throughout 2020, and Julian for several followup discussions after regalloc2 started to exist; Julian Seward and Amanieu d’Antras for initial code-review of regalloc2 proper; Amanieu for a number of really high-quality PRs to improve RA2 and add feature support; and Nick Fitzgerald for code-review of the (quite extensive) Cranelift refactoring to use regalloc2. Enormous thanks to Nick for reading over this entire post and providing feedback as well.


  1. which is to say, the original three-part series covered a range of topics summarizing the goals and ideas of Cranelift’s new backend design, but we haven’t stopped working to improve things since then! The series is now four-thirds complete; by the time I’m done it may be five-thirds or more… 

  2. In fact, it is perhaps the most important problem to solve for a fast Wasm-focused compiler, because most other common compiler optimizations will have been done at least to some degree to the Wasm bytecode; register allocation is the main transform that bridges the semantic gap from stack bytecode to machine code. 

  3. Other sorts of constraints are possible too; in general, a liverange is constrained by all of the “register mentions” in instructions that touch the liverange’s vreg, and we have to satisfy all of these constraints at once. A constraint may be “any register of this kind”, or “this particular physical register”, or “a slot on the stack”, or “the same register as given to another liverange”, for example. And beyond constraints, we may have soft “hints” as well, which if followed, reduce the need to move values around. 

  4. regalloc2 supports arbitrary control flow (i.e., does not impose any restrictions on reducibility); its only requirement is that critical edges are split, which Cranelift ensures by construction during lowering. 

  5. Full credit for this idea, as well as most of the constraint design in regalloc2, goes to IonMonkey’s register allocator. 

  6. As we’ll note under “Lessons” below, during development of a compatibility layer that allowed regalloc2 to emulate regalloc.rs, an earlier register allocator, we actually added a “modify” kind of operand that directly corresponds to the semantics of rax above, namely read-then-written all in one register. We subsequently used it in several places while migrating Cranelift. But for simplicity we hope to eventually remove this (once all uses of it are gone). 

  7. It’s actually a sparse bitset that, when it grows large enough, switches to a hashmap whose values are contiguous 64-bit chunks of the whole bitset. This is because, for large functions with thousands of virtual registers, keeping thousands of bits per basic block would be impractical. However, the naive sparse approach, where we keep a HashSet<VReg> or equivalent, is also costly, because it spends 32 bits per set element (plus load-factor overhead). We observed that the live registers at a given point are often “clustered”: there are some long-range live values from early in the function, and then a bunch of recently-defined registers. (This also depends on virtual registers being numbered roughly in program order, which is generally a good heuristic to rely on.) So we pay the sparse-map cost only at the granularity of 64-bit chunks, and are dense within each chunk; a simplified sketch of this structure appears after these notes. 

  8. Credit must go to IonMonkey for this trick as well, though the details of how to edit the liveranges appropriately to get the right interference semantics were far from clear and the path to our current approach was “paved by fuzzbug failures”, so to speak. 

  9. Some literature on SSA form calls the connected set of liveranges via phi-nodes or block parameters “webs”. Our notion of a bundle encompasses this case but is a bit more general; in principle we can merge any liveranges into a bundle as long as they don’t overlap. 

  10. Actually, there are up to seven parallel moves between instructions, at priorities according to the way that various constraint edge-cases are lowered. For example, when a single vreg must be placed in multiple physical registers due to multiple uses with different fixed-register constraints, the move that makes this happen occurs at MultiFixedReg priority, which comes after the main inter-instruction permutation (it is logically part of the input setup for the following instruction). And ReusedInput moves happen after that, because any one of the fixed-register inputs could be reused as an input. The detailed reasoning for the order here is beyond the scope of this blogpost, but suffice it to say that the fuzzer helped immensely in getting this ordering right! 

  11. Pertinent to the broader point about fuzzing, this combination of constraints was not generated by RA2’s fuzz target, which is why the resulting corner cases were not seen during development. As soon as the fuzzing testcase generator was extended to do so, the fuzzer found a counterexample within a few seconds, and helped to verify the constraint rewrites in RA2’s frontend that fixed this issue.
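
For the curious, the shape of the hybrid liveness set described in footnote 7 is roughly the following (a simplified sketch, not the actual implementation; the real data structure also needs set union and iteration for the liveness computation, which are omitted here). It is sparse at the granularity of 64-bit chunks, keyed by vreg index divided by 64, and dense within each chunk:

    use std::collections::HashMap;

    /// Sparse-of-dense bitset: a hashmap from chunk index (vreg / 64) to a
    /// 64-bit word holding the membership bits for that chunk.
    #[derive(Default)]
    struct ChunkedBitSet {
        chunks: HashMap<u32, u64>,
    }

    impl ChunkedBitSet {
        fn insert(&mut self, vreg: u32) {
            let (chunk, bit) = (vreg / 64, vreg % 64);
            *self.chunks.entry(chunk).or_insert(0) |= 1u64 << bit;
        }

        fn remove(&mut self, vreg: u32) {
            let (chunk, bit) = (vreg / 64, vreg % 64);
            if let Some(word) = self.chunks.get_mut(&chunk) {
                *word &= !(1u64 << bit);
            }
            // (A real implementation might also drop chunks that become empty.)
        }

        fn contains(&self, vreg: u32) -> bool {
            let (chunk, bit) = (vreg / 64, vreg % 64);
            self.chunks
                .get(&chunk)
                .map_or(false, |word| word & (1u64 << bit) != 0)
        }
    }

Because live vregs tend to be clustered in numbering, most chunks that exist are densely populated, so the per-entry hashmap overhead is amortized over many set bits.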