A New Backend for Cranelift, Part 1: Instruction Selection
This post is the first in a three-part series about my recent work on Cranelift as part of my day job at Mozilla. In this first post, I will set some context and describe the instruction selection problem. In particular, I'll talk about a revamp to the instruction selector and backend framework in general that we've been working on for the last nine months or so. This work has been co-developed with my brilliant colleagues Julian Seward and Benjamin Bouvier, with significant early input from Dan Gohman as well, and help from all of the wonderful Cranelift hackers.
Background: Cranelift
So what is Cranelift? The project is a compiler framework written in Rust that is designed especially (but not exclusively) for just-in-time compilation. It's a general-purpose compiler: its most popular use-case is to compile WebAssembly, though several other frontends exist, for example, cg_clif, which adapts the Rust compiler itself to use Cranelift. Folks at Mozilla and several other places have been developing the compiler for a few years now. It is the default compiler backend for wasmtime, a runtime for WebAssembly outside the browser, and is used in production in several other places as well. We recently flipped the switch to turn on Cranelift-based WebAssembly support in nightly Firefox on ARM64 (AArch64) machines, including most smartphones, and if all goes well, it will eventually go out in a stable Firefox release. Cranelift is developed under the umbrella of the Bytecode Alliance.
In the past nine months, we have built a new framework in Cranelift for the "machine backends", or the parts of the compiler that support particular CPU instruction sets. We also added a new backend for AArch64, mentioned above, and filled out features as needed until Cranelift was ready for production use in Firefox. This blog post sets some context and describes the design process that went into the backend-framework revamp.
It can be a bit confusing to keep all of the moving parts straight. Here's a visual overview of Cranelift's place among various other components, focusing on two of the major Rust crates (the Wasm frontend and the codegen backend) and several of the other programs that make use of Cranelift:
Old Backend Design: Instruction Legalizations
To understand the work that we've done recently on Cranelift, we'll need to
zoom into the cranelift_codegen
crate above and talk about how it used to
work. What is this "CLIF" input, and how does the compiler translate it to
machine code that the CPU can execute?
Cranelift makes use of CLIF, or the Cranelift IR (Intermediate Representation) Format, to represent the code that it is compiling. Every compiler that performs program optimizations uses some form of an Intermediate Representation (IR): you can think of this like a virtual instruction set that can represent all the operations a program is allowed to do. The IR is typically simpler than real instruction sets, designed to use a small set of well-defined instructions so that the compiler can easily reason about what a program means. The IR is also independent of the CPU architecture that the compiler eventually targets; this lets much of the compiler (such as the part that generates IR from the input programming language, and the parts that optimize the IR) be reused whenever the compiler is adapted to target a new CPU architecture. CLIF is in Static Single Assignment (SSA) form, and uses a conventional control-flow graph with basic blocks (though it previously allowed extended basic blocks, these have been phased out). Unlike many SSA IRs, it represents φ-nodes with block parameters rather than explicit φ-instructions.
Within cranelift_codegen
, before we revamped the backend design, the program
remained in CLIF throughout compilation and up until the compiler emitted the
final machine code. This might seem to contradict what we just said: how can
the IR be machine-independent, but also be the final form from which we emit
machine code?
The answer is that the old backends were built around the concept of "legalization" and "encodings". At a high level, the idea is that every Cranelift instruction either corresponds to one machine instruction, or can be replaced by a sequence of other Cranelift instructions. Given such a mapping, we can refine the CLIF in steps, starting from arbitrary machine-independent instructions from earlier compiler stages, performing edits until the CLIF corresponds 1-to-1 with machine code. Let's visualize this process:
A very simple example of a CLIF instruction that has a direct "encoding" to a
machine instruction is iadd
, which just adds two integers. On essentially any
modern architecture, this should map to a simple ALU instruction that adds two
registers.
On the other hand, many CLIF instructions do not map cleanly. Some arithmetic
instructions fall into this category: for example, there is a CLIF instruction
to count the number of set bits in an integer's binary representation
(popcount
); not every CPU has a single instruction for this, so it might be
expanded into a longer series of bit manipulations. There are operations that
are defined at a higher semantic level, as well, that will necessarily be
lowered with expansions: for example, accesses to Wasm memories are lowered
into operations that fetch the linear memory base and its size, bounds-check
the Wasm address against the limit, compute the real address for the Wasm
address, and perform the access.
To compile a function, then, we iterate over the CLIF and find instructions with no direct machine encodings; for each, we simply expand into the legalized sequence, and then recursively consider the instructions in that sequence. We loop until all instructions have machine encodings. At that point, we can emit the bytes corresponding to each instruction's encoding1.
Growing Pains, and a New Backend Framework?
There are a number of advantages to the legacy Cranelift backend design, which performs expansion-based legalization with a single IR throughout. As one might expect, though, there are also a number of drawbacks. Let's discuss a few of each.
Single IR and Legalization: Pros
-
By operating on a single IR all the way to machine-code emission, the same optimizations can be applied at multiple stages. For example, consider a legalization expansion that turns a high-level "access Wasm memory" instruction into a sequence of loads, adds and bounds-checks. If many such sequences occur in one function, we might be able to factor out common portions (e.g.: computing the base of the Wasm memory). Thus the legalization scheme exposes as much code as possible, at as many stages as possible, to opportunities for optimization. The legacy Cranelift pipeline in fact works in this way: it runs "pre-opt" and "post-opt" optimization passes, before and after legalization respectively.
-
If most of the Cranelift instructions become one machine instruction, and few legalizations are necessary, then this scheme can be very fast: it becomes simply a single traversal to fill in "encodings", which were represented by small indices into a table.
Single IR and Legalization: Cons
-
Expansion-based legalization may not always result in optimal code. So far we've seen that legalization can convert from CLIF to machine instructions with one-to-one or one-to-many mappings. However, there are sometimes also single machine instructions that implement the behavior of multiple CLIF instructions, i.e. a many-to-one mapping. In order to generate efficient code, we want to be able to make use of these instructions.
For example, on x86, an instruction that references memory can compute an address like
base + scale * index
, wherebase
andindex
are registers andscale
is 1, 2, 4, or 8. There is no notion of such an address mode in CLIF, so we would want to pattern-match the rawiadd
(add) andishl
(shift) orimul
(multiply) operations when they occur in the address computation. Then, we would want to somehow select the encoding on theload
instruction based on the fact that its input is some specific combination of adds and shifts/multiplies. This seems to break the abstraction that the encoding represents only that instruction's operation.In principle, we could implement more general pattern matching for legalization rules to allow many-to-one mappings. However, this would be a significant refactor; and as long as we were reconsidering the design in whole, there were other reasons to avoid patching the problem in this way.
-
There is a conceptual difficulty with the single-IR approach: there is no static representation of which instructions are expanded into which others and it is difficult to reason about the correctness and termination properties of legalization as a whole.
Specifically, the expansion-based legalization rules must obey a partial order among instructions: if A expands into a sequence including B, then B cannot later expand into A. In practice, mappings were mostly one-to-one, and for those that weren't, there was a clear domain separation between the "input" high-level instructions and the "machine-level" instructions. However, for more complex machines, or more complex matching schemes that attempt to make better use of the target instruction set, this could become a real difficulty for the machine-backend author to keep straight.
-
There are efficiency concerns with expansion-based legalization. At an algorithmic level, we prefer to avoid fixpoint loops (in this case, "continue expanding until no more expansions exist") whenever possible. The runtime is bounded, but the bound is somewhat difficult to reason about, because it depends on the maximum depth of chained expansions.
The data structures that enable in-place editing are also much slower than we would like. Typically, compilers store IR instructions in linked lists to allow for in-place editing. While this is asymptotically as fast as an array-based solution (we never need to perform random access), it is much less cache-friendly or ILP-friendly on modern CPUs. We'd prefer instead to store arrays of instructions and perform single passes over them whenever possible.
-
Our particular implementation of the legalization scheme grew to be somewhat unwieldy over time. Witness this GitHub issue, in which my eloquent colleague Benjamin Bouvier describes all the reasons we'd like to fix the design: #1141: Kill Recipes With Fire. This is no slight to the engineers who built it; the complexity was managed as well as could be, with a very nice DSL-based code generation step to produce the legalizer from high-level rule specifications. However, reasoning through legalizations and encodings become more cumbersome than we would prefer, and the compiler backends were not very accessible to contributors. Adding a new instruction required learning about "recipes", "encodings", and "legalizations" as well as mere instructions and opcodes, and finding one's way through the DSL to put the pieces together properly. A more conventional code-lowering approach would avoid much of this complexity.
-
A single-level IR has a fundamental tension: for analyses and optimizations to work well, an IR should have only one way to represent any particular operation, i.e. should consist of a small set of canonical instructions. On the other hand, a machine-level representation should represent all of the relevant details of the target ISA. For example, an address computation might occur in many different ways (with different addressing modes) on the machine, but we would prefer not to have to analyze a special address-computation opcode in all of our analyses. An implicit rule at emission time ("a load with an add instruction as input always becomes this addressing mode") is not ideal, either.
A single IR simply cannot serve both ends of this spectrum properly, and difficulties arose as CLIF strayed from either end. To resolve this conflict, it is best to have a two-level representation, connected by an explicit instruction selector. It allows CLIF itself to be as simple and as normalized as possible, while allowing all the details we need in machine-specific instructions.
For all of these reasons, as part of our revamp of Cranelift and a prerequisite to our new AArch64 backend, we built a new framework for machine backends and instruction selection. The framework allows machine backends to define their own instructions, separately from CLIF; rather than legalizing with expansions and running until a fixpoint, we define a single lowering pass; and everything is built around more efficient data-structures, carefully optimizing passes over data and avoiding linked lists entirely. We now describe this new design!
A New IR: VCode
The main idea of the new Cranelift backend is to add a machine-specific IR,
with several properties that are chosen specifically to represent machine-code
well (i.e., the IR is very close to machine code). We call this VCode
, which
comes from "virtual-register code", and the VCode contains MachInst
s, or
machine instructions. The key design choices we made are:
-
VCode is a linear sequence of instructions. There is control-flow information that allows traversal over basic blocks, but the data structures are not designed to easily allow inserting or removing instructions or reordering code. Instead, we lower into VCode with a single pass, generating instructions in their final (or near-final) order. I'll write more about how we make this efficient in a follow-up post.
This design aspect avoids the inefficiencies of linked-list data structures, allowing fast passes over arrays of instructions instead. We've kept the
MachInst
size relatively small (16 bytes per instruction for AArch64) which aids code generation and iteration speed as well. -
VCode is not SSA-based; instead, its instructions operate on registers. While lowering, we allocate virtual registers. After the VCode is generated, the register allocator computes appropriate register assignments and edits the instructions in-place, replacing virtual registers with real registers. (Both are packed into a 32-bit representation space, using the high bit to distinguish virtual from real.)
Eschewing SSA at this level allows us to avoid the overhead of maintaining its invariants, and maps more closely to the real machine. Lowerings for instructions are allowed to, e.g., use a destination register as a temporary before performing a final write into it. If we required SSA form, we would have to allocate a temporary in this case and rely on the register allocator to coalesce it back to the same register, which adds compile-time overhead.
-
VCode is a container for
MachInst
s, but there is a separateMachInst
type for each machine backend. The machine-independent part is parameterized onMachInst
(which is a trait in Rust) and is statically monomorphized to the particular target for which the compiler is built.Modeling a machine instruction with Rust's excellent facilities for strongly-typed data structures, such as
enum
s, avoids the issue of muddled instruction domain (is a CLIF instruction machine-independent, machine-dependent, or both?) and allows each backend to store the appropriate information for its encoding.
One can visualize a VCode function body as consisting of the following information (simplified; a real example is further below):
Note that the instructions are simply stored in an array, and the basic blocks
are recorded separately as ranges of array (instruction) indices. As we
described above, we designed this data structure for fast iteration, but not
for editing. We always ensure that the first block (b0
) is the entry block,
and that consecutive block indices have contiguous instruction-index ranges
(i.e., are placed next to each other).
Each instruction is mostly opaque from the point of view of the VCode container, with a few exceptions: every instruction exposes its (i) register references, and (ii) basic-block targets, if a branch. Register references are categorized into the usual "uses" and "defs" (reads and writes).2
Note as well that the instructions can refer to either virtual registers
(here denoted v0
..vN
) or real machine registers (here denoted
r0
..rN
). This design choice allows the machine backend to make use of
specific registers where required by particular instructions, or by the ABI
(parameter-passing conventions). The semantics of VCode are such that the
register allocator recognizes live ranges of the real registers, from defs to
uses, and avoids allocating virtual registers to those particular real
registers for their live ranges. After allocation, all machine instructions are
edited in place to refer only to real registers.
Aside from registers and branch targets, an instruction contained in the VCode may contain whatever other information is necessary to emit machine code. Each machine backend defines its own type to store this information. For example, on AArch64, here are several of the instruction formats, simplified:
These enum arms could be considered similar to "encodings" in the old backend, except that they are defined in a much more straightforward way. Whereas old Cranelift backends had to define instruction encodings using a DSL, and these encodings were assigned a numeric index and a special bit-packed encoding for additional instruction parameters, here the instructions are simply stored in type-safe and easy-to-use Rust data structures.
We will not discuss the VCode data-structure design or instruction interface
much further, except to note that the relevant instruction-emission
functionality for a new machine backend can be implemented by providing
a MachInst
trait
implementation
for one's instruction type (and then lowering into it; see below). We believe,
and early experience seems to indicate, that this is a much easier task
than what was required to develop a backend in Cranelift's old DSL-based
framework.
Lowering from CLIF to VCode
We've now come to the most interesting design question: how do we lower from CLIF instructions, which are machine-independent, into VCode with the appropriate type of CPU instructions? In other words, what have we replaced the expansion-based legalization and encoding scheme with?
In short, the scheme is a single pass over the CLIF instructions, and at each instruction, we invoke a function provided by the machine backend to lower the CLIF instruction into VCode instruction(s). The backend is given a "lowering context" by which it can examine the instruction and the values that flow into it, performing "tree matching" as desired (see below). This naturally allows 1-to-1, 1-to-many, or many-to-1 translations. We incorporate a reference-counting scheme into this pass to ensure that instructions are only generated if their values are actually used; this is necessary to eliminate dead code when many-to-1 matches occur.
Tree Matching
Recall that the old design allowed for 1-to-1 and 1-to-many mappings from CLIF instructions to machine instructions, but not many-to-1. This is particularly problematic when it comes to pattern-matching for things like addressing modes, where we want to recognize particular combinations of operations and choose a specific instruction that covers all of those operations at once.
Let's start by defining a "tree" that is rooted at a particular CLIF instruction. For each argument to the instruction, we can look "up" the program to find its producer (def). Because CLIF is in SSA form, either the instruction argument is an ordinary value, which must have exactly one definition, or it is a block parameter (φ-node in conventional SSA formulations) that represents multiple possible definitions. We will say that if we reach a block parameter (φ-node), we simply end at a tree leaf -- it is perfectly alright to pattern-match on a tree that is a subset of the true dataflow (we might get suboptimal code, but it will still be correct). For example, given the CLIF code:
block0(v0: i64, v1: i64, v2: b1):
brnz v2, block1(v0)
jump block1(v1)
block1(v2: i64):
v3 = iconst.i64 64
v4 = iadd.i64 v2, v3
v5 = iadd.i64 v4, v0
v6 = load.i64 v5
return v6
let's consider the load instruction: v6 = load.i64 v5
. A simple code
generator could map this 1-to-1 to the CPU's ordinary load instruction, using
the register holding v5
as an address. This would certainly be correct.
However, we might be able to do better: for example, on AArch64, the available
addressing modes include a two-register sum ldr x0, [x1, x2]
or a register
with a constant offset ldr x0, [x1, #64]
.
The "operand tree" might be drawn like this:
We stop at v2
and v0
because they are block parameters; we don't know with
certainty which instruction will produce these values. We can replace v3
with
the constant 64
. Given this view, the lowering process for the load
instruction can fairly easily choose an addressing mode. (On AArch64, the code
to make this choice is
here;
in this case it would choose the register + constant immediate form, generating
a separate add instruction for v0 + v2
.)
Note that we do not actually explicitly construct an operand tree during lowering. Instead, the machine backend can query each instruction input, and the lowering framework will provide a struct giving the producing instruction if known, the constant value if known, and the register that will hold the value if needed. The backend may traverse up the tree (via the "producing instruction") as many times as needed. If it cannot combine the operation of an instruction further up the tree into the root instruction, it can simply use the value in the register at that point instead; it is always safe (though possibly suboptimal) to generate machine instructions for only the root instruction.
Lowering an Instruction
Given this matching strategy, then, how do we actually do the translation?
Basically, the backend provides a function that is called once per CLIF
instruction, at the "root" of the operand tree, and can produce as many machine
instructions as it likes. This function is essentially just a large match
statement over the opcode of the root CLIF instruction, with the match-arms
looking deeper as needed.
Here is a simplified version of the match-arm for an integer add operation lowered to AArch64 (the full version is here):
match op
There is some magic that happens in several helper functions here.
put_input_in_reg()
invokes the proper methods on the ctx
to look up the
register that holds an input value. put_input_in_rse_imm12()
is more
interesting: it returns a
ResultRSEImm12
,
which is a "register, shifted register, extended register, or 12-bit
immediate". This set of choices captures all of the options we have for the
second argument of an add instruction on AArch64. The helper looks at the node
in the operand tree and attempts to match either a shift or zero/sign-extend
operator, which can be incorporated directly into the add. It also checks
whether the operand is a constant and if so, could fit into a 12-bit immediate
field. If not, it falls back to simply using the register input.
alu_inst_imm12()
then breaks down this enum and chooses the appropriate
Inst
arm (AluRRR
, AluRRRShift
, AluRRRExtend
, or AluRRImm12
respectively).
And that's it! No need for legalization and repeated code editing to match several operations and produce a machine instruction. We have found this way of writing lowering logic to be quite straightforward and easy to understand.
Backward Pass with Use-Counts
Now that we can lower a single instruction, how do we lower a function body
with many instructions? This is not quite as straightforward as looping over
the instructions and invoking the match-over-opcode function described above
(though that would actually work). In particular, we want to handle the
many-to-1 case more efficiently. Consider what happens when the
add-instruction logic above is able to incorporate, say, a left-shift
operator into the add instruction. The add
machine instruction would then use
the shift's input register, and completely ignore the shift's output. If the
shift operator has no other uses, we should avoid doing the computation
entirely; otherwise, there was no point in merging the operation into the add.
We implement a sort of reference counting to solve this problem. In particular, we track whether any given SSA value is actually used, and we only generate code for a CLIF instruction if any of its results are used (or if it has a side-effect that must occur). This is a form of dead-code elimination but integrated into the single lowering pass.
To know whether a value is used, we simply track a counter per value, initialized to zero. Whenever the machine backend uses a register input (as opposed to using a constant value directly, or incorporating the producing instruction's operation), it notifies the lowering driver that this register has been used.
We must see uses before defs for this to work. Thus, we iterate over the function body "backward". Specifically, we iterate in postorder; this way, all instructions are seen before instructions that dominate them, so given SSA form, we see uses before defs.
Finally, we have to consider side-effects carefully. This matters in two ways. First, if an instruction has a side-effect, then we must lower it into VCode even if its result(s) have no uses. Second, we cannot allow an operation to be merged into another if this would move a side-effecting operation over another or alter whether it might execute. We ensure side-effect correctness with a "coloring" scheme (in a forward pass, assign a color to every instruction, and update the color on every side effect and on every new basic block); the producing instruction is only considered for possible merging with its consuming instruction if it has no side-effects (hence can always be moved) or if it has the same color as the consuming instruction (hence would not move over another side effect).
The lowering procedure is as follows (full version here):
- Compute instruction colors based on side-effects.
- Allocate virtual registers to all SSA values. It's OK if we don't use some; an unused virtual register will not be allocated any real register.
- Iterate in postorder over instructions. If the instruction has a side-effect, or if any of its results are used, call into the machine backend to lower it.
- Reverse the VCode instructons so that they appear in forward order. 3
Easy!
Examples
Let's see how this works in real life! Consider the following CLIF code:
function %f25(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
v2 = iconst.i32 21
v3 = ishl.i32 v0, v2
v4 = isub.i32 v1, v3
return v4
}
We expect that the left-shift (ishl
) operation should be merged into the
subtract operation on AArch64, using the reg-reg-shift form of ALU instruction,
and indeed this happens (here I am showing the debug-dump format one can see
with RUST_LOG=debug
when running clif-util compile -d --target aarch64
):
VCode {
Entry block: 0
Block 0:
(original IR block: block0)
(instruction range: 0 .. 6)
Inst 0: mov %v0J, x0
Inst 1: mov %v1J, x1
Inst 2: sub %v4Jw, %v1Jw, %v0Jw, LSL 21
Inst 3: mov %v5J, %v4J
Inst 4: mov x0, %v5J
Inst 5: ret
}
This then passes through the register allocator, has a prologue and epilogue attached (we cannot generate these until we know which registers are clobbered), has redundant moves elided, and becomes:
stp fp, lr, [sp, #-16]!
mov fp, sp
sub w0, w1, w0, LSL 21
mov sp, fp
ldp fp, lr, [sp], #16
ret
which is a perfectly valid function, correct and callable from C, on AArch64! (We could do better if we knew that this were a leaf function and avoided the stack-frame setup and teardown! Alas, many optimization opportunities remain.)
There are many other examples of interesting instruction-selection cases in our filetests. One of our favorite pastimes lately is to stare at disassemblies and find inefficient translations, improving the pattern-matching as required, so these are slowly getting better (my brilliant colleague Julian Seward has built a custom tool that dumps the hottest basic blocks from a given JIT execution and has found quite a number of improvements in our AArch64 and x86-64 backends).
Next: Efficient Code-Generation Passes, and Checking the Register Allocator
I've covered a lot of ground in this post, but there's still a lot more to say about the new Cranelift backend framework!
In the second post, I'd like to talk about how we designed the passes after
VCode lowering to be as efficient as possible. In particular this will involve
the way in which we simplify branches, which avoids the more usual step-by-step
process of removing empty basic blocks and flipping branch conditions and
taking advantage of fallthrough paths, instead doing last-minute edits as the
binary code is being emitted (see the
MachBuffer
implementation for all the details).
Then, in the third post, I'll talk about how I've used abstract interpretation to build a symbolic checker for our register allocator, which has been effective at finding several interesting bugs while fuzzing.
Stay tuned!
In the meantime, for any and all discussions about Cranelift, please feel free to join us on our Bytecode Alliance Zulip chat (here's a topic for this post)!
Thanks to Julian Seward and Benjamin Bouvier for reviewing this post and suggesting several additions and corrections.
Note that this description skips several quite important steps that come after instructions have encodings. Most importantly, we still must perform register allocation, which chooses machine registers to hold each value in the IR. This may involve inserting instructions as well, when values need to be spilled to or reloaded from the stack or simply moved between registers. Then, after several other housekeeping tasks (such as resolving branches and optimizing their forms for the actual machine-code offsets), we can actually use the encodings to emit machine code.
We also support a "mod" (modify) type of register reference that is both a use and def, while ensuring that the same register is allocated for the use- and the def-points. This replaces an earlier mechanism known as "tied operands" that introduced an ad-hoc constraint to the register allocator. Mods instead are handled by simply extending the live-range through the instruction.
The reversal scheme is actually a bit more subtle than this. We want to emit instructions in forward order within the lowering for a single CLIF instruction, but we visit CLIF instructions backward. To make this work, we keep a buffer of lowered VCode instructions per CLIF instruction in forward order; at the end of a single CLIF instruction, these are copied in reverse order to a buffer of lowered VCode instructions for the basic block. Because we visit instructions within the block backward, this buffer contains the VCode sequence for the basic block in reverse order. Then, at the end of the block, we reverse it again onto the tail of the VCode buffer. The end result is that we see VCode instructions in forward order for each CLIF instruction in forward order, contained within basic blocks in forward order (phew!).