This post is the first in a three-part series about my recent work on Cranelift as part of my day job at Mozilla. In this first post, I will set some context and describe the instruction selection problem. In particular, I'll talk about a revamp to the instruction selector and backend framework in general that we've been working on for the last nine months or so. This work has been co-developed with my brilliant colleagues Julian Seward and Benjamin Bouvier, with significant early input from Dan Gohman as well, and help from all of the wonderful Cranelift hackers.

Background: Cranelift

So what is Cranelift? The project is a compiler framework written in Rust that is designed especially (but not exclusively) for just-in-time compilation. It's a general-purpose compiler: its most popular use-case is to compile WebAssembly, though several other frontends exist, for example, cg_clif, which adapts the Rust compiler itself to use Cranelift. Folks at Mozilla and several other places have been developing the compiler for a few years now. It is the default compiler backend for wasmtime, a runtime for WebAssembly outside the browser, and is used in production in several other places as well. We recently flipped the switch to turn on Cranelift-based WebAssembly support in nightly Firefox on ARM64 (AArch64) machines, including most smartphones, and if all goes well, it will eventually go out in a stable Firefox release. Cranelift is developed under the umbrella of the Bytecode Alliance.

In the past nine months, we have built a new framework in Cranelift for the "machine backends", or the parts of the compiler that support particular CPU instruction sets. We also added a new backend for AArch64, mentioned above, and filled out features as needed until Cranelift was ready for production use in Firefox. This blog post sets some context and describes the design process that went into the backend-framework revamp.

It can be a bit confusing to keep all of the moving parts straight. Here's a visual overview of Cranelift's place among various other components, focusing on two of the major Rust crates (the Wasm frontend and the codegen backend) and several of the other programs that make use of Cranelift:

Figure: Cranelift and other components

Old Backend Design: Instruction Legalizations

To understand the work that we've done recently on Cranelift, we'll need to zoom into the cranelift_codegen crate above and talk about how it used to work. What is this "CLIF" input, and how does the compiler translate it to machine code that the CPU can execute?

Cranelift makes use of CLIF, or the Cranelift IR (Intermediate Representation) Format, to represent the code that it is compiling. Every compiler that performs program optimizations uses some form of an Intermediate Representation (IR): you can think of this like a virtual instruction set that can represent all the operations a program is allowed to do. The IR is typically simpler than real instruction sets, designed to use a small set of well-defined instructions so that the compiler can easily reason about what a program means. The IR is also independent of the CPU architecture that the compiler eventually targets; this lets much of the compiler (such as the part that generates IR from the input programming language, and the parts that optimize the IR) be reused whenever the compiler is adapted to target a new CPU architecture. CLIF is in Static Single Assignment (SSA) form, and uses a conventional control-flow graph with basic blocks (though it previously allowed extended basic blocks, these have been phased out). Unlike many SSA IRs, it represents φ-nodes with block parameters rather than explicit φ-instructions.

Within cranelift_codegen, before we revamped the backend design, the program remained in CLIF throughout compilation and up until the compiler emitted the final machine code. This might seem to contradict what we just said: how can the IR be machine-independent, but also be the final form from which we emit machine code?

The answer is that the old backends were built around the concept of "legalization" and "encodings". At a high level, the idea is that every Cranelift instruction either corresponds to one machine instruction, or can be replaced by a sequence of other Cranelift instructions. Given such a mapping, we can refine the CLIF in steps, starting from arbitrary machine-independent instructions from earlier compiler stages, performing edits until the CLIF corresponds 1-to-1 with machine code. Let's visualize this process:

Figure: legalization by repeated instruction expansion

A very simple example of a CLIF instruction that has a direct "encoding" to a machine instruction is iadd, which just adds two integers. On essentially any modern architecture, this should map to a simple ALU instruction that adds two registers.

On the other hand, many CLIF instructions do not map cleanly. Some arithmetic instructions fall into this category: for example, there is a CLIF instruction to count the number of set bits in an integer's binary representation (popcount); not every CPU has a single instruction for this, so it might be expanded into a longer series of bit manipulations. There are operations that are defined at a higher semantic level, as well, that will necessarily be lowered with expansions: for example, accesses to Wasm memories are lowered into operations that fetch the linear memory base and its size, bounds-check the Wasm address against the limit, compute the real address for the Wasm address, and perform the access.

To compile a function, then, we iterate over the CLIF and find instructions with no direct machine encodings; for each, we simply expand into the legalized sequence, and then recursively consider the instructions in that sequence. We loop until all instructions have machine encodings. At that point, we can emit the bytes corresponding to each instruction's encoding1.

Growing Pains, and a New Backend Framework?

There are a number of advantages to the legacy Cranelift backend design, which performs expansion-based legalization with a single IR throughout. As one might expect, though, there are also a number of drawbacks. Let's discuss a few of each.

Single IR and Legalization: Pros

  1. By operating on a single IR all the way to machine-code emission, the same optimizations can be applied at multiple stages. For example, consider a legalization expansion that turns a high-level "access Wasm memory" instruction into a sequence of loads, adds and bounds-checks. If many such sequences occur in one function, we might be able to factor out common portions (e.g.: computing the base of the Wasm memory). Thus the legalization scheme exposes as much code as possible, at as many stages as possible, to opportunities for optimization. The legacy Cranelift pipeline in fact works in this way: it runs "pre-opt" and "post-opt" optimization passes, before and after legalization respectively.

  2. If most of the Cranelift instructions become one machine instruction, and few legalizations are necessary, then this scheme can be very fast: it becomes simply a single traversal to fill in "encodings", which were represented by small indices into a table.

Single IR and Legalization: Cons

  1. Expansion-based legalization may not always result in optimal code. So far we've seen that legalization can convert from CLIF to machine instructions with one-to-one or one-to-many mappings. However, there are sometimes also single machine instructions that implement the behavior of multiple CLIF instructions, i.e. a many-to-one mapping. In order to generate efficient code, we want to be able to make use of these instructions.

    For example, on x86, an instruction that references memory can compute an address like base + scale * index, where base and index are registers and scale is 1, 2, 4, or 8. There is no notion of such an address mode in CLIF, so we would want to pattern-match the raw iadd (add) and ishl (shift) or imul (multiply) operations when they occur in the address computation. Then, we would want to somehow select the encoding on the load instruction based on the fact that its input is some specific combination of adds and shifts/multiplies. This seems to break the abstraction that the encoding represents only that instruction's operation.

    In principle, we could implement more general pattern matching for legalization rules to allow many-to-one mappings. However, this would be a significant refactor; and as long as we were reconsidering the design in whole, there were other reasons to avoid patching the problem in this way.

  2. There is a conceptual difficulty with the single-IR approach: there is no static representation of which instructions are expanded into which others and it is difficult to reason about the correctness and termination properties of legalization as a whole.

    Specifically, the expansion-based legalization rules must obey a partial order among instructions: if A expands into a sequence including B, then B cannot later expand into A. In practice, mappings were mostly one-to-one, and for those that weren't, there was a clear domain separation between the "input" high-level instructions and the "machine-level" instructions. However, for more complex machines, or more complex matching schemes that attempt to make better use of the target instruction set, this could become a real difficulty for the machine-backend author to keep straight.

  3. There are efficiency concerns with expansion-based legalization. At an algorithmic level, we prefer to avoid fixpoint loops (in this case, "continue expanding until no more expansions exist") whenever possible. The runtime is bounded, but the bound is somewhat difficult to reason about, because it depends on the maximum depth of chained expansions.

    The data structures that enable in-place editing are also much slower than we would like. Typically, compilers store IR instructions in linked lists to allow for in-place editing. While this is asymptotically as fast as an array-based solution (we never need to perform random access), it is much less cache-friendly or ILP-friendly on modern CPUs. We'd prefer instead to store arrays of instructions and perform single passes over them whenever possible.

  4. Our particular implementation of the legalization scheme grew to be somewhat unwieldy over time. Witness this GitHub issue, in which my eloquent colleague Benjamin Bouvier describes all the reasons we'd like to fix the design: #1141: Kill Recipes With Fire. This is no slight to the engineers who built it; the complexity was managed as well as could be, with a very nice DSL-based code generation step to produce the legalizer from high-level rule specifications. However, reasoning through legalizations and encodings become more cumbersome than we would prefer, and the compiler backends were not very accessible to contributors. Adding a new instruction required learning about "recipes", "encodings", and "legalizations" as well as mere instructions and opcodes, and finding one's way through the DSL to put the pieces together properly. A more conventional code-lowering approach would avoid much of this complexity.

  5. A single-level IR has a fundamental tension: for analyses and optimizations to work well, an IR should have only one way to represent any particular operation, i.e. should consist of a small set of canonical instructions. On the other hand, a machine-level representation should represent all of the relevant details of the target ISA. For example, an address computation might occur in many different ways (with different addressing modes) on the machine, but we would prefer not to have to analyze a special address-computation opcode in all of our analyses. An implicit rule at emission time ("a load with an add instruction as input always becomes this addressing mode") is not ideal, either.

    A single IR simply cannot serve both ends of this spectrum properly, and difficulties arose as CLIF strayed from either end. To resolve this conflict, it is best to have a two-level representation, connected by an explicit instruction selector. It allows CLIF itself to be as simple and as normalized as possible, while allowing all the details we need in machine-specific instructions.

For all of these reasons, as part of our revamp of Cranelift and a prerequisite to our new AArch64 backend, we built a new framework for machine backends and instruction selection. The framework allows machine backends to define their own instructions, separately from CLIF; rather than legalizing with expansions and running until a fixpoint, we define a single lowering pass; and everything is built around more efficient data-structures, carefully optimizing passes over data and avoiding linked lists entirely. We now describe this new design!

A New IR: VCode

The main idea of the new Cranelift backend is to add a machine-specific IR, with several properties that are chosen specifically to represent machine-code well (i.e., the IR is very close to machine code). We call this VCode, which comes from "virtual-register code", and the VCode contains MachInsts, or machine instructions. The key design choices we made are:

  • VCode is a linear sequence of instructions. There is control-flow information that allows traversal over basic blocks, but the data structures are not designed to easily allow inserting or removing instructions or reordering code. Instead, we lower into VCode with a single pass, generating instructions in their final (or near-final) order. I'll write more about how we make this efficient in a follow-up post.

    This design aspect avoids the inefficiencies of linked-list data structures, allowing fast passes over arrays of instructions instead. We've kept the MachInst size relatively small (16 bytes per instruction for AArch64) which aids code generation and iteration speed as well.

  • VCode is not SSA-based; instead, its instructions operate on registers. While lowering, we allocate virtual registers. After the VCode is generated, the register allocator computes appropriate register assignments and edits the instructions in-place, replacing virtual registers with real registers. (Both are packed into a 32-bit representation space, using the high bit to distinguish virtual from real.)

    Eschewing SSA at this level allows us to avoid the overhead of maintaining its invariants, and maps more closely to the real machine. Lowerings for instructions are allowed to, e.g., use a destination register as a temporary before performing a final write into it. If we required SSA form, we would have to allocate a temporary in this case and rely on the register allocator to coalesce it back to the same register, which adds compile-time overhead.

  • VCode is a container for MachInsts, but there is a separate MachInst type for each machine backend. The machine-independent part is parameterized on MachInst (which is a trait in Rust) and is statically monomorphized to the particular target for which the compiler is built.

    Modeling a machine instruction with Rust's excellent facilities for strongly-typed data structures, such as enums, avoids the issue of muddled instruction domain (is a CLIF instruction machine-independent, machine-dependent, or both?) and allows each backend to store the appropriate information for its encoding.

One can visualize a VCode function body as consisting of the following information (simplified; a real example is further below):

Figure: VCode is an array of instructions with block information

Note that the instructions are simply stored in an array, and the basic blocks are recorded separately as ranges of array (instruction) indices. As we described above, we designed this data structure for fast iteration, but not for editing. We always ensure that the first block (b0) is the entry block, and that consecutive block indices have contiguous instruction-index ranges (i.e., are placed next to each other).

Each instruction is mostly opaque from the point of view of the VCode container, with a few exceptions: every instruction exposes its (i) register references, and (ii) basic-block targets, if a branch. Register references are categorized into the usual "uses" and "defs" (reads and writes).2

Note as well that the instructions can refer to either virtual registers (here denoted v0..vN) or real machine registers (here denoted r0..rN). This design choice allows the machine backend to make use of specific registers where required by particular instructions, or by the ABI (parameter-passing conventions). The semantics of VCode are such that the register allocator recognizes live ranges of the real registers, from defs to uses, and avoids allocating virtual registers to those particular real registers for their live ranges. After allocation, all machine instructions are edited in place to refer only to real registers.

Aside from registers and branch targets, an instruction contained in the VCode may contain whatever other information is necessary to emit machine code. Each machine backend defines its own type to store this information. For example, on AArch64, here are several of the instruction formats, simplified:

pub enum Inst {
    /// An ALU operation with two register sources and a register destination.
    AluRRR {
        alu_op: ALUOp,
        rd: Writable<Reg>,
        rn: Reg,
        rm: Reg,
    },

    /// An ALU operation with a register source and an immediate-12 source, and a register
    /// destination.
    AluRRImm12 {
        alu_op: ALUOp,
        rd: Writable<Reg>,
        rn: Reg,
        imm12: Imm12,
    },

    /// A MOVZ with a 16-bit immediate.
    MovZ {
        rd: Writable<Reg>,
        imm: MoveWideConst,
        size: OperandSize,
    },

    /// A two-way conditional branch. Contains two targets; at emission time, a conditional
    /// branch instruction followed by an unconditional branch instruction is emitted, but
    /// the emission buffer will usually collapse this to just one branch. See a follow-up
    /// blog post for more!
    CondBr {
        taken: BranchTarget,
        not_taken: BranchTarget,
        kind: CondBrKind,
    },

    // ...
}

These enum arms could be considered similar to "encodings" in the old backend, except that they are defined in a much more straightforward way. Whereas old Cranelift backends had to define instruction encodings using a DSL, and these encodings were assigned a numeric index and a special bit-packed encoding for additional instruction parameters, here the instructions are simply stored in type-safe and easy-to-use Rust data structures.

We will not discuss the VCode data-structure design or instruction interface much further, except to note that the relevant instruction-emission functionality for a new machine backend can be implemented by providing a MachInst trait implementation for one's instruction type (and then lowering into it; see below). We believe, and early experience seems to indicate, that this is a much easier task than what was required to develop a backend in Cranelift's old DSL-based framework.

Lowering from CLIF to VCode

We've now come to the most interesting design question: how do we lower from CLIF instructions, which are machine-independent, into VCode with the appropriate type of CPU instructions? In other words, what have we replaced the expansion-based legalization and encoding scheme with?

In short, the scheme is a single pass over the CLIF instructions, and at each instruction, we invoke a function provided by the machine backend to lower the CLIF instruction into VCode instruction(s). The backend is given a "lowering context" by which it can examine the instruction and the values that flow into it, performing "tree matching" as desired (see below). This naturally allows 1-to-1, 1-to-many, or many-to-1 translations. We incorporate a reference-counting scheme into this pass to ensure that instructions are only generated if their values are actually used; this is necessary to eliminate dead code when many-to-1 matches occur.

Tree Matching

Recall that the old design allowed for 1-to-1 and 1-to-many mappings from CLIF instructions to machine instructions, but not many-to-1. This is particularly problematic when it comes to pattern-matching for things like addressing modes, where we want to recognize particular combinations of operations and choose a specific instruction that covers all of those operations at once.

Let's start by defining a "tree" that is rooted at a particular CLIF instruction. For each argument to the instruction, we can look "up" the program to find its producer (def). Because CLIF is in SSA form, either the instruction argument is an ordinary value, which must have exactly one definition, or it is a block parameter (φ-node in conventional SSA formulations) that represents multiple possible definitions. We will say that if we reach a block parameter (φ-node), we simply end at a tree leaf -- it is perfectly alright to pattern-match on a tree that is a subset of the true dataflow (we might get suboptimal code, but it will still be correct). For example, given the CLIF code:

block0(v0: i64, v1: i64, v2: b1):
  brnz v2, block1(v0)
  jump block1(v1)

block1(v2: i64):
  v3 = iconst.i64 64
  v4 = iadd.i64 v2, v3
  v5 = iadd.i64 v4, v0
  v6 = load.i64 v5
  return v6

let's consider the load instruction: v6 = load.i64 v5. A simple code generator could map this 1-to-1 to the CPU's ordinary load instruction, using the register holding v5 as an address. This would certainly be correct. However, we might be able to do better: for example, on AArch64, the available addressing modes include a two-register sum ldr x0, [x1, x2] or a register with a constant offset ldr x0, [x1, #64].

The "operand tree" might be drawn like this:

Figure: operand tree for load instruction

We stop at v2 and v0 because they are block parameters; we don't know with certainty which instruction will produce these values. We can replace v3 with the constant 64. Given this view, the lowering process for the load instruction can fairly easily choose an addressing mode. (On AArch64, the code to make this choice is here; in this case it would choose the register + constant immediate form, generating a separate add instruction for v0 + v2.)

Note that we do not actually explicitly construct an operand tree during lowering. Instead, the machine backend can query each instruction input, and the lowering framework will provide a struct giving the producing instruction if known, the constant value if known, and the register that will hold the value if needed. The backend may traverse up the tree (via the "producing instruction") as many times as needed. If it cannot combine the operation of an instruction further up the tree into the root instruction, it can simply use the value in the register at that point instead; it is always safe (though possibly suboptimal) to generate machine instructions for only the root instruction.

Lowering an Instruction

Given this matching strategy, then, how do we actually do the translation? Basically, the backend provides a function that is called once per CLIF instruction, at the "root" of the operand tree, and can produce as many machine instructions as it likes. This function is essentially just a large match statement over the opcode of the root CLIF instruction, with the match-arms looking deeper as needed.

Here is a simplified version of the match-arm for an integer add operation lowered to AArch64 (the full version is here):

match op {
    // ...
    Opcode::Iadd => {
        let rd = get_output_reg(ctx, outputs[0]);
        let rn = put_input_in_reg(ctx, inputs[0]);
        let rm = put_input_in_rse_imm12(ctx, inputs[1]);
        let alu_op = choose_32_64(ty, ALUOp::Add32, ALUOp::Add64);
        ctx.emit(alu_inst_imm12(alu_op, rd, rn, rm));
    }
    // ...
}

There is some magic that happens in several helper functions here. put_input_in_reg() invokes the proper methods on the ctx to look up the register that holds an input value. put_input_in_rse_imm12() is more interesting: it returns a ResultRSEImm12, which is a "register, shifted register, extended register, or 12-bit immediate". This set of choices captures all of the options we have for the second argument of an add instruction on AArch64. The helper looks at the node in the operand tree and attempts to match either a shift or zero/sign-extend operator, which can be incorporated directly into the add. It also checks whether the operand is a constant and if so, could fit into a 12-bit immediate field. If not, it falls back to simply using the register input. alu_inst_imm12() then breaks down this enum and chooses the appropriate Inst arm (AluRRR, AluRRRShift, AluRRRExtend, or AluRRImm12 respectively).

And that's it! No need for legalization and repeated code editing to match several operations and produce a machine instruction. We have found this way of writing lowering logic to be quite straightforward and easy to understand.

Backward Pass with Use-Counts

Now that we can lower a single instruction, how do we lower a function body with many instructions? This is not quite as straightforward as looping over the instructions and invoking the match-over-opcode function described above (though that would actually work). In particular, we want to handle the many-to-1 case more efficiently. Consider what happens when the add-instruction logic above is able to incorporate, say, a left-shift operator into the add instruction. The add machine instruction would then use the shift's input register, and completely ignore the shift's output. If the shift operator has no other uses, we should avoid doing the computation entirely; otherwise, there was no point in merging the operation into the add.

We implement a sort of reference counting to solve this problem. In particular, we track whether any given SSA value is actually used, and we only generate code for a CLIF instruction if any of its results are used (or if it has a side-effect that must occur). This is a form of dead-code elimination but integrated into the single lowering pass.

To know whether a value is used, we simply track a counter per value, initialized to zero. Whenever the machine backend uses a register input (as opposed to using a constant value directly, or incorporating the producing instruction's operation), it notifies the lowering driver that this register has been used.

We must see uses before defs for this to work. Thus, we iterate over the function body "backward". Specifically, we iterate in postorder; this way, all instructions are seen before instructions that dominate them, so given SSA form, we see uses before defs.

Finally, we have to consider side-effects carefully. This matters in two ways. First, if an instruction has a side-effect, then we must lower it into VCode even if its result(s) have no uses. Second, we cannot allow an operation to be merged into another if this would move a side-effecting operation over another or alter whether it might execute. We ensure side-effect correctness with a "coloring" scheme (in a forward pass, assign a color to every instruction, and update the color on every side effect and on every new basic block); the producing instruction is only considered for possible merging with its consuming instruction if it has no side-effects (hence can always be moved) or if it has the same color as the consuming instruction (hence would not move over another side effect).

The lowering procedure is as follows (full version here):

  1. Compute instruction colors based on side-effects.
  2. Allocate virtual registers to all SSA values. It's OK if we don't use some; an unused virtual register will not be allocated any real register.
  3. Iterate in postorder over instructions. If the instruction has a side-effect, or if any of its results are used, call into the machine backend to lower it.
  4. Reverse the VCode instructons so that they appear in forward order. 3

Easy!

Examples

Let's see how this works in real life! Consider the following CLIF code:

function %f25(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
  v2 = iconst.i32 21
  v3 = ishl.i32 v0, v2
  v4 = isub.i32 v1, v3
  return v4
}

We expect that the left-shift (ishl) operation should be merged into the subtract operation on AArch64, using the reg-reg-shift form of ALU instruction, and indeed this happens (here I am showing the debug-dump format one can see with RUST_LOG=debug when running clif-util compile -d --target aarch64):

VCode {
  Entry block: 0
Block 0:
  (original IR block: block0)
  (instruction range: 0 .. 6)
  Inst 0:   mov %v0J, x0
  Inst 1:   mov %v1J, x1
  Inst 2:   sub %v4Jw, %v1Jw, %v0Jw, LSL 21
  Inst 3:   mov %v5J, %v4J
  Inst 4:   mov x0, %v5J
  Inst 5:   ret
}

This then passes through the register allocator, has a prologue and epilogue attached (we cannot generate these until we know which registers are clobbered), has redundant moves elided, and becomes:

stp fp, lr, [sp, #-16]!
mov fp, sp
sub w0, w1, w0, LSL 21
mov sp, fp
ldp fp, lr, [sp], #16
ret

which is a perfectly valid function, correct and callable from C, on AArch64! (We could do better if we knew that this were a leaf function and avoided the stack-frame setup and teardown! Alas, many optimization opportunities remain.)

There are many other examples of interesting instruction-selection cases in our filetests. One of our favorite pastimes lately is to stare at disassemblies and find inefficient translations, improving the pattern-matching as required, so these are slowly getting better (my brilliant colleague Julian Seward has built a custom tool that dumps the hottest basic blocks from a given JIT execution and has found quite a number of improvements in our AArch64 and x86-64 backends).

Next: Efficient Code-Generation Passes, and Checking the Register Allocator

I've covered a lot of ground in this post, but there's still a lot more to say about the new Cranelift backend framework!

In the second post, I'd like to talk about how we designed the passes after VCode lowering to be as efficient as possible. In particular this will involve the way in which we simplify branches, which avoids the more usual step-by-step process of removing empty basic blocks and flipping branch conditions and taking advantage of fallthrough paths, instead doing last-minute edits as the binary code is being emitted (see the MachBuffer implementation for all the details).

Then, in the third post, I'll talk about how I've used abstract interpretation to build a symbolic checker for our register allocator, which has been effective at finding several interesting bugs while fuzzing.

Stay tuned!

In the meantime, for any and all discussions about Cranelift, please feel free to join us on our Bytecode Alliance Zulip chat (here's a topic for this post)!


Thanks to Julian Seward and Benjamin Bouvier for reviewing this post and suggesting several additions and corrections.

1

Note that this description skips several quite important steps that come after instructions have encodings. Most importantly, we still must perform register allocation, which chooses machine registers to hold each value in the IR. This may involve inserting instructions as well, when values need to be spilled to or reloaded from the stack or simply moved between registers. Then, after several other housekeeping tasks (such as resolving branches and optimizing their forms for the actual machine-code offsets), we can actually use the encodings to emit machine code.

2

We also support a "mod" (modify) type of register reference that is both a use and def, while ensuring that the same register is allocated for the use- and the def-points. This replaces an earlier mechanism known as "tied operands" that introduced an ad-hoc constraint to the register allocator. Mods instead are handled by simply extending the live-range through the instruction.

3

The reversal scheme is actually a bit more subtle than this. We want to emit instructions in forward order within the lowering for a single CLIF instruction, but we visit CLIF instructions backward. To make this work, we keep a buffer of lowered VCode instructions per CLIF instruction in forward order; at the end of a single CLIF instruction, these are copied in reverse order to a buffer of lowered VCode instructions for the basic block. Because we visit instructions within the block backward, this buffer contains the VCode sequence for the basic block in reverse order. Then, at the end of the block, we reverse it again onto the tail of the VCode buffer. The end result is that we see VCode instructions in forward order for each CLIF instruction in forward order, contained within basic blocks in forward order (phew!).