Compilation of JavaScript to Wasm, Part 3: Partial Evaluation
This is the final post of a three-part series covering my work on "fast JS on Wasm"; the first post covered PBL, a portable interpreter that supports inline caches, the second post covered ahead-of-time compilation in general terms, and this post discusses how we actually build the ahead-of-time compiler backends. Please read the first two posts for useful context!
In the last post, we covered how one might perform ahead-of-time compilation of JavaScript to low-level code -- WebAssembly bytecode -- in high-level terms. We discussed the two kinds of bytecode present in SpiderMonkey: bytecode for JavaScript function bodies (as a 1-to-1 translation from the source JS) and bytecode for the inline cache (IC) bodies that implement each operator in those function bodies. We outlined the low-level code that the AOT compilation of each bytecode would produce. But how do we actually perform that compilation?
Direct (Single-Pass) Compiler Style
It would be straightforward enough to implement a direct compiler from
each bytecode (JS and CacheIR) to Wasm. The most direct kind of
compiler is sometimes called a "template compiler" or (in the context
of JIT engines) a "template JIT": for each source bytecode, emit a
fixed sequence of target code. This is a fairly common technique, and
if one examines the compiler tiers of the well-known JIT compilers
today, one finds implementations such as
this
(implementation of StrictEq
opcode in SpiderMonkey's JS opcode
baseline compiler, emitting two pops, an IC invocation, and a push) or
this
(implementation of a ternary select
Wasm opcode in Wasmtime's Winch
baseline compiler, emitting three pops, a compare, a conditional move,
and a push). SpiderMonkey already has baseline compiler
implementations for JS
opcodes
and CacheIR
opcodes,
abstracted over the
MacroAssembler, allowing for reasonably easy retargeting to different ISAs. Why not
port these backends to produce Wasm bytecode?
Porting SpiderMonkey JIT to Wasm?
This approach has actually been taken: a more-or-less "direct port" to Wasm that emits Wasm bytecode "in-process" (inside the Wasm module) and then uses a special hostcall interface to add it as a new callable function. This works well enough where that hostcall approach is acceptable, but as I discussed last time, a few factors conspire against a direct port (i.e., the ISA target is only part of the problem):
-
For operational reasons (ahead-of-time deployment and instant start) and security reasons (less attack surface, statically-knowable code), Wasm-based platforms often disallow runtime code generation. The "add a new function" hostcall/hook might simply not be available.
-
We might be running very short-lived request handlers or similar, each starting from an identical snapshot: thus, there is no time to "warm up", generate appropriate Wasm bytecode at runtime, and load and run it, even though execution may still be bottlenecked on (many instances of) these short-lived request handlers.
There is another practical reason as well. JITs are famously hard to develop and debug because (i) the code that is actually running on the system does not exist anywhere in the source -- it is ephemeral -- and (ii) it is often tightly integrated with various low-level ABI details and system invariants which can be easy to get wrong. On native platforms, the difficulty of debugging subtle bugs in Firefox (and its SpiderMonkey JIT engine) led Mozilla engineers to develop rr, the well-known open-source reversible (time-travel) debugger. Now imagine developing a JIT that runs within a Wasm module with runtime codegen hooks, without a state-of-the-art debugging capability; in fact, in Wasmtime, which I help to develop, source-level debugging is improving slowly but no debugger for Wasm bytecode-level execution exists at all yet. (We hope to get there someday.) If I am to have any hope of success, it seems like I will need to find another way to nail down the correct semantics of the compiler's output -- either by testing and debugging in another (native) build somehow, or building better tooling.
A New Hope: Single Source of Semantics
Here I come back to PBL: I
already have an
interpreter
that I painstakingly developed for both CacheIR bodies of IC stubs,
and the JS bytecode that invokes them, faithfully upholding all the
invariants of baseline-level execution in SpiderMonkey. In addition,
work on PBL can proceed in a native build as well -- in fact much of
my debugging was with the venerable rr
, just as with any
native-platform SpiderMonkey development. PBL is "just" a portable C++
program, and so all of the work developing it on any platform then
transfers right to the Wasm build as well. PBL embodies a lot of
hard-won work encoding the correct semantics for each opcode,
sometimes not well-documented in the native backends -- for example,
this
tidbit
(far too much time to discover that wrinkle!), or this
one. What's
more, these semantics sometimes change, or new opcodes are added.
Ideally we do not add more locations that encode these semantics than absolutely necessary -- once per tier is already quite a lot. Can we somehow develop (or -- major foreshadowing here -- automatically derive) our compiler from all of the semantics written in direct style in interpreter code?
Let's take a look once more at (slightly simplified) implementations of an "integer add" opcode on a hypothetical interpreted stack machine, and then a baseline compiler implementation of that opcode where the operand stack uses the native machine stack1:
// Interpreter                     | // Compiler
...                                | ...
case Opcode::Add: {                | void visit_Add(Assembler& asm) {
    uint32_t lhs = stack.pop();    |     asm.pop(r1);
    uint32_t rhs = stack.pop();    |     asm.pop(r2);
    uint32_t result = lhs + rhs;   |     asm.add(r1, r2);
    stack.push(result);            |     asm.push(r1);
    break;                         | }
}                                  |
...                                |
If you squint enough, you might almost forget whether you're looking at the interpreter (directly executing the semantics) or the compiler (indirectly executing the semantics, by emitting code). This correspondence is not accidental, and the observation that doing-the-thing and emitting-code-to-do-the-thing are so close is at the heart of what is sometimes called staged programming, with (again given sufficient squinting) practical examples in Lisp macros, "two-level functional languages", MetaML, and more modern takes that strive to provide as transparent a syntax as possible. There are even proposals to leverage this correspondence directly when writing JIT compilers (see HolyJit), writing down the semantics once to provide both the interpreter and the compiler tier.
Most prominently in this space, of course, is GraalVM (alternate PDF), a production-grade JIT compiler framework that takes interpreters for arbitrary languages and partially evaluates the interpreter code itself, together with the user's program, to produce compiled code. This mind-bending technique is known as the first Futamura projection. (This is the approach we will take too, with some significant differences.)
How does this work?
The First Futamura Projection
Let's say that we have a function f(x, y)
. If we take x
as some
constant C
, we should be able to derive a function f_C(y) = f(C, y)
that eliminates all occurrences of x
(i.e., x
is no longer a
free variable). This is "just" constant propagation, or partial
application, of
the function.
A little more concretely, let's take f
to be an interpreter, with
inputs x
, the program to be interpreted, and y
, the input that the
interpreted program computes on. f(x, y)
is then the result of
running the program. If we could find f_C(y)
, then f_C
would be,
somehow, a combination of the interpreter and the program it
interprets, merged together: a compiled program.
Futamura, in his seminal paper, calls this combination a "projection" and describes three levels of projection. The first is as above: combining an interpreter with its program. The second and third projections are far more exotic: combining the partial evaluator itself with an interpreter, producing a compiler (that can then be applied to a program); or combining the partial evaluator with itself as input, producing a compiler-compiler (that can then be applied to an interpreter, producing a compiler, that can then be applied to an input program, producing compiled output). Mind-bending stuff.
I can now tell you what weval, the WebAssembly partial evaluator, is: it is a tool that derives the first Futamura projection of an interpreter inside a snapshot of a WebAssembly module together with its input (bytecode).
Alright, but how?! How does one "combine" an interpreter with its input?
The thinking that gives rise to the Futamura projections is very
algebraic in nature. In conventional algebra (with addition and
multiplication over the reals, for example, as taught to students
everywhere), we have a notion of "plugging in" certain constants and
simplifying the expression. Given z = 2x + 3y + 4
, we can hold x
constant, say x = 10
, then "partially evaluate" z = 2*10 + 3y + 4 = 3y + 24
. We have produced an output (expression) that fully
incorporates the new knowledge and has no further references to the
input (variable) x
.
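To make the analogy concrete in code, here is a tiny, purely illustrative C++ version of that same step; f_10 is the residual function that a partial evaluator would derive mechanically:
#include <cstdio>

// The general function: z = 2x + 3y + 4.
int f(int x, int y) { return 2 * x + 3 * y + 4; }

// The residual function after holding x = 10 constant and simplifying:
// z = 2*10 + 3y + 4 = 3y + 24. No reference to x remains.
int f_10(int y) { return 3 * y + 24; }

int main() {
  std::printf("%d %d\n", f(10, 7), f_10(7));  // both print 45
  return 0;
}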
An interpreter and its interpreted program are far, far more complex than an algebraic expression and a numeric constant, of course. Let's explore in a bit more detail how one might "plug in" a program to an interpreter and simplify the result.
Practical First Futamura Projection via Context Specialization
A Naive Approach: Constant Propagation
We might start with the observation that, in the compiler-optimization sphere, constant folding looks a lot like the "plug in a value and simplify" mechanism that we know from algebra. In essence, what we want to say is: take an interpreter, specify that the program it interprets is a constant, and (somehow) propagate the consequences of that through the code.
Let's take a look at an interpreter body that interprets a single opcode first:
switch (*pc) {
    // ...
    case Opcode::Add: {
        uint32_t lhs = stack[sp++];
        uint32_t rhs = stack[sp++];
        uint32_t result = lhs + rhs;
        stack[--sp] = result;
        break;
    }
    // ... other opcodes ...
}
Let's take pc
to be a constant, and furthermore assume that the
memory that pc
points to (the bytecode) is constant; and for this
example, assume that pc[0] == Opcode::Add
. Let's also say that sp
starts at 1024
. Then we can "constant fold" the whole interpreter by
(i) simplifying the switch-statement, which is now operating on a
constant input (the opcode), and (ii) propagating through any other
constants we know based on initial state, such as sp
:
// at pc[0]: Opcode::Add, with initial sp == 1024.
uint32_t lhs = stack[1024];
uint32_t rhs = stack[1025];
uint32_t result = lhs + rhs;
stack[1025] = result;
sp = 1025;
If one squints appropriately, one could claim that we "compiled" the
Add
opcode here: we converted the generic interpreter, with
switch-cases for every opcode and operand-stack accesses parameterized
on current stack depth, to code that is specific to the actual program
we're interpreting.
So is that it? Can we do a first Futamura projection by holding the "bytecode" memory constant, and doing constant folding (including branch folding)? It can't be that easy, right?
The Problem: Lattices and Loops
Indeed, it becomes much more difficult quickly. Consider what happens
as we widen our scope from programs of one opcode to programs of
two opcodes (!). We'll have to update our interpreter to include a
loop, and an update to pc
after it "fetches" each opcode:
while (true) {
    switch (*pc) {
        // ...
        case Opcode::Add: {
            uint32_t lhs = stack[sp++];
            uint32_t rhs = stack[sp++];
            stack[--sp] = lhs + rhs;
            break;
        }
        // ... other opcodes ...
    }
    pc++; // advance to the next opcode after interpreting this one
}
A compiler's constant-propagation/folding pass, and other passes that
compute properties of program values and then mutate the program
accordingly, usually are built around an iterated dataflow fixpoint
solver2. To modify a program at compile time, we must work with
properties that are always true -- even when a particular expression
or variable in the program might have many different values during
execution. For example, a loop body might increment an index
variable, giving a different value for each iteration of the loop, so
we cannot conclude that the value is constant. We need a way of
merging different possibilities together to find properties that are
true in all cases.
The constant-propagation analysis works by computing properties that
lie in a lattice, a
mathematical structure with values that allow for "merging" (the
"meet" function) and with certain properties around that merging that
make it well-behaved, so that we always arrive at the same single
solution. Typically, on a lattice used by a
constant-propagation/folding analysis, if we see in one instance that
x
is constant K1
, and in another instance that x
is constant
K2
, then we have to move "down the lattice" and "merge" to a fact
that claims nothing at all: "x
has some arbitrary non-constant value
at runtime". This is sometimes called the "meet-over-all-paths"
solution, because we merge possible behavior over all control-flow
paths that could lead to some program point.
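For concreteness, here is a minimal sketch of such a constant lattice and its meet function, as one would write in any textbook constant-propagation pass; this is a generic illustration, not weval's actual data structures:
#include <cstdint>

// Abstract value for constant propagation: either "not yet seen" (Top),
// a single known constant, or "could be anything at runtime" (Unknown).
struct AbstractValue {
  enum Kind { Top, Constant, Unknown };
  Kind kind;
  uint64_t value;  // meaningful only when kind == Constant
};

// The lattice "meet": merge the facts arriving along two control-flow paths.
AbstractValue meet(AbstractValue a, AbstractValue b) {
  if (a.kind == AbstractValue::Top) return b;
  if (b.kind == AbstractValue::Top) return a;
  if (a.kind == AbstractValue::Constant && b.kind == AbstractValue::Constant &&
      a.value == b.value) {
    return a;  // same constant on every path: still a constant
  }
  return {AbstractValue::Unknown, 0};  // conflicting facts: claim nothing
}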
Consider the pc
variable in this loop: on the first iteration, it
begins at the start of the bytecode buffer (offset 0
). After one
opcode, it lies at offset 1
. And so on. Can a constant-propagation
pass "bake in" a single constant value for pc
that will be valid for
every iteration of the loop? No: in fact, it is not constant.
Likewise for the bytecode pointed-to by pc
: in the first iteration
of the loop, it may be Add
, and in the second iteration, something
else; there is no single constant opcode that we can propagate into
the switch
statement to simplify the whole interpreter body down to
the specific code for each opcode.
The essence of the problem here is that the first Futamura projection of an interpreter needs to somehow be aware of the interpreter loop. In essence, it needs to "unroll" the loop: it needs to do a separate constant propagation for each opcode.
One might be tempted to build a transform that "traces" across edges,
including the interpreter loop
backedge(s). PyPy's metatracing works something
like this. One runs into two issues though: (i) the compilation is
"loop-centric" (rather than complete compilation of each
function/method), and (ii) merge-points are tricky. Consider an
if-else sequence with two sides that eventually jump back to the same
pc
. Do we keep "tracing" the path on each side or do we detect
reconvergence and stitch the resulting code back together? That is,
given bytecode with a conditional branch at pc=1 whose two arms
(pc=2,3 and pc=4) rejoin at pc=5, we might naively "follow the control
flow" in the interpreter, generating one path that is pc=0,1,2,3,5,6,7
and another that is pc=0,1,4,5,6,7:
Note that we have duplicated the "tail" (pc=5
onward) and this code
growth can become exponential if we have, say, a tree of conditionals.
The property that we want is that the CFG of the original bytecode is somehow reflected into the compiled code: take the original structure of the bytecode, for each opcode simulate the execution of one iteration of the interpreter loop (on that opcode), and stitch it together. So given the initial interpreter loop (left) and the bytecode being interpreted (right) here:
we want something like the following, where each interpreted opcode is initially "unrolled" into a whole copy of the interpreter loop:
This might seem a bit wild at first -- a whole bunch of copies of the entire interpreter? -- until one sees that this reduces the problem to a previously-solved one above, namely, how to specialize an interpreter for a single opcode. We can take each copy of the interpreter loop and constant-propagate and branch-fold it. The effect is as if we surgically plucked the implementations of each opcode out of the middle of the interpreter CFG and built a compiled version of the bytecode with them:
To make this work, it still seems like we would need some sort of "visibility" into the workings of the interpreter: we would need to tell the transform tool, the first Futamura projector, about our bytecode and its structure, and the interpreter loop that is meant to operate on it, so it could perform this careful surgery.
GraalVM solves this problem in a very simple and direct way: the interpreter needs to be written in terms of GraalVM AST classes, and given uses of those classes, the transform can "pick out" the right parts of the logic, while otherwise being aware of the overall CFG of the interpreted program directly.
I considered but rejected such an approach: I did not want to build a tool that would require rewriting an existing interpreter, or deep surgery to rebuild it in terms of the framework's new abstractions. Such a transform would be very error-prone and hinder adoption. Is there another way to get the transform to "see" the iterations of the interpreter loop separately?
Contexts
The breakthrough for weval came when I realized that context sensitivity, a standard tool in static analysis (e.g. pointer/alias analysis), could allow us to escape the tyranny of meet-over-all-paths collapsing our carefully-curated constant values into a mush of "runtime".
The key idea is to make pc
itself the context. The constant-folding
analysis otherwise doesn't know anything about interpreters; it just
has an intrinsic that means "please process the code that flows from
this point in a different context", and it keeps the "constant value"
state separately for each context.
Given this intrinsic, we can "simply" update the context whenever we
update pc
; so the loop looks something like this:
update_context(pc); // intrinsic visible to the analysis, otherwise a no-op
while (true) {
    switch (*pc) {
        // ...
        case Opcode::Add: {
            uint32_t lhs = stack[sp++];
            uint32_t rhs = stack[sp++];
            stack[--sp] = lhs + rhs;
            break;
        }
        // ... other opcodes ...
    }
    pc++;
    update_context(pc); // intrinsic visible to the analysis, otherwise a no-op
}
Let's walk through a constant-folding example. We start with pc
at
the beginning of the bytecode (for simplicity let's say pc=0
, though
it's actually somewhere in the heap snapshot), and we know that *pc
is Opcode::Add
. So in the context of pc=0
, we can know that pc
at the top of the loop starts as a constant. We see the switch
, we
branch-fold because the opcode at *pc
is a constant (because pc
is
a constant and it points to constant memory). We trace through the
implementation of Opcode::Add
(and disregard all the other
switch-cases). We increment pc
and see a loop backedge -- isn't this
where the constant-value analysis sees that the pc=0
case (this
iteration) and the pc=1
case (the next iteration) "meet" and we
can't conclude it's a constant at all?
No! Because just before the loop backedge, we updated the
context. As we trace through the code and track constant values, and
follow control-flow edges and propagate that state to their
destinations, we are now in the context of pc=1
. We reach the loop
header again, but in a new context, so nothing collapses to "not
constant" / "runtime"; pc
is actually a known constant
everywhere. In other words, we go from this analysis situation:
where the analysis correctly determines that PC is not constant (the
backedge carries a value pc=1
into the same block that receives
pc=0
from the entry point, so we can only conclude Unknown
), to
this one:
The contexts are thus the way in which we do the code duplication implied earlier (a separate copy of the interpreter body for each PC location, then simplify). However, note that they are completely driven by the intrinsics in the code that the interpreter developer writes: the analysis does not know about the interpreter loop otherwise.
The overall analysis loop processes (context, block)
tuples, and
carries abstract state (known constant, or Unknown
) per (context, SSA value)
tuple. When we perform the transform, we duplicate each
original basic block into a basic block per context, but only on
demand, when the block is reached in some context. Branches in each
block resolve to the appropriate target block in the appropriate
(possibly updated) context. This is key: the interpreter backedge in
the original CFG becomes an edge from "tail block in context i" to
"loop header block in context i+1"; it becomes an edge to the next
opcode's code in the compiled function body. This may be a forward
edge or a backedge, depending on the original bytecode CFG's shape.
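A minimal sketch of that bookkeeping might look like the following (illustrative types only, not weval's real internals; the full pseudocode appears below):
#include <cstdint>
#include <deque>
#include <map>
#include <utility>

using Context = uint64_t;    // e.g., derived from the specialized pc value
using OrigBlock = uint32_t;  // block index in the original (interpreter) function
using SpecBlock = uint32_t;  // block index in the specialized output function

struct Specializer {
  // Specialized blocks are created on demand, keyed by (context, original block).
  std::map<std::pair<Context, OrigBlock>, SpecBlock> block_map;
  std::deque<std::pair<Context, OrigBlock>> workqueue;

  // Resolve a branch target: the interpreter backedge from the tail block in
  // context i becomes an edge to the loop header in context i+1 -- that is,
  // to the compiled code for the next opcode.
  SpecBlock target(Context next_ctx, OrigBlock orig_target) {
    auto key = std::make_pair(next_ctx, orig_target);
    auto it = block_map.find(key);
    if (it == block_map.end()) {
      SpecBlock fresh = static_cast<SpecBlock>(block_map.size());
      it = block_map.emplace(key, fresh).first;
      workqueue.push_back(key);  // newly reached (context, block): process it later
    }
    return it->second;
  }
};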
One can see this as a sort of tracing through the control flow of the
interpreter, but with one crucial difference: contexts serve as a way
to reconnect merge points. We thus don't get straight-line traces;
rather, when we encounter a point in the interpreter that we could
branch to pc=K1
or pc=K2
, we see a context update to K1
or K2
and a branch to entry
; we emit in the specialized function body an
edge to the (K1, entry)
or (K2, entry)
block, which is the point
in the compiled code that corresponds to the start of the "copy of the
interpreter for that opcode".
We thus get exactly the behavior we outlined as our desired endpoint above, where the bytecode's CFG gets populated with interpreter cases for each opcode, and it "just falls out" of context sensitivity. This is mind-blowing (at least, to me!).
An animation of a worked example for a simple two-opcode loop is available in my talk here.
Generic Mechanism vs. Interpreter-Specific Transform
Seen a certain way, the use of constant-folding plus contexts seems a
little circuitous at first: pc
at the top of the loop is a constant
k
when we're in context pc=k
; isn't that more complex than a
mechanism that "just" understands interpreter PCs directly? Well,
perhaps on the surface, but actually the implications of the
context-sensitivity are fairly deep and make much of the rest of the
analysis "just work" as well. In particular, all values in the loop
body get their own PC-specific context to specialize in, and
context-sensitivity is a fairly mechanical change to make to a
standard dataflow analysis, whereas something specific to
interpreters would likely be a lot more fragile and complex to
maintain.
One way that the robustness of this "compositional" approach becomes
more clear is when considering (and trying to convince oneself of) the
correctness of the approach. An extremely important property of the
"context" is that it is, with respect to correctness (i.e., semantic
equivalence to the original interpreter), purely heuristic. We could
decide to compute some arbitrary value and enter that context at any
point; we could get the PC "wrong", miss an update somewhere, etc. The
worst that will happen is that we have two different constant values
of pc
merge at the loop header, and we can no longer branch-fold and
specialize the interpreter loop body for an opcode. The degenerate
behavior is that we get a copy of the interpreter loop body (with
runtime control flow remaining) for every opcode. That's not great as
an optimization, and of course we don't want it, but it is still
correct. There is nothing that the (minor) modifications to the
interpreter, to add context sensitivity, can get wrong that will break
the semantics.
The weval Transform
We're finally ready to see "the weval transform", which is the heart of the first Futamura projection that weval performs. It actually does the constant-propagation analysis at the same time as the code transform.
The analysis and transform operate on SSA-based IR, which weval obtains from the original Wasm function via my waffle library, and then compiles the SSA IR of the resulting specialized function back to Wasm function bytecode.
The transform can be summarized with the following pseudocode:
Input:
- original function in SSA, with some parameters whose semantics are
  specified as "constant" or "points to constant mem", and with
  "update_context" intrinsics added.

Output:
- specialized function in SSA, with semantics specified as identical to
  the original function, given constants for parameters, and assuming
  "constant" memory remains the same.
State:
- Map from (context, block_orig) to block_specialized
- Map from (context, value_orig) to value_specialized
- Workqueue of (context, block_orig)
- Abstract state (Constant, ConstantMem or Unknown)
for each (context, value)
- Dependency map from value_specialized to
set((context, block_orig))
- Flow-sensitive state
(context, abstract values for any global state)
per basic-block input
Init:
- Push (empty context, entry block) onto workqueue.
Main loop:
- Pull a (context, block) from the workqueue.
- If a specialized block doesn't exist in the map,
create one.
- For each value in the original block:
- Fetch the abstract values for all operands
(Constant or Unknown).
- Evaluate the operator: if constant input(s) imply
a constant output, compute that.
- Copy either the original operator with translated
input values, or a "constant value" operator,
to the output block.
- Update value map and abstract state.
- If dependency map shows dependency to
already-processed block, enqueue on workqueue
for reprocessing.
- On visiting an update_context intrinsic, update
flow-sensitive state.
- For the block terminator:
- Branch-fold if able based on constant inputs
to conditionals or switches.
- For each possible target,
- update blockparams, meeting into existing
abstract state;
- enqueue on workqueue if blockparam abstract state
changed or if new block;
- rewrite target in branch to specialized block
for this context, block pair.
And that's it!3 Note a few things in particular:
-
The algorithm is never aware that it is operating on "an interpreter", much less any specific interpreter. This is good: we don't need to tightly couple the compiler tooling here with the use-case (a SpiderMonkey interpreter loop), and it means that we can debug and test each separately.
-
The algorithm, at this high level at least, has no special cases or other oddities: it is more or less what one would expect out of a constant-propagation and branch folding pass that operates in a fixpoint, with the only really novel part that it keeps a "current context" as state and parameterizes all lookups in maps on that context. Once one accepts that "duplication of code" (processing the same blocks multiple times, in different contexts) is correct, it's reasonable to convince oneself that the whole algorithm is correct.
-
The algorithm leaves room for other kinds of flow-sensitive state and intrinsics, if we do want to provide more optimizations to the user. (More on this below!)
I am omitting some details around SSA properties -- specifically, the use of values in basic blocks in the original function that are defined in other blocks higher in the domtree (which is perfectly legal in SSA). Because the new specialized function body may have different dominance relationships between blocks, some of these use-def links are no longer legal. The naive approach to this problem (and the one I took at first) is to transform into "maximal SSA" first, in which all live values are carried as blockparams at every block; that turns out to be really expensive, so weval computes a "cut-set" of blocks at which max-SSA is actually necessary based on the locations of context-update intrinsics (intuitively, approximately just the interpreter backedge, but we want the definition of this to be independent of the notion of an interpreter and where it does context updates).
Using weval: "Function Specialization in a Snapshot" Abstraction
So how does one interact with this tool, as an interpreter author?
As noted above, for a Futamura projection, the key abstraction is "function specialization" (or partial evaluation). In the context of weval, this means that the interpreter requests a specialization of some "generic" interpreter function, with some constant inputs (at least the bytecode, presumably!), and gets back a function pointer to another function in return. That specialized function is semantically equivalent to the original, except that it has all of the specified constants (function parameters and memory contents) "baked in" and has been optimized accordingly.
Said another way: the specialized function body will be the compiled form of the given bytecode.
Because the specialization works in the context of data in memory (the
bytecode), it has to happen in the context of running program
state. For example, in SpiderMonkey, there is a class JSFunction
that ultimately points to some bytecode, and we want to produce a new
Wasm function that bakes in everything about that
JSFunction
. Because this object only exists in the Wasm heap after
the Wasm code has run, parsed some JS source, allocated some space on
the heap, and emitted its internal bytecode, we cannot do the
specialization on the "original Wasm" outside the scope of some
particular execution.
Recall from the earlier post that we want a "phase separation": we want a fully ahead-of-time process. We definitely cannot embed a weval transform in a Wasm engine and expose it at runtime. How do we reconcile this need for phasing with weval's requirement to see the heap snapshot and specialize over it?
The answer comes in the Wizer tool, or more generally the notion of snapshotting: we modify the top-level logic of the language runtime so that we can parse source and produce internal bytecode when invoked with some entry point, and set up global state so that an invocation from another entry point actually runs the code. This work has already been done for various runtimes, including SpiderMonkey, because it is a useful transform to "bundle" an interpreter with pre-parsed bytecode and in-memory data structures, even if we don't do any compilation.
weval thus builds on top of Wizer: it takes a Wasm snapshot, with all the runtime's data structures, finds the bytecode and interpreter function, does the specialization, and appends new functions to the Wasm snapshot.
How does weval find what it needs to specialize? The interpreter, inside the Wasm module, makes "weval requests" via an API that (nominally, in simplified terms) looks like:
void weval_request(InterpretFn generic, InterpretFn* specialized, SpecializationParams* params);
where generic
is a function pointer to the generic function,
specialized
is a location on the heap to place a function pointer to
the resulting specialized function, and params
encodes all of the
information to specialize on (including the pointer to bytecode).
Internally, this "request" is recorded as a data structure with a well-known format in the Wasm heap, and is linked into a linked list that is reachable via a well-known global. weval knows where to look in the Wasm snapshot to find these requests.
From the point of view of the guest, these requests are fulfilled
asynchronously: the specialized
function pointer will not be filled
in right away with a new function, but will be at some point in the
future. The interpreter thus must have a conditional when invoking
each function: if the specialized version exists, call that, otherwise
call the generic interpreter body.
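For example, the creation and call paths for a guest function might look roughly like this sketch, using the simplified request signature above (hypothetical types and names throughout, with stub bodies standing in for a build in which no specialization has happened yet; this is not SpiderMonkey's or weval's actual code):
#include <cstdint>

// Hypothetical stand-ins, not SpiderMonkey's or weval's real identifiers.
struct Args;
struct SpecializationParams;
struct Script;
struct Value { uint64_t bits = 0; };
using InterpretFn = Value (*)(Script*, Args*);

Value InterpretGeneric(Script*, Args*) { return Value{}; }    // the generic interpreter body
SpecializationParams* MakeParams(Script*) { return nullptr; } // bytecode pointer, constants, ...
void weval_request(InterpretFn /*generic*/, InterpretFn* /*specialized*/,
                   SpecializationParams* /*params*/) {
  // Recorded in a well-known list in the Wasm heap; fulfilled at snapshot time.
}

struct Script {
  InterpretFn specialized_entry = nullptr;  // filled in by weval, asynchronously
};

void OnScriptCreated(Script* script) {
  // Enqueue a request: specialize the generic interpreter on this script's bytecode.
  weval_request(&InterpretGeneric, &script->specialized_entry, MakeParams(script));
}

Value CallScript(Script* script, Args* args) {
  // Fall back to the generic interpreter until the pointer has been filled in.
  return script->specialized_entry ? script->specialized_entry(script, args)
                                   : InterpretGeneric(script, args);
}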
Why asynchronously? Because this is how snapshotting appears from the
guest perspective: code runs during Wizer's pre-initialization phase,
the language runtime's frontend parses source and creates bytecode,
and weval requests are created for every function that is "born". It
is only when pre-initialization execution completes, a snapshot of the
heap is taken, and that snapshot is processed by weval, that weval can
append new functions, fill in the function pointers in the heap
snapshot, and write out a new .wasm
file. When that Wasm module is
later instantiated, it "wakes up" again from the snapshot and the
specialized functions suddenly exist. Voilà, compiled code.
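Concretely, the guest-side phase split looks something like the following sketch; the wizer.initialize export name is Wizer's convention, while everything else here is a hypothetical stand-in (stubbed so the sketch is self-contained):
// Hypothetical runtime entry points, stubbed for illustration.
static void ParseScriptsAndEnqueueWevalRequests() { /* parse JS, emit bytecode, enqueue requests */ }
static void RunTopLevel() { /* execute, using specialized functions where they exist */ }

// Phase 1: runs under Wizer's pre-initialization, before the heap snapshot.
// Every function "born" here gets a weval request; afterward the heap is
// snapshotted, and weval appends specialized functions and fills in pointers.
extern "C" __attribute__((export_name("wizer.initialize")))
void wizer_initialize() {
  ParseScriptsAndEnqueueWevalRequests();
}

// Phase 2: runs when the finished .wasm is later instantiated. The snapshot
// "wakes up" with the specialized function pointers already present.
extern "C" __attribute__((export_name("run")))
void run() {
  RunTopLevel();
}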
Correctness via Semantics-Preserving Specialization
It's worth emphasizing one aspect of this whole design again. In order
to use weval, an interpreter (i) adds some update_context
intrinsics
whenever pc
is updated, and (ii) makes weval requests for function
pointers to specialized code. That's it.
In particular, nothing in the interpreter has to encode the execution semantics of the guest-language bytecode in a way that is only seen during compiled-code execution. Rather, weval'd compiled code and interpreted code execute using exactly the same opcode implementations, by construction. This is a radical simplification in the testing complexity of the whole source-language compilation approach: we can test and debug the interpreter (and running wherever we like, including in a native rather than Wasm build), and we can separately test and debug weval.
How do we test weval? During its development, I added a lockstep execution mode where the "generic" function and "specialized" function both run in a snapshot and we compare the results. If the transform is correct, they should be identical, independent of whatever the interpreter code is doing. Certainly we are running end-to-end tests of weval'd JS code as well for added assurance, but this test factoring was a huge help in the "bring-up" of this approach.
Optimized Codegen: Abstracted Interpreter State
One final suite of optimization ideas extends from the question: how can we make the interpreter's management of guest-language state more efficient?
To unpack this a bit: when compiling any imperative language (at least languages with reasonable semantics), we can usually identify elements of the guest-language state, such as local variables, and map that state directly to "registers" or other fast storage with fixed, static names. (In Wasm bytecode, we use locals for this.) However, when interpreting such a language, usually we have implementations for opcodes that are generic over which variables we are reading and writing, so we have runtime indirection. To make this concrete, if we have a statement
z = x + y;
we can compile this to something like
add z, x, y
in machine code or Wasm bytecode (with add
suitably replaced by
whatever the static or dynamic types require). But if we have a case
for this operator in an interpreter loop, we have to write
regs[z] = regs[x] + regs[y];
and in the compiled interpreter body, this implies many reads and
writes to the regs
array in memory.
The weval approach to compilation-by-specializing-interpreters
directly copies the interpreter cases for each opcode. We can
specialize the register/variable indices x
, y
, and z
in the
above code, but we cannot (or should not) turn memory loads and stores
to an in-memory regs
array into direct values in registers (Wasm
locals).
Why? Not only would this require a fair amount of complexity, it would
often be incorrect, at least in the case where regs
is observable
from other parts of the interpreter. For example, if the interpreter
has a moving garbage collector, every part of the interpreter state
might be subject to tracing and pointer rewriting at any "GC
safepoint", which could be at many different points in the interpreter
body.
Better, and more in the spirit of weval, would be to provide intrinsics that the interpreter can opt into to indicate that some value storage can be rewritten into direct dataflow and memory loads and stores can be optimized out. weval provides such intrinsics for "locals", which are indexed by integers that must be constant during const-prop (i.e., must come directly from the bytecode), and for an "operand stack", so that interpreters of stack VM-based bytecode (as SpiderMonkey's JS opcode VM is) can perform virtual pushes and pops. weval tracks the abstract state of these locals and stack slots, and "flushes" the state to memory only on a "synchronize" intrinsic, which the interpreter uses when its state may be externally observable (or never, if its design allows that). These intrinsics provided a substantial speedup in SpiderMonkey's case.
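As a rough illustration, an opcode case written against such intrinsics might look like this (hypothetical intrinsic names, not weval's exact API; in a native or non-weval'd build they would simply read and write the ordinary in-memory arrays):
#include <cstdint>
using Value = uint64_t;

// Hypothetical intrinsic declarations (stand-ins for weval's real header).
extern Value weval_local_get(uint32_t idx);           // idx must be constant under const-prop
extern void  weval_local_set(uint32_t idx, Value v);
extern void  weval_sync();                            // flush virtual state back to memory

// Illustrative opcode: locals[dst] = locals[a] + locals[b]. In the specialized
// function the gets/sets become direct dataflow (SSA values / Wasm locals);
// memory is only touched at weval_sync() points.
void interpret_add_locals(const uint8_t* pc) {
  uint32_t a = pc[1], b = pc[2], dst = pc[3];  // constants baked in from the bytecode
  weval_local_set(dst, weval_local_get(a) + weval_local_get(b));
}

void interpret_call(const uint8_t* pc) {
  (void)pc;
  weval_sync();  // the callee (and any GC safepoint it reaches) may observe state in memory
  // ... dispatch the call through the interpreter's normal path ...
}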
Other Optimizations and Details
I've elided a fair number of other details, of course! Worthy of note are a few other things:
-
The weval tool has top-level caching per "specialization request". It takes a hash of the input Wasm module and the "argument string", which includes the bytecode and other constant values, of each specialization request. If it has seen the exact input module and a particular argument string before, it copies the resulting Wasm function body directly out of the cache, as fast as SQLite and the user's disk can go. This turns out to be really useful for the "AOT ICs" corpus mentioned in the previous post.
-
I added special handling to elide the remnants of LLVM's shadow stack in function bodies where constant-propagation and specialization removed all uses of data on the shadow stack. This technically removes "side-effects" (updates to the shadow stack pointer global) but only if they are unobserved, as determined by an escape analysis.
-
Some kinds of patterns result from the partial evaluation that can be cleaned up further. In particular, when an initial value (such as a pc or sp) is Unknown (not constant or known at compile time), but offsets from that value are constant properties of program points in the bytecode, then we see chains of pc' = pc + 3, pc'' = pc' + 4, pc''' = pc'' + 1, etc. We can rewrite all of these in terms of the original non-constant value, in essence a re-association of (x + k1) + k2 to x + (k1 + k2), which removes a lot of extraneous dataflow from the fastpath of compiled code (see the small sketch after this list).
-
switch statements in the source language result in user-data-dependent "next contexts" and pc values; what is needed is to instead "value-specialize", i.e., split into sub-contexts each of which assumes i = 0, i = 1, ..., i >= N, generate a Wasm-level switch opcode in the output, and constant-propagate accordingly. This is how we can turn arbitrary-fanout control flow in the source bytecode directly into the same control flow in Wasm.
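A minimal before/after sketch of that re-association (illustrative offsets only):
#include <cstdint>

// Before: the residue of specialization threads a chain of pc updates through
// the fast path, even though pc itself is Unknown.
uint32_t pc_chain_before(uint32_t pc) {
  uint32_t pc1 = pc + 3;
  uint32_t pc2 = pc1 + 4;
  uint32_t pc3 = pc2 + 1;
  return pc3;
}

// After: each offset is re-expressed relative to the original non-constant
// value, (x + k1) + k2 => x + (k1 + k2), so the intermediate values die.
uint32_t pc_chain_after(uint32_t pc) {
  return pc + 8;
}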
Results: Annotation Overhead and Runtime Speedups
weval is a general tool, usable by any interpreter; to evaluate it, we'll need to consider it in the context of particular interpreters. Because it was developed with the SpiderMonkey JS engine in mind, and for my work to enable fully ahead-of-time JS compilation, I'll mainly present results in that context.
Annotation Overhead: the PR that adds weval
support to
SpiderMonkey shows a total lines-of-code delta of +1045 -2
-- a
little over a thousand lines of code -- though much of this is the
vendored weval.h
(part of weval proper, imported into the tree), and
a C++ RAII wrapper around weval requests (reusable in other
interpreters). The actual changes to the interpreter loop come in at
133 lines of
code
in an alternative set of macro definitions -- and, if we're being
fair, the earlier mechanical changes to access and update interpreter
state through these macros rather than directly. Then there is a
little bit of plumbing to create weval requests when a function is
created (EnqueueScriptSpecialization
and
EnqueueICStubSpecialization
), about a hundred lines total including
the definitions of those functions.
As an additional demonstration of ease-of-use, one might consider Max Bernstein's post, in which Max and I wrote a toy interpreter (10 opcodes) and weval'd it in a few hours.
Runtime Speedup: In my earlier post I presented results showing overall 2.77x geomean speedup, with up to 4.39x on the highest benchmark (over 4x on 3 of 13). This is the work of significant optimization in SpiderMonkey too -- this speedup does not come for free on day one of a weval-adapted interpreter -- but the results are absolutely enabled by AOT compilation, with the weval'd PBL interpreter body running as an interpreter providing only 1.27x performance over the generic interpreter. This means that weval's compilation via the first Futamura projection, as well as lifting of interpreter dataflow out of memory to SSA, and other miscellaneous post-processing optimization passes, are responsible for a 2.19x speedup (and that part truly is "for free").
Overall, I'm pretty happy with these results; and we are working on shipping the use of weval to AOT-compile JS in the Bytecode Alliance (see here).
Related Work
GraalVM
Finally, to place weval in a broader context: in addition to the foundational work on Futamura projections, weval is absolutely inspired by GraalVM, the only other significant (in fact far more significant) compiler based on partial-evaluation-of-interpreters to my knowledge. weval is in broad strokes doing "the same thing" as GraalVM, but makes a few (massive) simplifications and a few generalizations as well:
-
weval targets WebAssembly, a general-purpose low-level compiler target that can support interpreters written in many languages (e.g., C and C++, covering a majority of the most frequently-used scripting language interpreters and JavaScript interpreters today), while GraalVM targets the Java Virtual Machine.
-
weval is fully ahead-of-time, while classically at least, GraalVM is closely tied with the JVM's JIT and requires runtime codegen and (infamously) long warmup times. More recent versions of GraalVM also support GraalVM Native Image, though GraalVM still carries complexity due to its support for, e.g., runtime de-optimization.
-
weval requires the interpreter to add a few intrinsics (described above) but otherwise "meets interpreters where they are", allowing adaptation of existing industrial-grade language implementations (such as SpiderMonkey!), while GraalVM requires the interpreter to be written in terms of its own AST classes.
-
On the other hand, weval, as an AOT-only transform, does not support any speculative-optimization or deoptimization mechanisms, or gradual warmup, vastly simplifying the design as compared to GraalVM but also limiting performance and flexibility (for now).
porffor, Static Hermes, Static TypeScript, AssemblyScript
There have been a variety of ahead-of-time compilers that accept either JavaScript (Hopc, porffor), annotated JS (Static Hermes, Static TypeScript), or a JavaScript/TypeScript-like language (AssemblyScript).
Manuel Serrano in his DLS'18 paper describes building an ahead-of-time compiler, Hopc, for JS that takes a simple approach to specialization without runtime type observations: it builds one specialized version of each function alongside the generic version, and uses a set of type-inference rules to pick the most likely types. This is shown to work quite well in the paper. The main downside is the limit to how far this inference can go: for example, SpiderMonkey's ICs can specialize property accesses on object shape, while Hopc does not infer object shapes and in general such an analysis is hard (similar to global pointer analysis).
The porffor JS compiler is (as best I can tell!) emitting "direct" code for JS operators, with some type inference to reduce the number of cases needed. The speedups on small programs are compelling, and I think that taking this approach from scratch actually makes a lot of sense. That said, JS is a large language with a lot of awkward corner-cases, and with (as of writing) 39% of the ECMA-262 test suite passing, porffor still has a hill to climb to reach parity with mature engines such as SpiderMonkey or V8. I sincerely wish them all the best of luck and would love to see the project succeed!
Static Hermes and Static TypeScript both adopt the idea: what if we require type annotations for compilation? In that case, we can dispense with the whole idea of dynamic IC dispatch -- we know which case(s) will be needed ahead of time and we can inline them directly. This is definitely the right answer if one has those annotations. More modern codebases tend to have more discipline in this regard; unfortunately, the world is filled with "legacy JS" and so the idea is not universally applicable.
AssemblyScript is another design point in that same spirit, though further: though it superficially resembles TypeScript, its semantics are often a little different, in a way that makes the language simpler (to the implementer) overall but creates major compatibility hurdles in any effort to port existing TypeScript. For example, it does not support function closures (lambdas), iterators or exceptions, all regularly used features in large JS/TS codebases today.
Compared to all of the cases above, the approach we have taken here -- adapting SpiderMonkey and turning a 100%-coverage interpreter tier into a compiler with weval -- provides full compatibility with everything that the SpiderMonkey engine supports, i.e., all recent JavaScript features and APIs, without requiring anything special.
Conclusion: Principled Derivation of a JIT
weval has been an incredibly fun project, and honestly, that it is working this well has surprised me. GraalVM is certainly an existence proof for the idea, but it also has engineer-decades (perhaps centuries?) of work poured into it; it was a calculated risk I took to follow my intuitive sense that one could do it simpler if focused on the AOT-only case, and if leveraging the simplicity and full encapsulation of Wasm as a basis for program transforms. I do believe that weval was the shortest path I could have taken to produce an ahead-of-time compiler (that is, no warmup or runtime codegen) for JavaScript. This is largely for the reason that it let me factor out the "semantics of the bytecode execution" problem, i.e., writing the PBL interpreter, from the code generation problem; get the former right with a native debugger and native test workflow; and then do a principled transform to get a compiler out of it.
This is the big-picture story I want to tell as well: we can "factor complexity" by replacing manual implementations of large, complex systems with automatic derivations in some cases. Likely it won't ever be completely as good as the hand-written system; but it might get close, if one can find the right abstractions to express all the right optimizations. Profile-guided inlining of ICs to take AOT-compiled (weval'd) JS code to the next performance level is another good example, and in fact it was inspired by WarpMonkey which took the same "principled derivation from one single source of truth" approach to building a higher-performance tier. There is a lot more here to do.
Thanks to Luke Wagner, Nick Fitzgerald, and Max Bernstein for reading and providing feedback on a draft of this post!
In reality, native baseline compilers often keep a "virtual stack" that allows avoiding some push and pop operations -- basically by keeping track of which registers represent the values on the top of the stack whose pushes have been deferred, and using them directly rather than popping when possible, forcing a deferred "synchronization" only when the full stack contents have to be reified for, say, a call or a GC safepoint. Both SpiderMonkey's baseline compiler and Wasmtime's Winch implement this optimization, which can be seen as a simple form of abstract interpretation.
For more on this topic, these slides are a good introduction. The online book Static Program Analysis by Møller and Schwartzbach is also an invaluable resource, as well as the classic Dragon Book (Aho, Lam, Sethi, Ullman).
"Simple", right?! The real implementation is about 2400 lines of code, which is actually not terrible for something this powerful, IMHO. I'm personally shocked how "small" weval turned out to be.