This is a continuation of my “fast JS on Wasm” series: the first post covered PBL, a portable interpreter that supports inline caches; this post adds ahead-of-time compilation; and the final post will discuss the details of how that ahead-of-time compilation is built. Please read the first post first for useful context!


The most popular programming language in the world, by a wide margin – thanks to the ubiquity of the web – is JavaScript (or, if you prefer to follow international standards, ECMAScript per ECMA-262). For a computing platform to be relevant to many modern kinds of applications, it should run JavaScript.

For the past four years or so, I have been working on WebAssembly (Wasm) tooling and platforms, and in particular, running Wasm “outside the browser” (where it was born), using it for strong sandboxing of untrusted server-side code instead.

This blog post will describe my work, over the past 1.5 years, to build an ahead-of-time compiler from JavaScript to WebAssembly bytecode, achieving speedups of roughly 2.5-4.4x over the interpreter on most benchmarks (2.77x geomean; see the results below). The work is now technically largely complete: it has been integrated into the Bytecode Alliance’s version of SpiderMonkey (to be eventually upstreamed, ideally), then into our shared StarlingMonkey JS-on-Wasm runtime on top of that (available with the --aot flag to the componentize.sh toplevel build command), and then into my employer’s JS SDK built on top of StarlingMonkey. It passes all “JIT tests” and “JS tests” in the SpiderMonkey tree, and all Web Platform Tests at the runtime level. It’s still in “beta” and considered experimental, but now is a good time to do a deep-dive into how it works!

This JavaScript AOT compilation approach is built on top of my Portable Baseline Interpreter work in SpiderMonkey, combined with my weval partial program evaluator to provide compilation from an interpreter body “for free” (using a Futamura projection). I have recently given a talk about this work. I hope to go into a bit more depth in this post and a followup one; first, how one can compile JavaScript ahead-of-time at all, and to come later, how weval enables easier construction of the necessary compiler backends.

Background

How do we run JavaScript (JS) on a platform that supports WebAssembly? At first, this question didn’t need asking, because Wasm originated within existing JS engines as a way to provide lower-level, strongly-typed code directly to the engine’s compiler backend: one could run JS code alongside any Wasm modules and they could interact at a function-call level. But on Wasm-first platforms, “outside the browser” as we say, we have no system-level JS engine; we can only run Wasm modules that have been uploaded by the user, which interact with the platform via direct “hostcalls” available as Wasm imports.

To quote my earlier blog post on this topic, by adopting this restriction, we obtain a number of advantages:

Running an entire JavaScript engine inside of a Wasm module may seem like a strange approach at first, but it serves real use-cases. There are platforms that accept WebAssembly-sandboxed code for security reasons, as it ensures complete memory isolation between requests while remaining very fine-grained (hence with lower overheads). In such an environment, JavaScript code needs to bring its own engine, because no platform-native JS engine is provided. This approach ensures a sandbox without trusting the JavaScript engine’s security – because the JS engine is just another application on the hardened Wasm platform – and carries other benefits too: for example, the JS code can interact with other languages that compile to Wasm easily, and we can leverage Wasm’s determinism and modularity to snapshot execution and then perform extremely fast cold startup.

So we have very fine-grained isolation and security, we eliminate JIT bugs altogether (the most productive source of CVEs in production browsers today!), and the modularity enables interesting new system design points.

Unfortunately, the state of the art for bundling JS “with its own engine” today is, still, to combine an interpreter with a bytecode representation of the JS, without any compilation of the JS to specialized “native” (or in this case Wasm) code. There are two reasons. The first and more straightforward one is that Wasm is simply a new platform: JS engines’ interpreters can be ported relatively easily, because they have little dependence on the underlying instruction set architecture/platform, but JIT compilers very much do. The second, more fundamental reason lies in the characteristics of Wasm-first platforms themselves, which we will examine below.

Compiling JS: Why is it Hard?

It’s worth reviewing first why this is a hard problem. It’s possible to compile quite a few languages to a Wasm target today: C or C++ (with wasi-sdk or Emscripten, for example), Rust (with its first-class Wasm support), Swift, Kotlin, Scala, Haskell, OCaml, and probably many more. What makes JavaScript different?

The main difficulty comes from JS’s dynamic typing: because variables are not annotated or constrained to a single primitive type or object class, a simple expression like x + y could mean many different things. This could be an integer or floating-point numeric addition, a string concatenation, or an invocation of arbitrary JS code (via .toString(), .valueOf(), proxy objects, or probably other tricks too). A naive compiler would generate a large amount of code for this simple expression that performs type checks and dispatches to one of these different cases. For both runtime performance reasons (checking for all cases is quite slow) and code-size reasons (can we afford hundreds or thousands of Wasm instructions for each JS operator?), this is impractical. We will need to somehow adapt the techniques that modern JS engines have invented to the Wasm platform.
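
To make the code-size and performance concern concrete, here is a hedged sketch of what a naive, fully generic compilation of x + y would have to include for every single + in the program. The Value type and helper functions are purely illustrative (not any real engine’s API), and the sketch still omits cases such as BigInt and the exact ToPrimitive ordering the spec requires:

// A hedged sketch of a fully generic `x + y`; all names are illustrative.
Value GenericAdd(Value x, Value y) {
  if (x.isInt32() && y.isInt32()) {
    int64_t sum = (int64_t)x.toInt32() + (int64_t)y.toInt32();
    if (sum == (int32_t)sum) return Value::fromInt32((int32_t)sum);
    return Value::fromDouble((double)sum);                   // int32 overflow -> double
  }
  if (x.isNumber() && y.isNumber())
    return Value::fromDouble(x.toNumber() + y.toNumber());   // floating-point addition
  if (x.isString() || y.isString())
    return ConcatStrings(ToString(x), ToString(y));          // string concatenation
  // Objects: ToPrimitive may invoke arbitrary user JS (valueOf/toString/...),
  // then we try the whole dispatch again on the primitive results.
  return GenericAdd(ToPrimitive(x), ToPrimitive(y));
}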

This leads us directly to the engineering marvel that is the modern JIT (just-in-time) compiler for JavaScript. These engines, such as SpiderMonkey (part of Firefox), V8 (part of Chromium), and JavaScriptCore (part of WebKit), generate native machine code for JavaScript code. They work around the above difficulty by generating only code for the cases that actually occur, specializing based on runtime observations.

These JITs work fantastically well today on native platforms, but there are technical reasons why a Wasm-based platform is a poor fit for current JS engines’ designs.

Unique Characteristics of a Wasm Platform

A typical “Wasm-first” platform – whether that be Wasmtime running server-side, serving requests with untrusted handler code, or a sandboxed plugin interface inside a desktop application, or an embedded system – has a few key characteristics that distinguish it from a typical native OS environment (such as that seen by a Linux x86-64 process).

No Runtime Codegen

First, these platforms typically lack any dynamic/runtime code-generation mechanism: that is, unlike a native process’ ability to JIT-compile new machine code and run it, a Wasm module has no interface by which it can add new code at runtime. In other words, code-loading functionality lives outside the sandbox, typically handled by a “deployment” or “control plane” component of the platform.1 This allows greater management flexibility for the platform: for example, it allows deployment of pre-compiled machine code for all Wasm modules from a central control plane, allowing “instant start” when a module is instantiated to serve a request. It also brings significant security benefits: all code that executes is known a priori, which makes security exploits harder to hide.

The major downside of this choice, of course, is that one cannot implement a JIT in the traditional way: one cannot generate new code based on observed behavior at runtime, because there is simply no mechanism to invoke it.

Protected Call Stack and No Primitive Control Flow

Second, Wasm itself has some interesting divergences from conventional instruction-set architecture (ISA) design. While a typical ISA on a conventional CPU provides “raw” branch and call instructions that transfer control to addresses in the main memory address space, WebAssembly has first-class abstractions for modules and functions within those modules, and all of its inter-function transfer instructions (calls and returns) target known function entry points by function index and maintain a protected (unmodifiable) call-stack. The main advantage of this design choice is that a “trusted stack” allows for function-level interop between different, mutually untrusting, modules in a Wasm VM, potentially written in different languages; and when Wasm-level features such as exceptions are implemented and used widely, seamless interop of those features as well. In essence, it is an enforced ABI. (My colleague Dan Gohman has some interesting thoughts about this in this talk, at 13 minutes in.)

This is great for language interop and for security, but it rules out a number of interesting control-flow patterns that language runtimes use to implement features such as exceptions, generators/coroutines, pluggable “stubs” of optimized code, patchable jump-points, and dynamic transfer between different optimization levels of the same code (“on-stack replacement”). In other words, it conflicts with how language runtimes want to implement their ABIs.

Per-Request Isolation and Lack of Warmup

Third, and specific to some of the Wasm-first platforms we care about, Wasm’s fast instantiation allows for some very fine-grained sandboxing approaches. In particular, when a new Wasm instance takes only a few microseconds to start from a snapshot, it becomes feasible to treat instances as disposable, and use them for small units of work. In the platform I work on, we have an instance-per-request isolation model: each Wasm instance serves only one HTTP request, and then goes away. This is fantastic for security and bug mitigation: the blast radius of an exploit or guest-runtime bug is only a single request, and can never see the data from other users of the platform or even other requests by the same user.

This fine-grained isolation is, again, great for security and robustness, but throws a wrench into any language runtime’s plan that begins with “observe the program for a while and…”. In other words, we cannot implement a scheme for a dynamically-typed language that requires us to specialize based on observation (of types, or dynamic dispatch targets, or object schemas, or the like) because by the time we make those observations for one request, our instance is disposed and the next request starts “fresh” from the original snapshot.

What are we to do?

Making JS Fast: Specialized Runtime Codegen

Let’s first recap how JavaScript engines’ JITs typically work – at a high level, how they observe program behavior and compile the JS into machine code based on that behavior.

Inline Caches

As I described in more detail in my earlier post about PBL, the key technique that all modern JS engines use to accelerate a dynamically-typed language is to observe execution – with particular care toward actual types (var x is always a string or an integer, for example), object shapes (x always has properties x.a, x.b and x.c and we can store them in memory in that order), or other specialized cases (x.length always accesses the length slot of an array object) – and then generate specialized code for the actually-observed cases. For example, if we observe x is always an array, we can compile x.length to a single native load instruction of the array’s length in its object header; we don’t need to handle cases where x is some other type. This specialized codegen is necessarily at runtime, hence the name for the technique, “just-in-time (JIT) compilation”: we generate the specialized code while the program is running, just before it is executed.
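
For instance, here is a hedged sketch (with invented names, not SpiderMonkey’s actual internals) of the specialized code a JIT might emit for x.length once it has observed that x is always an array with a particular shape:

// Sketch: specialized code for `x.length` after observing x is always an array.
// `expectedShape` was baked in when this code was generated.
Value GetLength_Specialized(Value x, Shape* expectedShape) {
  if (x.isObject() && x.toObject()->shape() == expectedShape) {
    // One guard, then a single load of the length from the object -- no generic
    // property lookup, no string/proxy/prototype-chain handling.
    return Value::fromInt32(((ArrayObject*)x.toObject())->length());
  }
  return GetProperty_Generic(x, "length");  // unexpected case: take the slow, generic path
}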

One could imagine building some ad-hoc framework to collect a lot of observations (“type feedback”) and then direct the compiler appropriately, and in fact you can get pretty far building a JIT this way. However, SpiderMonkey uses a slightly more principled approach: it uses inline caches for all type-feedback and other runtime observations, and encodes the observed cases in a specialized compiler IR, called CacheIR.

The basic idea is: at every point in the program where there could be some specialized behavior based on what we observe at runtime, we have an “IC site”. This is a dynamic dispatch point: it invokes some stub, or sequence of code, that has been “attached” to the IC site, and we can attach new stubs as we execute the code. We always start with a “fallback stub” that handles every case generically, but we can emit new stubs as we learn. There is a good example of IC-based specialization in my earlier post, summarized with this figure:

Figure: Inline-cache stubs in a JavaScript function
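
In data-structure terms, an IC site is little more than a mutable pointer to a chain of stubs that ends in the fallback stub. Here is a hedged sketch (simplified layout and invented names, not SpiderMonkey’s real structures):

struct ICStub;
using ICStubFn = Value (*)(ICStub* self, Value* args);  // `self` gives the stub access to its guard data

struct ICStub {
  ICStubFn body;      // code for one observed case, or the generic fallback
  ICStub* next;       // next stub to try when this stub's guards fail
};

struct ICSite {
  ICStub* firstStub;  // initially points directly at the fallback stub

  Value dispatch(Value* args) {
    // Each stub either handles the operation or falls through to `next`;
    // the fallback stub at the end of the chain handles every case.
    return firstStub->body(firstStub, args);
  }

  void attach(ICStub* stub) {
    // Attaching a newly observed case is a pure data update on the chain;
    // where the stub's body code comes from (JIT-generated, or precompiled
    // into the engine) is a separate question.
    stub->next = firstStub;
    firstStub = stub;
  }
};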

Compilation Phasing for ICs

So the question is now: how do we adapt a system that fundamentally generates new code at runtime – in this case, in the form of IC stub sequences, which encode observed special cases (“if conditions X and Y, do Z”) – to a Wasm-based platform that restricts the program to its existing, static code?

Let’s ask a basic question: are IC bodies really always completely novel, or do we see the same sequences over and over? Intuitively, one would expect a few common sequences to dominate most programs. For example, the inline cache sequence for a common object property access is a few CacheIR opcodes (simplifying a bit, but not too much, from the actual code):

  • Check that the receiver (x in x.a) is an object;
  • Check that its “shape” (mapping from property names to slots) is the same;
  • Load or store the appropriate slot.
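
Written out, the recorded IR for such a property access might look roughly like the following. This is a hedged sketch in the style of SpiderMonkey’s CacheIR writer; the op names are close to the real ones, but the exact signatures and details here are approximations, not the actual API:

// Roughly the CacheIR recorded for a get of `x.a` on a plain object whose
// property `a` lives in a fixed slot (names/signatures are approximate).
void EmitGetPropIC(CacheIRWriter& writer, ValOperandId xId, Shape* observedShape, uint32_t slotOffset) {
  ObjOperandId objId = writer.guardToObject(xId);   // 1. receiver must be an object
  writer.guardShape(objId, observedShape);          // 2. ...with the same shape we observed
  writer.loadFixedSlotResult(objId, slotOffset);    // 3. load the slot as the result
  writer.returnFromIC();
}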

One would expect nearly any real JavaScript program to use that IC body at some point to get or set an object property. Can’t we include it in the engine build itself, avoiding the need to compile it at runtime?

We could special-case “built-in ICs”, but there’s a more elegant approach: retain the uniform CacheIR representation (we always generate IR for cases we observe), but then look up the generated CacheIR bytecode sequence in a table of included precompiled ICs.
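
A hedged sketch of that lookup, with illustrative types and names (the real engine keys and manages its table differently, and InterpretCacheIR here stands in for the interpreter fallback):

#include <cstdint>
#include <string>
#include <unordered_map>

// Map each generated CacheIR bytecode sequence to a precompiled body that is
// built into the engine; fall back to interpreting the CacheIR otherwise.
using ICBody = Value (*)(ICStub* stub, Value* args);

std::unordered_map<std::string, ICBody> precompiledICs;   // key: raw CacheIR bytes

ICBody lookupICBody(const uint8_t* cacheIR, size_t len) {
  std::string key(reinterpret_cast<const char*>(cacheIR), len);
  auto it = precompiledICs.find(key);
  if (it != precompiledICs.end())
    return it->second;        // hit: use the AOT-compiled IC body built into the engine
  return InterpretCacheIR;    // miss: interpret this (novel) CacheIR sequence instead
}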

How do we decide which ICs to include, and do we need to write them out manually? Well, not necessarily: it turns out that computers are great at automating boring tasks, and we could simply collect all ICs ever generated during a bunch of JavaScript execution (say, the entire testsuite for SpiderMonkey) and build them in. Lookup tables are cheap, and ICs tend to be small in terms of compiled code size, so there isn’t much downside to including a few thousand ICs (2367 by latest count).

This is the approach I took with AOT ICs: I built infrastructure to compile ICs into the engine, loading them at startup into the lookup table (which already existed to deduplicate compiled IC bodies), and added an enforcing mode that both verifies the corpus and makes it easy to gather new entries.

The latter part – collecting the corpus, and testing that the corpus is complete when running the whole testsuite – is key, because it answers the social/software-engineering question of how to keep the corpus up-to-date. The idea is to make updating it as easy as possible. When the testsuite runs in CI, we test in “AOT ICs” mode that errors out if an IC is generated that is not already in the compiled-in corpus. However, this failure also dumps a new IC file and provides a message with instructions: move this file into js/src/ics/ and rebuild, and the observed-but-not-collected IC becomes part of the corpus. The goal is to catch the case where a developer adds a new IC path and make the corresponding corpus addition as painless as possible.

Another point worth noting is that while there is no guarantee that the engine won’t generate novel IC bodies during execution of some program – for example, during accesses to an object with a very long prototype chain that results in a long sequence of object-prototype checks – by integrating the corpus check with the testsuite execution, there are aligned incentives. Anyone adding new IC-accelerated functionality should add a test case that covers it; and when we execute this testcase, we will check whether the corpus includes the relevant IC(s) or not, too.

What About JS Bytecode?

We now have a corpus of IC bodies that covers all the relevant cases for, say, the + operator in JavaScript, but we still haven’t addressed the actual compilation of the JavaScript itself. Fortunately, this is now the easiest part: we do a mostly 1-to-1 translation of the JS (operator by operator, via its bytecode) to code that consists of a series of IC sites, invoking the current IC-stub pointer for each operator instance in turn. The dataflow (connectivity between the operators) and control flow (conditionals and loops) become the compiled code “skeleton” around these IC sites. So, for example, the JS function

function f(x, y) {
  if (y >= 0) {
    return x + y;
  } else {
    return x - y;
  }
}

might become something like the following pseudo-C code sketch:

Value f(Value x, Value y) {
  Value t1 = ics[0](y, 0);  // `>=` operator
  bool t2 = ics[1](t1);     // Value-to-bool coercion
  if (t2) {
    Value t3 = ics[2](x, y); // `+` operator
    return t3;
  } else {
    Value t4 = ics[3](x, y); // `-` operator
    return t4;
  }
}

Given this 1-to-1 compilation of all JS functions in the JS source, and a corpus of precompiled ICs, we have a full ahead-of-time compiler for JavaScript.

Ahead-of-Time (AOT) Compilation of IC-Based JS

It’s worth highlighting this fact again: we have built an ahead-of-time compiler for JavaScript. How did this happen? Isn’t runtime code generation necessary for efficient compilation of a dynamically-typed language?

The key insight here is that the dynamism has been pushed into runtime late-binding, in the form of the indirect calls to IC bodies. In other words: behavior may vary widely depending on types; but rather than building all the options in statically, we have “attachment points” at each operation site, and we can dynamically insert particular behaviors by setting a function pointer rather than generating new code. We’re patching together fragments of static code by updating dynamic data instead.

This execution model is known in SpiderMonkey as baseline compilation. The main goal of the baseline compiler in SpiderMonkey is to generate code as quickly as possible, which is a distinct goal from generating code without any runtime observations about that code; but the two goals overlap, as both are relevant at the lower tiers of the compiler hierarchy. In any case, a key fact about baseline compilation is: it does not require any type feedback, and can be done with only the JS source and nothing else. It admits fully AOT compilation.

Going Further: Bringing In Runtime Feedback?

Baseline compilation works well enough, but this runtime binding has one key limitation: the compiler cannot optimize combinations of ICs together. For example, if we have a series of IC stubs, one for each operator in a function, we cannot “blend” these IC bodies into one large function body and let optimizations take hold; we cannot propagate knowledge across ICs. One + operator may produce an integer, and within the IC stub for that case, we can make use of the known result type; but the next + operator to consume that value has to do its type-checks all over again. This is a fundamental consequence of the late binding: we’ve pushed composition of behaviors to runtime via the dynamic function-pointer “attachment”, and we don’t have a compiler at runtime, so there is nothing to be done.
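
For example, in the same pseudo-C style as the earlier sketch, two adjacent + operators compile to two independent IC calls, and each stub must re-establish facts the previous one already proved:

// `x + y + z` as two IC sites: type knowledge does not flow across the boundary.
Value t1 = ics[0](x, y);   // int32 fast-path stub: checks x and y are int32s, adds
Value t2 = ics[1](t1, z);  // must check t1's type tag again, even though the
                           // previous stub just produced an int32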

If we want to go further, though, we can learn one more lesson from SpiderMonkey’s design: baseline compilation with ICs is a solid framework for re-admitting this kind of whole-function optimization with types. Specifically, SpiderMonkey’s WarpMonkey optimizing compiler tier works by allowing baseline code to “warm up” its ICs, collecting the most relevant / most common special cases; then it inlines those ICs at the call sites. And that’s it!

Said a different way: putting all type feedback into a “call specialized code stubs” framework reduces type-feedback compilation to an inlining problem. Compiler engineers know how to build inliners; that’s far easier than an ad-hoc optimizing JIT tier.

This leads to a natural way we could build a further-optimizing tier on top of our initial AOT JS compiler: build a “profile-guided inliner”. In fact, this has been done: my colleague Nick Fitzgerald built a prototype, Winliner, that profiles Wasm indirect calls and inlines the most frequent targets. The idea in the context of JS is to record the most frequent call target T at each IC dispatch site, then replace the indirect call with if target == T { /* inlined code */ } else { call target }, as sketched below.
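
Here is a hedged sketch of that transform at a single IC site, in the same pseudo-C style as before; in reality Winliner rewrites the Wasm call_indirect instructions themselves, and AddInt32Stub is a hypothetical name for the hot stub the profile identified:

// Before: a plain indirect call through the IC site's current stub pointer.
//   Value t3 = ics[2](x, y);
//
// After profile-guided inlining, if the profile showed ics[2] almost always
// points at AddInt32Stub:
Value t3;
if (ics[2] == &AddInt32Stub) {
  t3 = AddInt32Stub(x, y);   // this call can be inlined, so its guards and fast
                             // path become visible to the optimizer alongside
                             // the surrounding code
} else {
  t3 = ics[2](x, y);         // rare case: exactly the same indirect call as before
}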

The beauty of this approach is that it is semantics-preserving: the transform itself does not know anything about JavaScript engines, and the resulting code will behave exactly the same, so if we have gotten our baseline-level AOT compiler to generate correct code, this optimizing tier will, as well, without new bugs. Winliner and profile-guided inlining let us “level-up” from baseline compilation to optimized compilation in a way that is as close to “for free” as one could hope for. The inlining tool and the baseline compiler can be separately tested and verified; we can carefully reason about each one, and convince ourselves they are correct; and we don’t need to worry about bugs in the combination of the two pieces (where bugs most often lurk) because by taking the Wasm ISA as the interface, they compose correctly.

Results

That’s all well and good; what are the results?

The final PR in which I introduced AOT compilation to our SpiderMonkey branch quotes these numbers on the Octane benchmark suite (numbers are rates, higher is better):

SpiderMonkey running inside a Wasm module:

- generic interpreter (in production today) vs.
- ahead-of-time compilation (this post)

               interpreter        AOT compilation     Speedup
Richards        166                 729               4.39x
DeltaBlue       169                 686               4.06x
Crypto          412                1255               3.05x
RayTrace        525                1315               2.50x
EarleyBoyer     728                2561               3.52x
RegExp          271                 461               1.70x
Splay          1262                3258               2.58x
NavierStokes    656                2255               3.44x
PdfJS          2182                5991               2.75x
Mandreel        166                 503               3.03x
Gameboy        1357                4659               3.43x
CodeLoad      19417               17488               0.90x
Box2D           927                3745               4.04x
----
Geomean         821                2273               2.77x

or about a 2.77x geomean improvement overall, with a maximum of 4.39x on one benchmark and most benchmarks around 2.5x-3.5x. Not bad! For comparison, the native baseline compiler in SpiderMonkey obtains around a 5x geomean – so we have some room to grow still, but this is a solid initial release. And of course, as noted above, by inlining ICs then optimizing further, we should be able to incorporate profile-guided feedback (as native JITs do) to obtain higher performance in the future.

Other Approaches?

It’s worth noting at this point that a few other JS compilers exist that attempt to do specialized codegen without runtime type observations or other profiling/warmup. Manuel Serrano’s Hopc compiler, described in his DLS’18 paper, works by building one specialized version of each function alongside the generic version, and uses a set of type-inference rules to pick the most likely types. This is shown to work quite well in the paper.

The porffor compiler is similarly a fully AOT JS compiler; it appears to use some inference to emit only the cases that are needed, when possible. That project is a work in progress, with support for 39% of ECMA-262 currently, but seems promising.

The main downside to an inference-based approach (as opposed to the dynamic indirection through ICs in our work) is the limit to how far this inference can go: for example, SpiderMonkey’s ICs can specialize property accesses on object shape, while Hopc does not infer object shapes and in general such an analysis is hard (similar to global pointer analysis).

I’d be remiss not to note a practical aspect too: there is a large body of code that embeds SpiderMonkey and is written against its APIs, and the engine supports all of the latest JS proposals and is actively maintained. The cost of reaching parity with that level of support in a from-scratch AOT compiler was one of the main reasons I opted to build on SpiderMonkey (adapting its ICs to be AOT and compiling its bytecode) instead.

Up Next: Compiler Backends (For Free)

Astute readers will note that while I stated that we do a compilation from the JS source to Wasm bytecode, and from IC bodies in the corpus to Wasm bytecode, I haven’t said how we do this compilation. In my previous post on Portable Baseline, I described implementing an interpreter for IC bodies, and an interpreter for JS bytecode that can invoke these IC bodies. Developing two compiler backends for these two kinds of bytecode (CacheIR and JS) is quite the mountain to climb, and one naturally wonders whether all the effort to design, build, and debug the interpreters could be reused somehow. If you do indeed wonder this, then you’ll love the upcoming part 3 of this series, where I describe how we can derive these compiler backends automatically from the interpreters, reducing maintenance burden and complexity significantly. Stay tuned!


Thanks to Luke Wagner, Nick Fitzgerald, and Max Bernstein for reading and providing feedback on a draft of this post!

  1. On the Web, a Wasm module can “trampoline” out to JavaScript to load and instantiate another module with new code, sharing the same heap, then invoke that new module via a shared function-reference table. This is technically workable but somewhat slow and cumbersome, and this mechanism does not exist on Wasm-only platforms. Note that the primary reason is the design choice to allow code-loading only via a control plane; nothing technically stops the platform from providing a direct Wasm hostcall for the same purpose.