For the past year, I have been hard at work trying to improve the performance of the SpiderMonkey JavaScript engine when compiled as a WebAssembly module. For server-side applications that use WebAssembly (and WASI, its "system" layer) for software distribution and sandboxing -- an area with significant, exciting potential -- this is an important enabling technology: it allows existing software written in JavaScript to run within the sandboxed environment and to interact with other Wasm modules.

Running an entire JavaScript engine inside of a Wasm module may seem like a strange approach at first, but it serves real use-cases. There are platforms that accept WebAssembly-sandboxed code for security reasons, as it ensures complete memory isolation between requests while remaining very fine-grained (hence with lower overheads). In such an environment, JavaScript code needs to bring its own engine, because no platform-native JS engine is provided. This approach ensures a sandbox without trusting the JavaScript engine's security -- because the JS engine is just another application on the hardened Wasm platform -- and carries other benefits too: for example, the JS code can interact with other languages that compile to Wasm easily, and we can leverage Wasm's determinism and modularity to snapshot execution and then perform extremely fast cold startup. We have been using this strategy to great success for a while now: we did the initial port of SpiderMonkey to WASI in 2020, and we wrote two years ago about how we can leverage Wasm's clean modularity and determinism to use snapshots for fast startup.

This post is an update to that effort. At the end of that prior post, we hinted at efforts to bring more of the JavaScript performance engineering magic that browsers have done to the JS-in-Wasm environment. Today we'll see how we've successfully adapted inline caches, achieving significant speedups (~2x in some cases) without compromising the security of the interpreter-based strategy. At the end of this post, I'll hint at how we plan to use this as a foundation for ahead-of-time compilation as well. Exciting times ahead!

Note: the design document is also available, and SpiderMonkey patches can be found on the upstreaming bug.

Current State: Interpreters Only Beyond This Point

A distinguishing feature of some platforms is that they do not allow runtime code generation. For example, a WebAssembly module may contain functions; these functions are compiled at some point before they are run; but they cannot do anything to create new functions in the same code space, at least without going through some lower-level and nonstandard system interface.1

On the other hand, JavaScript engines over the past several decades have embraced runtime code generation. The basic reason for this is that there are a lot of facts about a JS program that one cannot know (or not easily, without human intelligence and reasoning and a view of the whole program) until the program executes. For example: in the simple one-line function function(x) { return x + x; }, what is x's type? It could be an integer, a floating-point number, a string, an object that can be converted to a string, or probably many other things.2 If one were to try to generate machine code for that function ahead-of-time, it would look very different from that of, say, a C function with the same return x + x; body but an integer-typed x. It would have to contain type-checks, branches, and implementations for all the different possible cases. With that dynamic dispatch overhead, and the bloated and difficult-to-optimize function body that supports all combinations of types, we are unlikely to see much speedup unless we can know something about the types. A similar problem arises in other "dynamic" aspects of the language: when we say obj.x, where in the memory for the object obj is the field x? When we call a function f(1, 2, 3), is f another JavaScript function, a native function in the runtime, or something even more special that we handle in a different way?
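To make this concrete, here is a toy sketch -- using a simple tagged-value type, not SpiderMonkey's actual value representation -- of what a fully generic, ahead-of-time-compiled version of that + has to do when nothing is known about its operands:

  // Toy illustration only: a generic "add" over dynamically-typed values.
  // Every execution pays for type dispatch, even if a given call site only
  // ever adds int32s.
  #include <cstdint>
  #include <string>
  #include <variant>

  using Value = std::variant<int32_t, double, std::string>;

  Value genericAdd(const Value& a, const Value& b) {
    if (auto* ai = std::get_if<int32_t>(&a)) {
      if (auto* bi = std::get_if<int32_t>(&b)) {
        int64_t sum = int64_t(*ai) + int64_t(*bi);
        if (sum >= INT32_MIN && sum <= INT32_MAX) return int32_t(sum);
        return double(sum);  // int32 overflow: fall back to a double result
      }
    }
    if (auto* as = std::get_if<std::string>(&a)) {
      if (auto* bs = std::get_if<std::string>(&b)) {
        return *as + *bs;  // string concatenation
      }
    }
    // Mixed or remaining numeric cases: convert both sides to a number and
    // add. (Real JS semantics -- ToPrimitive, ToNumber, string coercion of
    // mixed operands, and so on -- are much more involved; elided here.)
    auto toNumber = [](const Value& v) -> double {
      if (auto* i = std::get_if<int32_t>(&v)) return double(*i);
      if (auto* d = std::get_if<double>(&v)) return *d;
      return 0.0;  // placeholder for the real (complex) coercion rules
    };
    return toNumber(a) + toNumber(b);
  }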

The modern JavaScript engine's answer to performance, then, is to collect a lot of information as a program executes and then dynamically generate machine-code that is specialized to what the program is actually doing, but can fall back to the generic implementation if conditions change. Because we can't generate this code until the program is already running, we need the platform to support the ability to add more executable code as we are running: this is "runtime codegen", or as it is often known, "JIT (just-in-time) compilation".

So we have a conundrum: we run JavaScript inside a Wasm module, we want performance, but the usual way to get that performance is to JIT-compile specialized machine-code versions of the JS code after observing it, and we can't do that from within a Wasm module. What are we to do?

Systematic Fast-Paths: Inline Caches

It's worth understanding how JITs specialize the execution of the program based on runtime observations. A simple approach might be to build in "ad-hoc" kinds of observations to the interpreter, and use those as needed in a type-specific "specializing" compiler. For example, we could record types seen at a + operator at a given point in the program, and then generate only the cases we've observed when we compile that operator (perhaps with "fallback code" to call into a generic implementation if our assumption becomes wrong). However, this ad-hoc approach does not scale well: every semantic piece (operators, function calls, object accesses) of the language implementation would have to become a profiler, an analysis pass, and a profile-guided compiler.

Instead, JITs specialize with a general dispatch mechanism known as "inline caches" (ICs), and ICs build straight-line sequences of "fast-paths" in a uniform program representation.

The usual approach is to define certain points in the original program at which we have an operator with widely-varying behavior, and place an inline cache site in the compiled code. The idea of an inline cache site is that it performs an indirect dispatch to some other "stub": these are "pluggable implementations" that replace a generic operator, like +, with some specific case, like "if both inputs are int32-typed, do an integer add". For example, we might compile the following function body into the polymorphic (type-generic) code on the left, then generate the specialized fast-paths and attach them as stubs on the right:

Figure: Inline-cache stubs in a JavaScript function

The IC site starts with a link to a generic implementation -- just like the naive interpreter -- that works for any case. However, after it executes, it also generates a fast-path for that case and "attaches" the new stub to the IC site. The stubs form a "chain", or singly-linked list, with the generic case at the end. Some examples of fast-paths that we see in practice are:

  • For an access to a property on an object, we can generate a fast-path that checks whether the object has a known "shape" -- defined as the set of existing properties and their layout in memory -- and, on a match, directly accesses the appropriate memory offset. This avoids an expensive lookup by property name.

  • For any of the polymorphic built-in operators, like +, we can generate a fast-path that checks types and does the appropriate primitive action (integer addition, floating-point addition, or string concatenation, say).

  • For a call to a built-in or "well-known" function, we can generate a fast-path that avoids a function call altogether. For example, if the user calls String.length, and this has not been overridden globally (we need to check!) and the input is a string, then the IC can load the string length directly from the known length-field location in the object. This replaces a call into the JS runtime's native string-implementation code with just a few IC instructions.

Each stub has a simple format: it checks some conditions, then either does its action (if it is a matching fast-path) or jumps to the next stub in the chain.
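In code, one can picture an IC site and its chain roughly like this (a simplified sketch with illustrative names, reusing the toy Value type from the earlier sketch; SpiderMonkey's actual data structures are more involved):

  // Simplified sketch of an IC site and its stub chain; names and layout are
  // illustrative, not SpiderMonkey's actual data structures. `Value` is the
  // toy tagged-value type from the earlier sketch.
  struct ICStub {
    // Fast-path implementation: returns true (and fills *result) if its
    // guards matched and it handled the operation; false to fall through.
    bool (*code)(const Value& lhs, const Value& rhs, Value* result);
    ICStub* next;  // next stub in the chain; the generic fallback is last
  };

  struct ICSite {
    ICStub* firstStub;  // head of chain; initially just the generic fallback

    Value dispatch(const Value& lhs, const Value& rhs) {
      for (ICStub* stub = firstStub; stub != nullptr; stub = stub->next) {
        Value result;
        if (stub->code(lhs, rhs, &result)) {
          return result;
        }
      }
      // Unreachable in practice: the generic fallback stub always succeeds
      // (and may attach a new specialized stub for the case it just saw).
      return Value{};
    }
  };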

This collection of stubs, once "warmed up" by program execution, is useful in at least two ways. First, it represents a knowledge-base of the program's actual behavior. The execution has been "tuned" to have fast-paths inserted for cases that are actually observed, and will become faster as a result. That is quite powerful indeed!

Second, an even more interesting opportunity arises (first introduced in the WarpMonkey effort from the SpiderMonkey team, to their knowledge a novel contribution): once we have the IC chains, we can use the combination of two parts -- the original program bytecode, and the pluggable stub fast-paths -- to compile fully specialized code by translating both to one IR and inlining. This is how we achieve specialized-variant compilation in a systematic way: we just write out the necessary fast-paths as we need them, and then we later incorporate them.3 The following figure illustrates this process:

Figure: optimized JS compilation by inlining ICs

In the SpiderMonkey engine, there are three JIT tiers that make use of ICs:

  • The "baseline interpreter" interprets the JS function body's opcodes, but accelerates individual operations with ICs. The interpreter-based approach means we have fast startup (because we don't need to compile the function body), while ICs give significant speedups on many operations.

  • The "baseline compiler" translates the JS function body into machine code on a 1-for-1 basis (each JS opcode becomes a small snippet of machine code), and dispatches to the same ICs that the baseline interpreter does. The main speedup over the baseline interpreter is that we no longer have the "JS opcode dispatch overhead" (the cost of fetching JS opcodes and jumping to the right interpreter case), though we do still have the "IC dispatch overhead" (the cost of jumping to the right fast-path).

  • The optimizing compiler, known as WarpMonkey, inlines ICs and JS bytecode to perform specialized compilation.

We can summarize the advantages and tradeoffs of these tiers as follows:

| Name | Data Required | JS opcode dispatch | ICs | Optimization scope | CacheIR dispatch | Codegen at runtime |
|------|---------------|--------------------|-----|--------------------|------------------|--------------------|
| Generic (C++) interpreter | JS bytecode | Interpreter (C++) | None | None | N/A | No |
| Baseline interpreter | JS bytecode + IC data structures | Interpreter (generated at startup) | Dynamic dispatch | Special cases within one opcode via IC | Compiled | Yes (IC bodies, interp body) |
| Baseline compiler | JS bytecode + IC data structures | Compiled function body (1:1 with bytecode) | Dynamic dispatch | Special cases within one opcode via IC | Compiled | Yes (IC bodies, function bodies) |
| Optimizing compiler (WarpMonkey) | JS bytecode + warmed-up IC chains | Compiled function body (optimized) | Inlined | Across opcodes / whole function | Compiled | Yes (optimized function body) |

Can we Use ICs in a WASI Program?

Given that we have a means to speed up execution beyond that of a generic interpreter, namely, inline caches (ICs), and given that SpiderMonkey supports ICs, surely we can simply make use of this feature in a build of SpiderMonkey for WASI (i.e., when running inside of a Wasm module)?

Not so fast! There are two basic problems:

  • As designed, the IC stubs can only be run as compiled code. Even the "baseline interpreter" above will invoke a pointer to an IC stub of machine code compiled with a dedicated IC-stub compiler.

    This works well for SpiderMonkey on a native platform -- the fastest way to implement a fast-path is to produce a purpose-built sequence of a handful of machine instructions -- but it collides with WebAssembly's inability to add new code at runtime that we noted above: SpiderMonkey only knows what the fast-paths should be after it starts executing, which is too late to add code to the Wasm module.

  • Less fundamental, but still a roadblock: the "baseline interpreter" in SpiderMonkey is also JIT-compiled, albeit once at JS engine startup rather than as code is executing. This is more of an implementation/engineering tradeoff, wherein the SpiderMonkey authors realized they could reuse the baseline compiler backend to cheaply produce a new tier (a brilliant idea!), but again is not compatible with the WASI environment.

You might already be thinking: the above two points are not laws of nature -- nothing says that we can't interpret whatever code we would have JIT-compiled and executed in native SpiderMonkey. And you would be right: in fact, that's the starting point for the Portable Baseline Interpreter (PBL)!4

A Baseline without JIT: Portable Baseline

Here we can now introduce the Portable Baseline Interpreter, or PBL for short. PBL is a new execution tier that replaces the native "baseline interpreter" described above. Its key distinguishing feature is that it does not require any runtime code generation (JIT-compilation). Thus, it is suitable for use in a Wasm/WASI program, or in any other environment where runtime codegen is prohibited.

The key design principle of PBL is to stick as closely as possible to the other baseline tiers. In SpiderMonkey, significant shared machinery exists for the (existing) baseline interpreter and baseline compiler: there is a defined stack layout and execution state, there is code that understands how to garbage-collect, introspect, and unwind this state, and there are mechanisms to track the inline caches associated with baseline execution. PBL's goal at a technical level is to perform exactly the work that the (native) baseline interpreter would do, except in portable C++ code rather than runtime-generated code.

To achieve this goal, the two major tasks were:

  • Implementing a new interpreter loop over JS opcodes. We cannot use the generic interpreter tier's main loop (what SpiderMonkey calls the "C++ interpreter"), because the actions for each opcode in that implementation are "generic" -- they do not use ICs to specialize on types or other kinds of fast-paths -- and so are not suitable for our purposes. Likewise, we cannot use the baseline interpreter's main loop because it is generated at startup using the JIT backend, and so is not suitable for use in a context where we can only run portable C++.

    Instead, we need to implement a new interpreter loop whose actions for each opcode invoke ICs where appropriate -- exactly the actions that the baseline interpreter does, but written in portable code. This is superficially "simple", but turns out to require careful attention to many subtle details, because handwritten JIT-compiled code can control some aspects of execution much more precisely than C++ ordinarily can. (More on this below!)

  • Implementing an interpreter for CacheIR, the intermediate representation in which the "fast-path code" for IC stubs is represented. CacheIR opcodes encode the "guards", or preconditions necessary for a fast-path to apply, and the actions to perform. There are many CacheIR opcodes specialized to particular data structures or runtime state -- it is a heavily custom IR -- but this tight fit to SpiderMonkey's design is exactly what gives it its ability to concisely encode many fast-paths.

In principle, developing an interpreter for an IR that already has two compilers (to machine code and optimizing compiler IR (MIR)) should be relatively straightforward: we transliterate the actions that the compiled code is performing into a direct C++ implementation. In a system as complex as a JavaScript engine, though, nothing is ever quite "simple". Challenges encountered in implementing the CacheIR interpreter fall into two general categories: aspects of execution that cannot be directly replicated in C++ code, so need to be "emulated" in some way; and achieving practical performance by keeping the "virtual machine" model lightweight and playing some other tricks too. We'll give a few examples of each kind of challenge below.
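First, though, to make "interpreting CacheIR" concrete, here is a minimal sketch of a guard-then-action stub interpreter. The opcode names and operand model here are invented for illustration; real CacheIR is far richer and more specialized:

  // Minimal sketch of a guard-then-action stub interpreter, in the spirit of
  // interpreting CacheIR. Opcodes and operand model are invented for
  // illustration only.
  #include <cstdint>
  #include <vector>

  enum class StubOp : uint8_t {
    GuardLhsIsInt32,  // bail to the next stub unless lhs is an int32
    GuardRhsIsInt32,  // bail to the next stub unless rhs is an int32
    Int32Add,         // perform the specialized action
    ReturnResult,     // stub matched; produce its result
  };

  enum class StubOutcome { Matched, TryNextStub };

  struct ToyValue {
    bool isInt32;
    int32_t i32;
  };

  StubOutcome interpretStub(const std::vector<StubOp>& ops, ToyValue lhs,
                            ToyValue rhs, ToyValue* result) {
    int32_t acc = 0;
    for (StubOp op : ops) {
      switch (op) {
        case StubOp::GuardLhsIsInt32:
          if (!lhs.isInt32) return StubOutcome::TryNextStub;  // guard failed
          break;
        case StubOp::GuardRhsIsInt32:
          if (!rhs.isInt32) return StubOutcome::TryNextStub;  // guard failed
          break;
        case StubOp::Int32Add:
          acc = lhs.i32 + rhs.i32;  // (overflow handling elided)
          break;
        case StubOp::ReturnResult:
          *result = ToyValue{true, acc};
          return StubOutcome::Matched;
      }
    }
    return StubOutcome::TryNextStub;
  }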

Challenge: Baseline Stack Layout

The first challenge that arose consists of emulating the stack as the JIT-compiled code would have managed it. SpiderMonkey's baseline tiers build a series of stack frames with a carefully-defined format that can be traversed for various purposes: finding (and updating) garbage-collection roots, handling exceptions and unwinding execution, producing stack backtraces, providing information to the debugger API and allowing the debugger to update state, and so on.

The format of a single stack frame consists of a JitFrame that looks a lot like a "normal" function prologue's frame -- return address, previous frame pointer -- but also includes a "callee token" that the caller pushes, describing the called function at the JS level, and a "receiver" (the this value in JavaScript). Below that, a BaselineFrame records the JS bytecode virtual machine state in a known format, so it can be introspected: current bytecode PC, current IC slot, and so on. Below that, the JS bytecode VM's operand/value stack is maintained on the real machine stack. And, just before calling any other function, a "footer" descriptor is pushed: this denotes the kind of frame that just finished, so it can be handled appropriately.

This format has a very important property: it has no gaps. It is not simply a linked list of fixed-size descriptor or header structures. If it were, we could potentially place BaselineFrame / JitFrame instances on the C++ stack in PBL, and link them together with the previous-FP fields as normal. But that is not the case: every machine word of the baseline-format stack is accounted for, so that approach won't work.

Figure: baseline stack

This works fine for JIT-compiled code, because we control the code that is emitted and can maintain whatever stack-format we define. But because the C++ compiler owns and manages the machine stack layout when we are running in C++ code, PBL is not able to maintain the actual machine stack in this format.

Thus, we instead define an auxiliary stack, build a series of real baseline frames on it, and maintain this in parallel to the executing C++ code's actual machine stack. When we enter a new frame at the C++ level, we push a new frame on the auxiliary stack; when we return, we pop a frame.5 This auxiliary stack is what the garbage collector, exception unwinder, debugger, and other components introspect: we store pointers to its frames in the JIT state, and so on. As far as the rest of the engine is concerned, it is the real stack. The only major difference is that all return addresses are nullptrs: we don't need them, because we still manage control flow at the C++ level.
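As a rough sketch of the idea (the names and layout here are illustrative, not SpiderMonkey's actual frame formats):

  // Rough sketch of the auxiliary-stack idea: a separately allocated region
  // that PBL grows and shrinks in lockstep with C++-level calls and returns,
  // and that the rest of the engine treats as "the" baseline stack.
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct AuxStack {
    std::vector<uint8_t> storage;
    size_t top;  // grows downward, like a machine stack

    explicit AuxStack(size_t bytes) : storage(bytes), top(bytes) {}

    // Reserve `bytes` of frame data and return a pointer to it.
    void* push(size_t bytes) {
      top -= bytes;  // (overflow check elided in this sketch)
      return &storage[top];
    }
    void pop(size_t bytes) { top += bytes; }
  };

  // When PBL "calls" at the JS level, it lays out a frame on the auxiliary
  // stack in the same format JIT code would have produced -- callee token,
  // previous frame pointer, a null return address, the BaselineFrame data,
  // then the operand stack -- so the GC, unwinder, and debugger can walk it.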

Challenge: Unwinding

A second issue that arises from the differences between a native machine model and that of PBL is unwinding. In JIT code, where we have complete control over emitted instructions, the call stack is just a convention and we are free to skip over frames and jump to any code location we please. The exception unwinder uses this to great effect: when an exception is thrown, the runtime walks the stack and looks for any appropriate handler. This might be several call-frames up the stack. When one is found, it sets the stack pointer and frame pointer to within that frame -- effectively popping all deeper frames in one action -- and jumps directly to the appropriate program counter in that handler's frame. Unfortunately, this is not possible to do directly in portable C++ code.6

Instead, starting from the invariant that one C++ frame in the PBL interpreter function "owns" one (or more as an optimization -- see below) baseline frames, we implement a linear-time unwinding scheme: each C++ interpreter invocation remembers its "entry frame"; when unwinding, after an exception or otherwise, we compare the new frame-pointer value to this entry frame; if "above" (higher in address, for a downward-growing stack), we return from the interpreter function with a special code indicating an unwind is happening. The caller instance of the PBL interpreter function then performs the same logic, propagating upward until we reach the correct C++ frame. The following figure contrasts the native and PBL strategies:

Figure: native baseline and PBL unwinding

Thus, we do not have the same asymptotic O(1) unwind-efficiency guarantee that native baseline execution does, but we remain completely portable, able to execute anywhere that standard C++ runs.
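A minimal sketch of the propagation logic, with illustrative names:

  // Sketch of PBL's linear-time unwind propagation. `entryFramePtr` is the
  // auxiliary-stack frame this C++ interpreter invocation entered with; the
  // aux stack grows downward, so a higher address means an older frame.
  #include <cstdint>

  enum class InterpResult { Ok, Unwind };

  InterpResult handleUnwind(uint8_t* targetFramePtr, uint8_t* entryFramePtr) {
    if (targetFramePtr > entryFramePtr) {
      // The handler lives in a frame owned by one of our C++ callers: return
      // a special code; each caller repeats this check and keeps propagating.
      return InterpResult::Unwind;
    }
    // The handler is in a frame this invocation owns: reset the aux-stack top
    // to the handler's frame and resume dispatch at the handler's bytecode PC
    // (elided in this sketch).
    return InterpResult::Ok;
  }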

Challenge: VM exits

A third issue that often arose was that of emulating VM exits. On the native baseline platform, when JIT code is executing, the stack is "under construction", in a sense: the innermost frame is not complete (there is no footer descriptor word) and is not reachable from the VM data structures. JIT code can call back into the runtime only via a carefully-controlled "VM exit" mechanism, which pushes a special kind of "exit frame", records the end of the series of contiguous JIT frames (the "JIT activation"), and then invokes C++ code:


# JIT code:

  ...
  push arg2
  push arg1                # trampoline
  call VMHelper  ----->    push exit frame
                           cx->exitFP = fp
                           call VMHelperImpl    ----->  walkStack(cx->exitFP)
                                                        doStuff(cx)
                           pop exit frame       <-----  ret
  ...            <-----    ret

which results in a stack layout that looks like:

Figure: the baseline stack after a proper VM exit

While executing within the C++ PBL interpreter function, it is very tempting to simply call into the rest of the runtime as required. This results in a stack that looks like the below, and unfortunately breaks in all sorts of exciting and subtle ways: it may appear to work, but frames are missing and GC roots are not updated after a moving GC; or if the dangling exit FP is not null, an entirely bogus set of stack frames may be traced. Either way, various impossible-to-find bugs arise.

Figure: the baseline stack after calling into the runtime without a proper VM exit

PBL thus requires extreme discipline in separating "JIT-code mode" (or its emulation, in a portable C++ interpreter) and "runtime mode". To make this distinction clearer, I designed a type-enforced mechanism that leverages an important idiom in SpiderMonkey: every function that might perform a GC or otherwise introspect overall VM state will take a JSContext parameter. In the PBL interpreter function, we hide the JSContext (rename the local and set it to nullptr normally). We then have a helper RAII class that pushes an exit frame and does everything that a "VM exit" trampoline would do, then behaves as a restricted-scope local that implicitly converts to the true JSContext. This looks like the below:

  CASE(Opcode) {
    Value arg0 = POP().asValue();  // POP() is a macro that uses the `sp` local.
    Value arg1 = POP().asValue();
    // here, `sp` is the top-of-stack for our in-progress frame in our
    // in-progress activation. We are "in JIT code" from the engine's
    // perspective, even though this is still C++.
    
    {
      // This macro completes the activation and creates a `cx` local
      // that gives us the JSContext* for use.
      PUSH_EXIT_FRAME(); 
      
      if (!DoEngineThings(cx, arg0, arg1)) {
        goto error;
      }
      
    } // pops the exit frame.
    
    // ...
  }

This idiom works fairly well in practice and statically prevents us from making most kinds of stack-frame-related mistakes.
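For the curious, the helper might look something like the following in spirit -- a simplified sketch with hypothetical names, not the actual PBL implementation:

  // Simplified sketch, with hypothetical names, of an RAII exit-frame helper;
  // the real PBL code does more bookkeeping than shown here.
  struct JSContext;  // SpiderMonkey's context type (opaque for this sketch)

  class PushedExitFrame {
    JSContext* cx_;

   public:
    explicit PushedExitFrame(JSContext* realCx /* plus current frame state */)
        : cx_(realCx) {
      // Complete the in-progress baseline frame and push an exit-frame
      // descriptor on the auxiliary stack, exactly as the native VM-exit
      // trampoline would, so the runtime sees a complete, walkable stack.
    }
    ~PushedExitFrame() {
      // Pop the exit frame: from the engine's perspective we are back
      // "in JIT code".
    }

    // Only while this object is live do we hand out the real JSContext.
    operator JSContext*() const { return cx_; }
  };

  // PUSH_EXIT_FRAME() can then expand to something like
  //   PushedExitFrame cx(hidden_cx /* plus frame state */);
  // so that, inside the block, `cx` converts implicitly to JSContext*
  // wherever a runtime helper expects one.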

Optimization: Avoiding Function-Call Overhead

At this point, we have introduced techniques to enable PBL to run correctly; we now have a functioning JavaScript interpreter that can invoke ICs. (Take a breath and celebrate!) Unfortunately, I arrived at this point and found that performance still lagged behind that of the generic interpreter. How could this be, when ICs directly encode fast-paths and allow us to short-circuit expensive runtime calls?

The first realization came after profiling both a native build of PBL, and especially a Wasm build: C++ function calls can be expensive. The basic PBL design consisted of a JS interpreter that invoked the IC interpreter for every opcode with an IC -- a majority of them, in most programs (all numeric operators, property accesses, function calls, and so on!). Thus function calls are extremely frequent. Their high cost is for a few basic reasons:

  • When the interpreter function is large and has a lot of context (live variables), register pressure is high; when the called function is similar, we effectively have a full "context switch" (saving all register values and reusing the registers for new variables) on every call/return.

  • Splitting logic across multiple functions precludes optimizations that span the logic of both functions. For example, the IC interpreter "reified control flow as data" by returning an enum value that the JS interpreter then switched on. Combining the two functions would allow us to embed the switch-bodies directly where the return code is set.

  • On many Wasm implementations, including Wasmtime (my VM of choice and the main optimization target for our WASI port), function prologues have some extra cost: the generated code needs to check for stack overflow, and may need to check for interruption or preemption. This is a part of the cost of sandboxing that can only be avoided by staying within a single function frame.

Thus, it is very important to avoid function-call overhead whenever possible. I optimized this in two ways. First, the IC interpreter is aggressively inlined into the JS interpreter. This produces one super-interpreter that can run both kinds of bytecode -- JS bytecodes and CacheIR -- without extra frame setup at every IC site.

Second, and more important in practice, multiple JS frames are handled by one C++ frame (interpreter invocation). In a technique borrowed from SpiderMonkey's generic interpreter, when certain conditions are met, we handle a JS call opcode by pushing a JS frame and dispatching directly to the callee's first opcode without any C++-level call. (This may be an obvious implementation to anyone who has written an interpreter virtual machine before, but disentangling C++ frames and JS frames is actually not trivial at all, given the prologue/epilogue logic -- hence the required conditions!) This interacts in subtle ways with unwinding described above: it means that the mapping from JS to C++ frames is 1-to-many, and thus requires some care. (As a silver lining, however, the logic for a "return" is substantially similar to that for "unwind": we can use the same conditions to know when to return at the C++ level.)
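In the style of the earlier example, the call-handling case looks roughly like this (illustrative names; the real conditions and frame setup are more involved):

  // Sketch of handling a JS call opcode without a C++-level call. Names here
  // are illustrative; the real checks and frame setup are more involved.
  CASE(Call) {
    CalleeInfo callee = getCallee(/* operands on the value stack */);
    if (callee.isPlainInterpretedFunction() && callee.argCountMatches()) {
      pushBaselineFrame(auxStack, callee);  // new JS frame, same C++ frame
      pc = callee.firstBytecodePC();        // retarget interpreter state...
      sp = auxStack.operandStackTop();      // ...to the callee
      goto dispatch;                        // no C++ call, no new C++ frame
    }
    // Otherwise, make a real C++-level call into a fresh interpreter
    // invocation, which will own the callee's frames.
    if (interpretFunction(/* callee, args, ... */) == InterpResult::Unwind) {
      goto unwind;  // propagate unwinding, as described above
    }
    // ...
  }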

Optimization: Hybrid ICs

Having implemented all of the above techniques, I was still finding PBL to have somewhat disappointing performance numbers. Fortunately, one final insight came: perhaps the tradeoff of which operations are profitable to fast-path changes when the cost of the fast-path mechanism (an IC) itself changes?

For example: in native baseline execution, every arithmetic operator uses ICs to dispatch to type-specific behavior. The + operator, our favorite example, has possible fast-paths for integers, floating-point numbers, strings, and more. This is profitable in "native baseline" because the cost of an IC is extremely low: the JIT controls register allocation so it can effectively do global allocation across the function body and IC stubs by using special-purpose registers and a custom calling convention, and it can avoid generating any prologue/epilogue in the IC stubs themselves. As a result, ICs can literally be a handful of instructions: call, check type tag in registers 0 and 1, integer add, return. PBL, in contrast, is both emulating virtual-machine state (rather than using an optimized IC calling convention), and paying the interpreter-dispatch cost for every IC opcode.

So I ran a simple experiment: in a native PBL build, I added rdtsc (CPU time counter)-based timing measurements around execution of each JS opcode both in the generic interpreter and in PBL's interpreter loop, and binned the results by opcode type. The results were fascinating: property accesses (e.g., GetProp) were significantly faster with ICs, for example, but many simpler operators, like Add, were twice as slow.
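The harness for this was nothing fancy -- roughly the following shape, using the x86 rdtsc counter (approximate, with no serialization or frequency-scaling compensation, but good enough for a relative comparison between the two interpreters):

  // Sketch of the per-opcode timing experiment: wrap each opcode's execution
  // with rdtsc and bin elapsed cycles by opcode.
  #include <x86intrin.h>
  #include <array>
  #include <cstddef>
  #include <cstdint>

  constexpr std::size_t kNumOpcodes = 256;  // assume a byte-sized opcode space
  std::array<uint64_t, kNumOpcodes> cyclesByOpcode{};
  std::array<uint64_t, kNumOpcodes> countsByOpcode{};

  inline void timeOpcode(uint8_t opcode, void (*execute)()) {
    uint64_t start = __rdtsc();
    execute();  // run this opcode's handler
    uint64_t end = __rdtsc();
    cyclesByOpcode[opcode] += end - start;
    countsByOpcode[opcode] += 1;
  }

  // After a run, cyclesByOpcode[op] / countsByOpcode[op] gives the average
  // cost per dynamic execution of `op`, comparable across interpreters.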

Given this data, I then developed the "hybrid ICs" approach, namely: use ICs only where they help! For the Add operator, the PBL interpreter now has specific inline cases for integer and floating-point addition, and falls back to the generic interpreter's behavior (AddOperation) for everything else; it never invokes the IC chain, but rather skips over it entirely. This behavior is configurable -- with faster IC mechanisms in the future, we may be able to use ICs for these opcodes again, so the code for both strategies remains.
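Concretely, the Add case in PBL's interpreter loop now has roughly this shape (in the style of the earlier example; names other than AddOperation are illustrative, and its call signature is simplified here):

  // Sketch of the hybrid handling of the Add opcode. The IC chain for Add is
  // skipped entirely: cheap inline fast-paths first, generic fallback second.
  CASE(Add) {
    Value rhs = POP().asValue();
    Value lhs = POP().asValue();
    if (lhs.isInt32() && rhs.isInt32()) {
      int64_t sum = int64_t(lhs.toInt32()) + int64_t(rhs.toInt32());
      if (sum >= INT32_MIN && sum <= INT32_MAX) {  // no overflow: stay int32
        PUSH(Int32Value(int32_t(sum)));
        DISPATCH();  // advance to the next JS opcode
      }
      // int32 overflow: fall through to the generic path below
    } else if (lhs.isNumber() && rhs.isNumber()) {
      PUSH(DoubleValue(lhs.toNumber() + rhs.toNumber()));
      DISPATCH();
    }
    {
      // Strings, objects, mixed types, int32 overflow: do exactly what the
      // generic interpreter does, via a proper VM exit.
      PUSH_EXIT_FRAME();
      Value result;
      if (!AddOperation(cx, lhs, rhs, &result)) {  // signature simplified
        goto error;
      }
      PUSH(result);
    }
    DISPATCH();
  }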

The results were striking: PBL was finally showing significant speedups on almost all benchmarks. The final "hybrid IC" set includes:

  • Property accesses. These are extremely common in most JavaScript code, and can benefit from fast-path behavior whenever objects usually have the same "shape", or set of properties, at a given point. This is because the engine can encode a fast-path that directly accesses a particular memory offset in the object in memory, without looking up a property by name.

  • Calls. This is somewhat less intuitive: for an ordinary call to another JavaScript function, there is not much an IC can do -- we just need to update interpreter state to the callee and start dispatching. But for calls to built-in functions, as described above, the benefits can be huge: string and array operations, for example, transform from an expensive call into the runtime (through several layers of generic JS function-call logic) into just a few direct field accesses or other operations on a known object type.

Every other JS opcode is executed with generic logic.

Results

Enough description -- how well does it perform?

The best test of any language runtime or platform is a "real-world" use-case, and PBL has been fortunate to see some early adoption, where two real applications saw wall-clock CPU time reductions of 42% and 17%, respectively, when executing on a Wasm/WASI platform. That is quite significant and exciting, and is motivating adoption of, and further work on, PBL.

While developing PBL, I did most of my benchmarking with Octane, which is deprecated but still useful when hacking on the core of a JS engine (one just needs to give the appropriate caveats that benchmark speedups will have an uncertain correlation to real-world speedups). On Octane, PBL currently sees a 1.26x speedup (that is, throughput is 26% higher; or, equivalently, there is a runtime reduction of 1 - 1/1.26, or 21%). That is quite something as well, for a new engine tier that remains completely portable as a pure interpreter!

Because of these exciting results, and our future plans below, we have worked with the SpiderMonkey team themselves to plan upstreaming -- incorporating PBL into the main SpiderMonkey tree. This will ease maintenance because it will allow PBL to be updated and evolved (i.e., kept compiling and running) as SpiderMonkey itself does, will allow us to use SpiderMonkey without a heavy patch-stack on top, and will make PBL available for others to use as well. We believe it could be useful beyond the Wasm/WASI world: for example, high-security contexts that disallow JIT could benefit as well. The upstreaming code-review is in-progress and we look forward to completing it!

Future: Compiled Code

Note: this section describes my own thoughts and plans, but goes beyond what is currently being upstreamed into SpiderMonkey, and is not necessarily endorsed yet by upstream. My plan and hope is to develop the ideas to maturity and, if results hold up, propose additional upstreaming -- but that is further out.

PBL has an attractive simplicity as a pure interpreter, and has surprised us with speedups even under that restriction. However, the larger question, for me at least, has always been: how can we compile JS ahead-of-time in a performant way?

Recall that the main restriction of the WebAssembly platform is not that we can't generate code at all; it's just that all code, no matter the producer (the traditional Wasm compiler toolchain or our own JS tools), needs to be generated before any execution occurs.

SpiderMonkey's native baseline tiers hint at a way forward here. PBL as described above is roughly equivalent to the baseline interpreter (modulo the way that ICs are executed). Can we (i) produce compiled code for ICs, and (ii) do the equivalent of the baseline compiler, generating a specialized Wasm function for every JS function body?

In principle, this should be possible without information from execution, because it handles the type-specific specialization with the runtime dispatch inherent in the IC chains. In other words, types are late-binding, so we retain late-binding control flow to match.

This still requires us to know what possible ICs we might need, but here we can play a trick: we can collect many IC bodies ahead of time, and generate straight-line compiled Wasm functions for these IC bodies. This is more-or-less the trick we described in our post two years ago.

But all of this still implies developing a Wasm compiler backend. How does PBL help us at all? Isn't it a dead-end, if we are eventually able to compile JS source (which we typically have available ahead-of-time -- performance-critical eval() usage is rare) straight to specialized Wasm, with late-bound ICs?

The answer to that lies in partial evaluation. Over the past year I have developed a tool called weval that takes an interpreter in a Wasm module, with a few minor intrinsic-call annotations (to specify what to specialize, and to denote that memory storing bytecode is "constant" and can be assumed not to self-modify dynamically), and generates a Wasm module with specialized functions appended. This gives us a compiler "for free" once we have an interpreter, and PBL has been designed to be that interpreter.
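To give a flavor of what those annotations look like, here is a hypothetical sketch; the intrinsic names below are invented for illustration and are not weval's actual API:

  // Hypothetical sketch of the annotations a partial evaluator needs from an
  // interpreter. The intrinsic names are invented for illustration; they are
  // NOT weval's actual API.
  #include <cstdint>

  using InterpretFn = uint64_t (*)(const uint8_t* bytecode);

  // Promise that memory reachable from `bytecode` is constant (it will not be
  // mutated), so the specializer may fold loads from it.
  extern "C" const uint8_t* assume_const_memory(const uint8_t* bytecode);

  // Request an ahead-of-time specialized copy of `generic` for this
  // particular bytecode; the returned function behaves identically, but is
  // the unrolled, constant-folded version.
  extern "C" InterpretFn request_specialization(InterpretFn generic,
                                                const uint8_t* bytecode);

  uint64_t interpret(const uint8_t* bytecode) {
    bytecode = assume_const_memory(bytecode);
    // ... ordinary dispatch loop over `bytecode` ...
    // Because `bytecode` is known-constant, the specializer can unroll the
    // dispatch loop over the concrete opcodes and fold away the dispatch,
    // leaving straight-line specialized code: in effect, a compiler derived
    // from the interpreter.
    return 0;
  }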

In particular, the JS opcode and IC opcode interpreters in PBL were designed carefully to work efficiently with weval, and in a next step to the project (development branch), I have the whole thing working. Whereas pure-interpreter PBL got a 1.26x speedup on Octane, PBL with weval gets a 1.58x speedup, up to 2.4x or so, with a bunch of low-hanging fruit remaining that will hopefully push that number further.

This combination isn't quite ready for production use yet, but I continue to polish it, and we hope sometime early next year it will be ready, taking us to "conceptual parity" (if not engineering fine-tuning parity!) with SpiderMonkey's native baseline compiler. We have some more thoughts on going beyond that -- inlining ICs like WarpMonkey does, hoisting guards, and all the rest -- but more on that in due time.

Given all of that, one could compare PBL and PBL+weval to SpiderMonkey's existing tiers. Recall our table above:

| Name | Data Required | JS opcode dispatch | ICs | Optimization scope | CacheIR dispatch | Codegen at runtime |
|------|---------------|--------------------|-----|--------------------|------------------|--------------------|
| Generic (C++) interpreter | JS bytecode | Interpreter (C++) | None | None | N/A | No |
| Baseline interpreter | JS bytecode + IC data structures | Interpreter (generated at startup) | Dynamic dispatch | Special cases within one opcode via IC | Compiled | Yes (IC bodies, interp body) |
| Baseline compiler | JS bytecode + IC data structures | Compiled function body (1:1 with bytecode) | Dynamic dispatch | Special cases within one opcode via IC | Compiled | Yes (IC bodies, function bodies) |
| Optimizing compiler (WarpMonkey) | JS bytecode + warmed-up IC chains | Compiled function body (optimized) | Inlined | Across opcodes / whole function | Compiled | Yes (optimized function body) |

To which we could add the row:

| Name | Data Required | JS opcode dispatch | ICs | Optimization scope | CacheIR dispatch | Codegen at runtime |
|------|---------------|--------------------|-----|--------------------|------------------|--------------------|
| PBL (interpreter) | JS bytecode + IC data structures | Interpreter (C++) | Dynamic dispatch | Special cases within one opcode via IC | Interpreter (C++) | No |

And then, with weval and pre-collected ICs (but no profiling of the JS code!), we could have:

| Name | Data Required | JS opcode dispatch | ICs | Optimization scope | CacheIR dispatch | Codegen at runtime |
|------|---------------|--------------------|-----|--------------------|------------------|--------------------|
| PBL (wevaled) | JS bytecode + IC data structures | Compiled function body (1:1 with bytecode) | Dynamic dispatch | Special cases within one opcode via IC | Compiled | No (!!) |

which one will note is identical to the baseline-compiler row above, except that no runtime codegen is required.

Finally, if we have reliable profiling information, such as from a profiling run at build time, we could use this profile (just as one does in a standard C/C++ "PGO" or "profile-guided optimization" build) to inline the ICs. Note that this could be done in a way that is completely agnostic to the underlying interpreter, because IC invocations are just indirect calls: that is, it is also a semantics-preserving, independently-verifiable transform. Having done that, we would then have:

| Name | Data Required | JS opcode dispatch | ICs | Optimization scope | CacheIR dispatch | Codegen at runtime |
|------|---------------|--------------------|-----|--------------------|------------------|--------------------|
| PBL (wevaled + inlined ICs) | JS bytecode + IC data structures + warmed-up IC chains | Compiled function body (optimized) | Inlined | Across opcodes / whole function | Compiled | No (!!) |

which approximates WarpMonkey. Note that this will require significant additional engineering -- SpiderMonkey's native JITs, after all, embody engineer-centuries of effort (much of which we leverage by reusing its well-tuned CacheIR sequences, but much of which we can't) -- but is a clear path to allow for optimized JS without runtime code generation.

The thing that excites me most about this direction is that it is, in some sense, "deriving a JIT from scratch": we are writing down the semantics of the opcodes, and we're explicitly extracting fast-paths, but we're using semantics-preserving tools to go beyond that. (Weval's semantics are that it provides a function pointer to a specialized function that behaves identically to the original.) That allows us to decouple the correctness aspects of our work from performance, mostly, and makes life far simpler -- no more insidious JIT bugs, or divergence between the interpreter and compiler tiers. More to come!


Many, many thanks to Luke Wagner, Nick Fitzgerald, Trevor Elliott, Jamey Sharp, L Pereira, Michelle Thalakottur, and others with whom I've discussed these ideas over the past several years. Thanks to Luke, Nick, Lin Clark, and Matt Gaudet for feedback on this post. Thanks also to Trevor Elliott and Jake Champion for help in getting PBL integrated with other infrastructure, Jamey Sharp for ramping up efforts to fill out PBL's CacheIR opcode support, and the Mozilla SpiderMonkey team for graciously hearing our ideas and agreeing to upstream this work.

1

WebAssembly running in a browser could implement runtime code generation and loading by calling out to external JavaScript. In essence, it would generate a new Wasm module in memory, then call JS to instantiate that module into an instance that shares the current linear memory, and call into it. However, this is fundamentally a feature of the Web platform and not built-in to Wasm; and many Wasm platforms, especially those designed with security among untrusted co-tenants in mind, do not allow this and strictly enforce ahead-of-time compilation of fixed code instead.

2

Honestly, even after writing a new interpreter tier for SpiderMonkey, I couldn't tell you the answer to this. (I run the bytecode, I don't lower to it!) The language's semantics are something to marvel at, in the edge cases, and this is all the more reason to centralize on a few well-tested, well-engineered shared implementations.

3

Note that there are some details here omitted for simplicity. Most importantly, once inlining ICs, we need to hoist the guards, or the conditions that exist at the beginning of every IC, so that they are shared in common (or removed entirely, if we can prove they will always succeed). Consider a function that operates on floating-point numbers: every IC will be some form of "check if inputs are floats, do operation, tag result as float" but instead we could check that the function arguments are floats once, then propagate from "produce float" in one IC to "check if float" in the next. ICs enable this by expressing the necessary preconditions (checks) and postconditions (produced values and their types) for each operator, and inlining is necessary as well because it places everything in one code-path so it is in scope to be cross-optimized; but guard-hoisting and -elimination are JIT-compiler-specific optimizations.

4

One can see this idea as a variant of the interpreter quickening idea, in a way: the CacheIR sequences are a shorter or more efficient implementation of particular behavior that we rewrite the interpreted program to use (via the pluggable IC sites) as we learn more about its execution.

5

The correspondence isn't actually 1-to-1, unfortunately (that would have been much simpler!): instead, we sometimes push an additional frame for VM exits, and we also handle some calls "inline", pushing the frame and going right to the top of the dispatch loop again. The actual invariant is that every auxiliary stack frame is "owned" by one C++ function invocation, but there may be several such frames. It is thus a 1-to-many relationship.

6

Strictly speaking, we could have used setjmp()/longjmp() to implement similar constant-time unwinding. However, this interacts poorly with C++ destructors, and is also problematic -- that is, does not exist -- on WebAssembly with WASI. Eventually the exception handling proposal for Wasm may be directly usable for this purpose, but it is not finalized yet.