Compilers and compiler infrastructure
(Day job) Cranelift compiler framework, Dec 2019 – present.
Developed a new framework for compiler backends, and the initial AArch64 (ARM64) backend using this framework. Developed a new register allocator (regalloc2), including a novel symbolic checker that uses abstract interpretation to make fuzzing highly effective. Designed and implemented a new pattern-matching DSL (domain-specific language), ISLE, which we now use to write all of our backend lowering rules. Implemented an e-graph (equivalence graph)-based mid-end optimization framework, putting all important rewrites into one fixpoint loop, and built infrastructure to express those rewrites in ISLE as well. (To our knowledge this is the first production use of e-graphs in a general-purpose compiler.) Over the past few years the compiler has become much faster, generates much better code, and has grown to four backends (x86-64, aarch64, riscv64, s390x). From 2020 until late 2022 I acted as tech lead; I am spending more time on other projects now but remain actively involved.
All of this work was done with excellent and brilliant collaborators: Julian Seward, Benjamin Bouvier, Nick Fitzgerald, Jamey Sharp, Trevor Elliott, Andrew Brown, and many others!
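At the heart of any e-graph is the idea that rewrites *add* equivalences rather than destructively replacing expressions, with a union-find merging equivalent value ids. The sketch below illustrates just that core mechanism in isolation; it is a generic union-find with path compression, not Cranelift's actual data structure, and the `x * 2` / `x << 1` ids are hypothetical examples.

```rust
// Minimal sketch of the union-find at the core of an e-graph: value ids
// that a rewrite proves equivalent are merged into one e-class, so later
// passes can pick the best representative. Illustrative only -- not
// Cranelift's actual implementation.

#[derive(Default)]
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    /// Create a fresh value id in its own singleton e-class.
    fn add(&mut self) -> usize {
        let id = self.parent.len();
        self.parent.push(id);
        id
    }

    /// Find the canonical id for `x`'s e-class, compressing paths.
    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    /// Record that `a` and `b` are equivalent (a rewrite fired).
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

fn main() {
    let mut uf = UnionFind::default();
    let x_mul_2 = uf.add(); // stands in for `x * 2`
    let x_shl_1 = uf.add(); // stands in for `x << 1`
    // A strength-reduction rewrite proves them equivalent:
    uf.union(x_mul_2, x_shl_1);
    assert_eq!(uf.find(x_mul_2), uf.find(x_shl_1));
    println!("ids {} and {} share an e-class", x_mul_2, x_shl_1);
}
```

Because equivalences only accumulate, all rewrites can run in one fixpoint loop without worrying about pass ordering, which is the property the mid-end framework exploits.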
(Day job) waffle WebAssembly SSA compiler framework, Nov 2021 – present.
For use in Wasm-to-Wasm transform tools, I built an SSA compiler framework with a Wasm frontend (constructing IR from Wasm) and a Wasm backend (lowering from IR back to Wasm). The backend in particular involves nontrivial algorithmic work to reconstruct structured control flow and to do “register allocation” of SSA values over Wasm locals. Work on ergonomics and performance is ongoing, but correctness is the first priority: the project includes rigorous differential fuzzing to ensure that a roundtrip through IR and optimizations is semantics-preserving.
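To give a flavor of the “register allocation over Wasm locals” problem: each SSA value has a live range, and a local can be reused once the previous value stored in it is dead. The sketch below is a simple greedy linear-scan over live intervals under that assumption; it is an illustration of the idea, not waffle's actual algorithm.

```rust
// Hedged sketch: assign each SSA value (given as a [start, end) live
// interval, sorted by start) to a Wasm local, reusing a local as soon
// as its previous occupant is dead. Illustrative linear-scan only --
// not waffle's real allocator.

fn assign_locals(intervals: &[(u32, u32)]) -> Vec<u32> {
    let mut free: Vec<u32> = Vec::new();          // locals available for reuse
    let mut active: Vec<(u32, u32)> = Vec::new(); // (end, local) still live
    let mut next_local = 0u32;
    let mut assignment = Vec::with_capacity(intervals.len());

    for &(start, end) in intervals {
        // Expire intervals that ended at or before this start point.
        active.retain(|&(e, l)| {
            if e <= start {
                free.push(l);
                false
            } else {
                true
            }
        });
        // Reuse a dead value's local if possible; otherwise mint a new one.
        let local = free.pop().unwrap_or_else(|| {
            let l = next_local;
            next_local += 1;
            l
        });
        active.push((end, local));
        assignment.push(local);
    }
    assignment
}

fn main() {
    // v0 live [0,4), v1 live [1,2), v2 live [2,5):
    // v1 dies before v2 starts, so v2 can reuse v1's local.
    let a = assign_locals(&[(0, 4), (1, 2), (2, 5)]);
    println!("{:?}", a); // [0, 1, 1]
}
```

The real problem is harder than this sketch suggests: locals are typed, control flow affects liveness, and the structured-control-flow reconstruction interacts with where values can be materialized.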
(Day job) Port of SpiderMonkey to Wasm/WASI, early 2021.
Ported SpiderMonkey to the wasm32-wasi platform, which mainly required stubbing out the right platform-specific parts. Subsequent work investigated techniques to build IC (inline cache) chains, which ordinarily are not used in SpiderMonkey’s portable interpreter; someday I hope to get back to this!
Hardware design language with novel semantics for pipeline idioms (see Autopiper under Computer Architecture below).
Program analysis, programming language design, and memory management/ownership
(Academic research) PhD dissertation contributing two main ideas: (i) an alias analysis that works on loops, determining when pointers refer to unique objects in each iteration of a given loop (“distinctness”), in particular reasoning about common data structures such as lists, key-value maps, and sets; and (ii) a way to combine static analysis and dynamic observation in a principled way, essentially automatically deriving a hybrid static-dynamic system from the static analysis’ inference rules in a regular way.
This research culminated in some results showing that “irregular” programs, i.e. those that are not just loop nests on arrays but are built around other interesting data structures, often have more parallelism available than a conventional autoparallelizing compiler would reveal.
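A concrete way to see what distinctness buys: if the analysis can prove that each iteration of a loop touches a distinct object, the iterations are independent and can run in parallel. The sketch below (my own illustration, not the dissertation's analysis) hand-parallelizes such a loop over distinct heap objects using scoped threads; `process` is a hypothetical per-object computation.

```rust
use std::thread;

// Illustration of the parallelism that per-iteration distinctness
// exposes: every element is a distinct heap object, so the loop body
// can safely run on disjoint halves of the data concurrently.

/// Hypothetical per-object work (a little hash mixing).
fn process(node: &mut u64) {
    *node = node.wrapping_mul(2654435761).rotate_left(13);
}

/// The parallelized form of `for node in nodes { process(node) }`.
fn parallel_update(nodes: &mut Vec<Box<u64>>) {
    let mid = nodes.len() / 2;
    let (left, right) = nodes.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|n| process(n)));
        s.spawn(|| right.iter_mut().for_each(|n| process(n)));
    });
}

fn main() {
    let mut nodes: Vec<Box<u64>> = (0..8u64).map(Box::new).collect();
    parallel_update(&mut nodes);
    println!("{:?}", nodes.iter().map(|n| **n).collect::<Vec<_>>());
}
```

An autoparallelizing compiler can only emit this transformation if it knows no two iterations alias the same object, which is exactly the fact that distinctness analysis establishes for list, map, and set traversals.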
(Side-project with academic publication) Deferred borrows: a proposal to extend Rust’s ownership and borrowing system to allow for logically memory-safe references to items in collections (vectors, hash maps, and the like) without the restrictions that the borrow system would impose. Logical memory safety means that a reference will always refer to the same item: in particular, it is tied (at the type-system level) to a particular collection, and the collection is constrained in a way that prevents the item from being removed.
These “deferred borrows” build on path-dependent types and provide most of the benefits of true borrows, i.e., non-dangling references to objects in a way that cannot be broken, without imposing onerous limits on which borrows can be simultaneously active (e.g. no more than one mutable borrow at a time).
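A rough sketch of the shape of the idea in today's Rust: a typed handle into an append-only collection resolves to the same item forever, yet holding the handle does not borrow the collection, so arbitrarily many handles can coexist with mutation. All the names here (`Handle`, `AppendOnlyVec`) are mine for illustration; the published proposal goes further, tying each handle to one specific collection via path-dependent types, which this sketch omits.

```rust
use std::marker::PhantomData;

// Toy illustration of deferred borrows: `Handle` is a typed index into
// an append-only vector. Items are never removed or reordered, so a
// handle always resolves to the same item ("logical memory safety"),
// but holding one imposes no borrow on the collection.

struct Handle<T> {
    index: usize,
    _marker: PhantomData<T>,
}

struct AppendOnlyVec<T> {
    items: Vec<T>,
}

impl<T> AppendOnlyVec<T> {
    fn new() -> Self {
        Self { items: Vec::new() }
    }

    /// Append an item; the returned handle is valid forever.
    fn push(&mut self, item: T) -> Handle<T> {
        let index = self.items.len();
        self.items.push(item);
        Handle { index, _marker: PhantomData }
    }

    fn get(&self, h: &Handle<T>) -> &T {
        &self.items[h.index]
    }

    fn get_mut(&mut self, h: &Handle<T>) -> &mut T {
        &mut self.items[h.index]
    }
}

fn main() {
    let mut v = AppendOnlyVec::new();
    let a = v.push(String::from("hello"));
    let b = v.push(String::from("world"));
    // Both handles remain usable across mutation -- no borrow conflict,
    // unlike two live `&mut` borrows.
    v.get_mut(&a).push('!');
    println!("{} {}", v.get(&a), v.get(&b)); // hello! world
}
```

The missing piece, and the point of the paper, is preventing a handle from one collection from being used on another: that is what the path-dependent typing supplies statically.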
(Side research project) Autopiper, a high-level synthesis (HLS) tool that compiles a sequential, imperative programming language to a digital logic hardware description in Verilog. This project was an experiment in bridging sequential stateful computation semantics and pipelined hardware with a notion of “transactional state”: the compiler reasons about state across time and pipeline stages, and can generate bypass logic, speculation (and recovery from misspeculation), and maybe eventually automatically perform other optimizations. This was an attempt to formalize a lot of the design patterns that a computer architect working on a high-performance out-of-order microarchitecture would use, effectively “deriving from first principles” a CPU core from ISA semantics. Lofty goal and somewhat unfinished but all the primitives are there!
(Academic research) Heterogeneous Block Architecture, a proposal for a high-performance, energy-efficient CPU microarchitecture that incorporates a conventional out-of-order superscalar core and a VLIW (very long instruction word) in-order core that runs prescheduled code based on dynamic recordings of the superscalar core’s activity. The work proposes both a framework to combine multiple execution substrates (the two types of core) into one hybrid core that transparently runs a single thread of the user’s program, built around dynamically formed atomic superblocks, and the hardware to communicate control flow and live-in/live-out values between blocks.
To evaluate this work, I built a very detailed microarchitectural simulator that runs unmodified x86-64 programs. The simulator is an execute-at-execute design, meaning that it splits instructions into uops and sends these through a simulated pipeline with real program values. It performs speculation, computes values down a speculative program path, and recovers from misspeculation as the hardware does. (A simpler approach often used instead is to execute the program normally, record a trace of its instructions, and then feed these into a timing model that only has to be mostly correct and need not handle speculation. This simpler approach is faster but is not quite accurate enough for detailed microarchitectural work.) Unfortunately the full simulator is not publicly available, but I have released x86instlib, the x86 model underlying the decode and execute stages (in turn based on parts of PTLsim).
Dynamic translation-based stack machine: a pipelined CPU in Verilog with special hardware register stack handling to allow execution of a virtual stack machine, together with dynamic translation firmware that JITs the bytecode to the native core. This was an undergrad computer architecture final project.
(Day job) contributions to Wasmtime, a WebAssembly virtual machine / runtime that uses Cranelift as its compiler backend. In addition to my work on Cranelift, I’ve done work on Wasmtime’s memory management and instantiation performance: using copy-on-write mmaps and making as much data structure initialization lazy as possible, in order to get instantiation down into the low-microsecond range. I also built the “epoch interruption” mechanism, a much lower-cost alternative to “fuel” that allows for bounded runtime with periodic yields/interrupts in order to work well with high-concurrency event-loop architectures.
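The epoch idea can be sketched in a few lines: compiled code does a cheap relaxed load of a shared counter at loop back-edges and yields once the counter passes its deadline, while a timer thread bumps the counter; no per-instruction fuel accounting is needed. The sketch below is my own host-language illustration of that check, not Wasmtime's actual implementation, and `run_guest` stands in for compiled Wasm code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hedged sketch of epoch interruption: the "guest" loop checks a shared
// epoch counter at each back-edge and yields when it reaches the
// deadline. A separate ticker thread advances the epoch periodically.

/// Stand-in for compiled guest code; returns how many iterations ran
/// before the epoch deadline forced a yield.
fn run_guest(epoch: &AtomicU64, deadline: u64) -> u64 {
    let mut iterations = 0u64;
    loop {
        iterations += 1; // stand-in for a unit of guest work
        // The cheap check a compiler would emit at loop back-edges:
        if epoch.load(Ordering::Relaxed) >= deadline {
            return iterations; // yield back to the host
        }
    }
}

fn main() {
    let epoch = Arc::new(AtomicU64::new(0));
    let ticker = {
        let epoch = Arc::clone(&epoch);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(10));
            epoch.fetch_add(1, Ordering::Relaxed); // the periodic "tick"
        })
    };
    let done = run_guest(&epoch, 1);
    ticker.join().unwrap();
    println!("guest yielded after {} iterations", done);
}
```

The contrast with fuel is that fuel decrements and tests a counter on essentially every operation, whereas the epoch scheme amortizes the cost down to one load and compare per back-edge, at the price of coarser-grained interruption.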
(Day job) contributions to protobuf (protocol buffers, a serialization format often used as an in-memory data structure as well) as a member of the team at Google. I co-developed arena allocator support and led its launch internally, which involved modifying the (often very hot) generated code while avoiding any performance or heap-space regressions in production workloads; making use of the more efficient allocation infrastructure sometimes led to significant performance wins. The discussions surrounding API semantics and compatibility were especially interesting.
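The core win of arena allocation is easy to show in miniature: many small objects are bump-allocated into large chunks and all freed at once, skipping per-object heap bookkeeping and destructor calls. The toy typed arena below is a generic illustration of that pattern, not protobuf's C++ arena API.

```rust
// Toy bump-style typed arena: objects go into fixed-capacity chunks and
// are all freed together when the arena drops. Illustrative only -- not
// protobuf's actual arena implementation.

struct Arena<T> {
    chunks: Vec<Vec<T>>,
}

impl<T> Arena<T> {
    const CHUNK: usize = 1024;

    fn new() -> Self {
        Self { chunks: vec![Vec::with_capacity(Self::CHUNK)] }
    }

    /// Allocate one object. Chunks never reallocate (capacity is fixed),
    /// so there is no per-object heap traffic beyond the occasional new
    /// chunk.
    fn alloc(&mut self, value: T) -> &mut T {
        if self.chunks.last().unwrap().len() == Self::CHUNK {
            self.chunks.push(Vec::with_capacity(Self::CHUNK));
        }
        let chunk = self.chunks.last_mut().unwrap();
        chunk.push(value);
        chunk.last_mut().unwrap()
    }

    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }
}

fn main() {
    let mut arena = Arena::new();
    for i in 0..3000 {
        let v = arena.alloc(i);
        *v += 1; // use the freshly allocated object
    }
    println!("allocated {} objects in {} chunks", arena.len(), arena.chunks.len());
    // All 3000 objects are freed together when `arena` goes out of scope.
}
```

For message-heavy server workloads the same pattern means a whole request's worth of message objects can be torn down in one step, which is where the significant performance wins mentioned above came from.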
(Side project) speck, the small, portable, efficiently-coded kernel: a hobby operating system developed in 2001–2002, when I was much younger. It runs on bare x86 and implements preemptive multitasking, message passing, and dynamic kernel-module loading.