Compilers and compiler infrastructure
(Day job) Cranelift compiler framework, Dec 2019 – present. Developed, with Julian Seward and Benjamin Bouvier, a new framework for compiler backends in Cranelift (including new instruction-selection and register-allocation implementations), and a fully functional AArch64 (ARM64) backend built on this framework. Novel aspects include an abstract-interpretation-based checker for our register allocator, used together with fuzzing to great effect in finding bugs; and a fast CFG-lowering and branch-folding/optimization algorithm that emits final code directly from a CFG without intermediate flattening or branch-editing steps. The framework is designed to make new backends easy to develop, and has already drawn interest from community members working on backends for the ARM32 and IBM z-Series architectures.
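To give a flavor of the register-allocator checker, here is a minimal sketch (hypothetical types and names, far simpler than the real checker) of the core abstract-interpretation idea: after allocation, walk the code while tracking which virtual register each physical register symbolically holds, and verify that every use reads the value the allocator claimed would be there.

```rust
use std::collections::HashMap;

type VReg = u32; // virtual register id
type PReg = u32; // physical register id

// Post-allocation instructions, annotated with the allocator's claims.
enum Inst {
    // `preg` is defined and is claimed to hold `vreg` afterwards.
    Def { preg: PReg, vreg: VReg },
    // `preg` is read and is claimed to hold `vreg` at this point.
    Use { preg: PReg, vreg: VReg },
    // A register-to-register move inserted by the allocator.
    Move { from: PReg, to: PReg },
}

// Abstract state: which vreg (if any) each preg symbolically contains.
fn check(insts: &[Inst]) -> Result<(), String> {
    let mut state: HashMap<PReg, VReg> = HashMap::new();
    for inst in insts {
        match *inst {
            Inst::Def { preg, vreg } => {
                state.insert(preg, vreg);
            }
            Inst::Use { preg, vreg } => match state.get(&preg) {
                Some(&held) if held == vreg => {}
                other => {
                    return Err(format!(
                        "use of p{} expected v{}, found {:?}",
                        preg, vreg, other
                    ));
                }
            },
            Inst::Move { from, to } => {
                // Moves propagate symbolic contents; moving an unknown
                // value makes the destination unknown too.
                match state.get(&from).copied() {
                    Some(v) => { state.insert(to, v); }
                    None => { state.remove(&to); }
                }
            }
        }
    }
    Ok(())
}
```

A fuzzer then only has to generate programs and allocator inputs; any allocation that violates the symbolic invariant is flagged without needing to execute the code.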
This compiler backend implements all of the WebAssembly MVP spec, reference types (precise garbage collection support), and a number of other features, and should ship as the optimizing Wasm backend in Firefox 82 on AArch64 (late Sept 2020) if all goes well.
Hardware design language with novel semantics for pipeline idioms (see Autopiper under Computer Architecture below).
Program analysis, programming language design, and memory management/ownership
(Academic research) PhD dissertation contributing two main ideas: (i) an alias analysis that works on loops, determining when pointers refer to unique objects in each iteration of a given loop (“distinctness”), in particular reasoning about common data structures such as lists, key-value maps, and sets; and (ii) a principled way to combine static analysis and dynamic observation, essentially deriving a hybrid static-dynamic system automatically and systematically from the static analysis's inference rules.
This research culminated in results showing that “irregular” programs, i.e. those that are not just loop nests on arrays but are built around other interesting data structures, often have more parallelism available than a conventional autoparallelizing compiler would reveal.
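A tiny example of the distinctness property (my illustration, not code from the dissertation): in the loop below, each iteration touches a different object, because a key-value map holds at most one value per key. An analysis that can prove this knows the iterations never alias and could, in principle, run in parallel.

```rust
use std::collections::HashMap;

struct Record {
    key: u64,
    count: u64,
}

fn bump_all(table: &mut HashMap<u64, Record>) {
    // `rec` is distinct in every iteration: no two iterations of this
    // loop can ever see the same Record, so the updates are independent.
    for rec in table.values_mut() {
        rec.count += 1;
    }
}
```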
(Side project with academic publication) Deferred borrows: a proposal to extend Rust’s ownership and borrowing system to allow logically memory-safe references to items in collections (vectors, hashmaps and the like) without the restrictions that the borrow system would impose. Logical memory safety means that a reference will always refer to the same item: in particular, it is tied (at the type-system level) to a particular collection, and the collection is constrained in a way that prevents the item from being removed.
These “deferred borrows” build on path-dependent types and provide most of the benefits of true borrows, i.e., non-dangling references to objects in a way that cannot be broken, without imposing onerous limits on which borrows can be simultaneously active (e.g. no more than one mutable borrow at a time).
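A rough sketch of the idea (a hypothetical API of my own, not the paper's actual design, which ties the handle to its collection at the type level via path-dependent types): a handle names one item in one collection, the collection's API only grows, and the real borrow is created only at the moment of use.

```rust
// A collection whose API offers no removal, so indices never dangle.
struct AppendOnlyVec<T> {
    items: Vec<T>,
}

// A "deferred borrow": logically a reference to one item, but it holds
// no live borrow of the collection while it is merely stored around.
#[derive(Clone, Copy)]
struct Handle {
    index: usize,
}

impl<T> AppendOnlyVec<T> {
    fn new() -> Self {
        Self { items: Vec::new() }
    }

    // The only way to obtain a Handle is by inserting; since there is
    // no remove(), every Handle stays valid for the collection's life.
    fn push(&mut self, value: T) -> Handle {
        self.items.push(value);
        Handle { index: self.items.len() - 1 }
    }

    // Deferred resolution: the actual &mut borrow is created here, at
    // use time, so many Handles can coexist without borrow conflicts.
    fn get_mut(&mut self, h: Handle) -> &mut T {
        &mut self.items[h.index]
    }
}
```

Unlike the real proposal, this sketch does not statically tie a `Handle` to the particular collection instance it came from; that binding is exactly what the path-dependent typing in the paper provides.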
(Side research project) Autopiper, a high-level synthesis (HLS) tool that compiles a sequential, imperative programming language to a digital logic hardware description in Verilog. This project was an experiment in bridging sequential stateful computation semantics and pipelined hardware with a notion of “transactional state”: the compiler reasons about state across time and pipeline stages, and can generate bypass logic, speculation (and recovery from misspeculation), and could eventually perform other optimizations automatically. This was an attempt to formalize many of the design patterns that a computer architect working on a high-performance out-of-order microarchitecture would use, effectively “deriving from first principles” a CPU core from ISA semantics. Lofty goal and somewhat unfinished but all the primitives are there!
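To illustrate the bypass pattern Autopiper generates (a toy software model of my own, vastly simpler than real generated hardware): an in-flight write to architectural state is forwarded to a younger reader immediately, rather than making the reader wait for writeback.

```rust
struct Pipeline {
    regs: [u64; 4],                  // committed architectural state
    in_flight: Option<(usize, u64)>, // a write still in the pipeline
}

impl Pipeline {
    // Read with bypass: prefer an in-flight value over committed state.
    fn read(&self, r: usize) -> u64 {
        match self.in_flight {
            Some((wr, v)) if wr == r => v, // forwarded from the older op
            _ => self.regs[r],
        }
    }

    // An older op produces a value; it is visible via bypass at once.
    fn issue_write(&mut self, r: usize, v: u64) {
        self.in_flight = Some((r, v));
    }

    // Writeback: the in-flight value commits to architectural state.
    fn writeback(&mut self) {
        if let Some((r, v)) = self.in_flight.take() {
            self.regs[r] = v;
        }
    }
}
```

In the "transactional state" view, the programmer writes plain sequential reads and writes; the compiler decides which reads must be fed by forwarding paths like this one.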
(Academic research) Heterogeneous Block Architecture, a proposal for a high-performance, energy-efficient CPU microarchitecture that incorporates a conventional out-of-order superscalar core and a VLIW (very long instruction word) in-order core that runs prescheduled code based on dynamic recordings of the superscalar core’s activity. The work proposes a framework for combining multiple execution substrates (the two types of core) into one hybrid core that transparently runs a single thread of the user’s program, using dynamically formed atomic superblocks and hardware that communicates control flow and live-in/live-out values between blocks.
To evaluate this work, I built a very detailed microarchitectural simulator that runs unmodified x86-64 programs. The simulator is an execute-at-execute design, meaning that it splits instructions into uops and sends these through a simulated pipeline with real program values. It performs speculation, computes values down a speculative program path, and recovers from misspeculation as the hardware does. (A simpler approach often used instead is to execute the program normally, record a trace of its instructions, and then feed these into a timing model that only has to be mostly correct and need not handle speculation. This simpler approach is faster but is not quite accurate enough for detailed microarchitectural work.) Unfortunately the full simulator is not publicly available, but I have released x86instlib, the x86 model underlying the decode and execute stages (in turn based on parts of PTLsim).
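The speculation-and-recovery behavior described above can be sketched as follows (a heavily simplified illustration of mine, not the simulator's code): the model computes real values down the predicted path, and restores a checkpoint of register state on a misprediction, just as the simulated hardware would.

```rust
#[derive(Clone)]
struct ArchState {
    regs: [i64; 4],
}

struct Machine {
    state: ArchState,
    checkpoint: Option<ArchState>,
}

impl Machine {
    // Take a checkpoint when entering a speculative region (e.g. past
    // a predicted branch), so misspeculated writes can be undone.
    fn speculate(&mut self) {
        self.checkpoint = Some(self.state.clone());
    }

    // Execute-at-execute: speculative ops compute real values and
    // write them into live state.
    fn exec_add(&mut self, dst: usize, a: usize, b: usize) {
        self.state.regs[dst] = self.state.regs[a] + self.state.regs[b];
    }

    // The branch resolved as predicted: discard the checkpoint.
    fn commit(&mut self) {
        self.checkpoint = None;
    }

    // Misprediction: roll every speculative write back.
    fn squash(&mut self) {
        if let Some(cp) = self.checkpoint.take() {
            self.state = cp;
        }
    }
}
```

A trace-driven simulator never needs `squash` at all, which is exactly why it cannot model wrong-path effects.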
Dynamic translation-based stack machine: a pipelined CPU in Verilog with special hardware register stack handling to allow execution of a virtual stack machine, together with dynamic translation firmware that JITs the bytecode to the native core. This was an undergrad computer architecture final project.
(Day job) contributions to protobuf (protocol buffers, a serialization format often used as an in-memory data structure as well) as a member of the team at Google. I co-developed arena allocator support and led its launch internally, which involved modifying the (often very hot) generated code while avoiding any performance or heap-space regressions in production workloads; making use of the more efficient allocation infrastructure sometimes led to significant performance wins. The discussions surrounding API semantics and compatibility were especially interesting.
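Protobuf's arena support is C++ code; the sketch below is a generic bump-arena in Rust (my illustration, names hypothetical) showing why arenas win: each allocation is a pointer bump into a preallocated chunk, and freeing is one bulk operation when the arena goes away, instead of per-object heap traffic.

```rust
struct Arena {
    chunk: Vec<u8>,
    offset: usize,
}

impl Arena {
    fn with_capacity(cap: usize) -> Self {
        Self { chunk: vec![0; cap], offset: 0 }
    }

    // Allocate `size` bytes at alignment `align` (a power of two):
    // just round the offset up and bump it. Returns the offset of the
    // allocation, or None if the chunk is full (a real arena would
    // chain a new chunk here instead of failing).
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let start = (self.offset + align - 1) & !(align - 1);
        if start + size > self.chunk.len() {
            return None;
        }
        self.offset = start + size;
        Some(start)
    }
}
// Dropping the Arena releases every allocation at once.
```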
(Side project) speck, the small, portable, efficiently-coded kernel: a hobby operating system developed in 2001–2002, when I was much younger. It runs on bare x86 and implements preemptive multitasking, message passing, and dynamic kernel-module loading.