# Full-Stack Correctness in Wasm Eliminating Bugs Inside and Outside the Sandbox

Chris Fallin *(F5)* Invited Talk, WAW 2025

https://webassembly.org



Overview



https://webassembly.org



Overview

WebAssembly (abbreviated *Wasm*) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.





WebAssembly describes a memory-safe, sandboxed execution environment that may even be implemented inside existing JavaScript virtual machines. When embedded in the web, WebAssembly will enforce the same-origin and permissions security policies of the

#### **Announcing the Bytecode Alliance:** Building a secure by default, composable future for WebAssembly

https://bytecodealliance.org/articles/announcing-the-bytecode-alliance

#### **Announcing the Bytecode Alliance:** Building a secure by default, composable future for WebAssembly

secure by default,

encapsulated by default.

This gives us memory isolation between the two modules.

https://bytecodealliance.org/articles/announcing-the-bytecode-alliance

#### **Announcing the Bytecode Alliance:** Building a secure by default, composable future for WebAssembly

secure by default,

encapsulated by default.

This gives us memory isolation between the two modules.



https://bytecodealliance.org/articles/announcing-the-bytecode-alliance

# Wasmtime

#### A fast and secure runtime for WebAssembly

https://wasmtime/dev/

https://cranelift.dev/





# Wasmtime

#### A fast and secure runtime for WebAssembly

strongly focused on correctness and security.

https://wasmtime/dev/

https://cranelift.dev/





#### How WebAssembly Offers Secure **Development through Sandboxing**

Industry experts discuss why and how WebAssembly offers developers a significantly higher security bar than previous technologies.

https://thenewstack.io/how-webassembly-offers-secure-development-through-sandboxing/



#### How WebAssembly Offers Secure **Development through Sandboxing**

Industry experts discuss why and how WebAssembly offers developers a significantly higher security bar than previous technologies.

'It is very secure in that it uses sandboxed memory,

https://thenewstack.io/how-webassembly-offers-secure-development-through-sandboxing/



#### How WebAssembly Offers Secure **Development through Sandboxing**

Industry experts discuss why and how WebAssembly offers developers a significantly higher security bar than previous technologies.

'It is very secure in that it uses sandboxed memory,

"Those things make WebAssembly incredibly more secure as a starting point for development."

https://thenewstack.io/how-webassembly-offers-secure-development-through-sandboxing/



'It is very s

'It is very secure in that it uses sandboxed memory,

It is very secure

# It is very secure

# It is very secure

# Wasm engine and compiler engineers:









• April 21, 2021 was a beautiful morning in California...

- April 21, 2021 was a beautiful morning in California...
- "The daemon keeps segfaulting"

- April 21, 2021 was a beautiful morning in California...
- "The daemon keeps segfaulting" [this never happens]

- April 21, 2021 was a beautiful morning in California...
- "The daemon keeps segfaulting" [this never happens]
- "faults are coming from inside compiled Wasm code"

- April 21, 2021 was a beautiful morning in California...
- "The daemon keeps segfaulting" [this never happens]
- "faults are coming from inside compiled Wasm code"
- "I'm calling an incident"





#### • Summary: a miscompilation could result in a Wasm instance accessing memory addresses 2GiB prior to its linear memory in host address space (!)





#### host address space





4GiB + guard

memory addresses 2GiB prior to its linear memory in host address space (!)

i32.load offset=8

host address space





4GiB + guard

• Summary: a miscompilation could result in a Wasm instance accessing



host address space



| i64       | ••• | rdi,                   |
|-----------|-----|------------------------|
| i32       | add | ecx, # Wasm add        |
| d.i64 v1  |     | <pre># (nothing)</pre> |
| 64 v0, v2 | add | rdi, rcx               |
| 32 v3+8   | MOV | eax, [rdi+8]           |



• Summary: a miscompilation could result in a Wasm instance accessing



host address space



memory addresses 2GiB prior to its linear memory in host address space (!)

| i64       | ••• | rdi,                   |     |
|-----------|-----|------------------------|-----|
| i32       | add | ecx, # Wasm a          | ado |
| d.i64 v1  |     | <pre># (nothing)</pre> |     |
| 64 v0, v2 | add | rdi, rcx               |     |
| 32 v3+8   | MOV | eax, [rdi+8]           |     |

x86-64: implicit zero-extend on all 32-bit insts





• Summary: a miscompilation could result in a Wasm instance accessing



host address space



memory addresses 2GiB prior to its linear memory in host address space (!)

| i64       | rdi,                   |
|-----------|------------------------|
| i32       | add ecx, # Wasm ad     |
| d.i64 v1  | <pre># (nothing)</pre> |
| 64 v0, v2 | add rdi, rcx           |
| 32 v3+8   | mov eax, [rdi+8]       |

x86-64: implicit zero-extend on all 32-bit insts -> optimization: elide uextend









4GiB + guard

```
... rdi, ...
add ecx, ... # Wasm addr
     # (nothing)
add rdi, rcx
mov eax, [rdi+8]
```





4GiB + guard

```
... rdi, ...
add ecx, ... # Wasm addr
     # REGALLOC SPILL
     #
     # REGALLOC RELOAD
add rdi, rcx
mov eax, [rdi+8]
```





4GiB + guard

memory addresses 2GiB prior to its linear memory in host address space (!)

rdi, ... add ecx, ... # Wasm addr mov [rsp+K], rcx # SPILL # mov rcx, [rsp+K] # RELOAD add rdi, rcx mov eax, [rdi+8]





4GiB + guard

memory addresses 2GiB prior to its linear memory in host address space (!)

rdi, ... add ecx, ... # Wasm addr mov [rsp+K], rcx # SPILL # mov rcx, [rsp+K] # RELOAD add rdi, rcx mov eax, [rdi+8]

Optimization: spill/reload actual value width (important for f32/f64 in 128-bit XMM regs)







4GiB + guard

memory addresses 2GiB prior to its linear memory in host address space (!)



Optimization: spill/reload actual value width (important for f32/f64 in 128-bit XMM regs)



- Summary: a miscompilation could result in a Wasm instance accessing
- 1. Optimization: elide 32-to-64 zero-extends on x86-64 use implicit dest widening

memory addresses 2GiB prior to its linear memory in host address space (!)

rdi, ... add ecx, ... # Wasm addr mov [rsp+K], rcx # SPILL # ••• # RELOAD mov rcx, [rsp+K] add rdi, rcx mov eax, [rdi+8]



## CVE = 2021 = 32629

- Summary: a miscompilation could result in a Wasm instance accessing
- 1. Optimization: elide 32-to-64 zero-extends on x86-64 use implicit dest widening
- 2. Optimization: spill only actual value w

|       | add | [rsp-                     | #   | Wasm<br>ecx |   | dr<br>SPILL |
|-------|-----|---------------------------|-----|-------------|---|-------------|
| width | add | #<br>ecx,<br>rdi,<br>eax, | rcx |             | # | RELOA       |



- Summary: a miscompilation could result in a Wasm instance accessing
- 1. Optimization: elide 32-to-64 zero-extends on x86-64 use implicit dest widening
- 2. Optimization: spill only actual value v
- 3. Bug: use upper bits of register when technically undefined per IR->machine mapping

|    | add | #   | Wasm<br>ecx |   | dr<br>SPILL |
|----|-----|-----|-------------|---|-------------|
| ne | add | rcx |             | # | RELOA       |



- Summary: a miscompilation could result in a Wasm instance accessing
- 1. Optimization: elide 32-to-64 zero-extends on x86-64 use implicit dest widening
- 2. Optimization: spill only actual value w
- 3. Bug: use upper bits of register when technically undefined per IR->machin mapping

-> we elided uextend but value is still narrow

memory addresses 2GiB prior to its linear memory in host address space (!)

| ne | add | rdi,<br>ecx,<br>[rsp+<br># | #   |   | r<br>SPILL |
|----|-----|----------------------------|-----|---|------------|
|    | add |                            | rcx | # | RELOA      |



- Summary: a miscompilation could result in a Wasm instance accessing
- 1. Optimization: elide 32-to-64 zero-extends on x86-64 use implicit dest widening
- 2. Optimization: spill only actual value v
- 3. Bug: use upper bits of register when technically undefined per IR->machine mapping
- 4. Questionable choice: sign-extend on reload??

memory addresses 2GiB prior to its linear memory in host address space (!)

| ne | rdi,<br>add ecx, # Wasm a<br>mov [rsp+K], ecx                             |         |
|----|---------------------------------------------------------------------------|---------|
|    | <pre>#<br/>movsx rcx, [rsp+K]<br/>add rdi, rcx<br/>mov eax, [rdi+8]</pre> | # RELOA |



- Summary: a miscompilation could result in a Wasm instance accessing
  - Wasm addr: 3GiB

#### linear memory

host address space

4GiB + guard

memory addresses 2GiB prior to its linear memory in host address space (!)

.... rdi, ... add ecx, ... # Wasm addr mov [rsp+K], ecx # SPILL # movsx rcx, [rsp+K] # RELOAD add rdi, rcx mov eax, [rdi+8]



- Summary: a miscompilation could result in a Wasm instance accessing
  - Wasm addr: 3GiB

linear memory

native offset: -1GiB

host address space

4GiB + guard

memory addresses 2GiB prior to its linear memory in host address space (!)

rdi, ... add ecx, ... # Wasm addr mov [rsp+K], ecx # SPILL # ••• movsx rcx, [rsp+K] # RELOAD add rdi, rcx mov eax, [rdi+8]





• Summary: a miscompilation could result in a Wasm instance accessing



host address space



## memory addresses 2GiB prior to its linear memory in host address space (!)



native offset: -1GiB

#### 4GiB + guard



• Summary: a miscompilation could result in a Wasm instance accessing





## memory addresses 2GiB prior to its linear memory in host address space (!)





• Summary: a miscompilation could result in a Wasm instance accessing

memory addresses 2GiB prior to its linear memory in host address space (!)

- Summary: a miscompilation could result in a Wasm instance accessing

memory addresses 2GiB prior to its linear memory in host address space (!)

 Aftermath: emergency version bump internally; patch release; forcing function to develop our CVE release process in BA / Wasmtime (since exercised more!)



- Summary: a miscompilation could result in a Wasm instance accessing
- How can we avoid ever having this problem again?

memory addresses 2GiB prior to its linear memory in host address space (!)

 Aftermath: emergency version bump internally; patch release; forcing function to develop our CVE release process in BA / Wasmtime (since exercised more!)



- Summary: a miscompilation could result in a Wasm instance accessing
- How can we avoid ever having this problem again\*?

memory addresses 2GiB prior to its linear memory in host address space (!)

 Aftermath: emergency version bump internally; patch release; forcing function to develop our CVE release process in BA / Wasmtime (since exercised more!)



#### CVE-2023-26489

#### CVE-2023-26489

- Aside: it did happen again, two years later

• Summary: base + uextend(index << 3) folded to base + uextend(index) << 3 in x86-64 addressing mode selection; reach up to 34GiB beyond a memory

#### CVE-2023-26489

- Aside: it did happen again, two years later
- Summary: base + uextend(index << 3) folded to base + uextend(index) << 3 in x86-64 addressing mode selection; reach up to 34GiB beyond a memory
- One must imagine Sisyphus verification researchers happy

#### Efficient Software-Based Fault Isolation

Robert Wahbe

Steven Lucco

Susan L. Graham Thomas E. Anderson

One way to provide fault isolation among cooperating software modules is to place each in its own address space. However, for tightly-coupled modules, this solution incurs prohibitive context switch overhead. In this paper, we present a software approach to implementing fault isolation within a single address space.

#### SOSP 1993



#### Efficient Software-Based Fault Isolation

Steven Lucco Robert Wahbe

Thomas E. Anderson Susan L. Graham

SOSP 1993 We put many instances in a single address space, and add software checks inline, to enable fast context switching — essential for many workloads!

#### Efficient Software-Based Fault Isolation

Steven Lucco Thomas E. Anderson Susan L. Graham Robert Wahbe

SOSP 1993 • We put many instances in a single address space, and add software checks inline, to enable fast context switching — essential for many workloads!

• Browser: fast Wasm-to-JS interaction (~native func call) on one webpage

Server-side: extremely dense multi-tenant environments (timeslicing)

#### Efficient Software-Based Fault Isolation

Susan L. Graham Robert Wahbe Steven Lucco Thomas E. Anderson

- This is Wasm's secret superpower (tiny sandboxes <u>nanoprocesses</u>)

SOSP 1993 • We put many instances in a single address space, and add software checks inline, to enable fast context switching — essential for many workloads!

• Browser: fast Wasm-to-JS interaction (~native func call) on one webpage

Server-side: extremely dense multi-tenant environments (timeslicing)

#### Efficient Software-Based Fault Isolation

Susan L. Graham Robert Wahbe Steven Lucco Thomas E. Anderson

- This is Wasm's secret superpower (tiny sandboxes <u>nanoprocesses</u>)
- But we *must trust the compiler*

SOSP 1993 • We put many instances in a single address space, and add software checks inline, to enable fast context switching — essential for many workloads!

• Browser: fast Wasm-to-JS interaction (~native func call) on one webpage

Server-side: extremely dense multi-tenant environments (timeslicing)

• We do a lot to try to ensure correctness

- We do *a lot* to try to ensure correctness
  - Differential fuzzing against: Wasm spec interp, wasmi, V8

- We do a lot to try to ensure correctness
  - Differential fuzzing against: Wasm spec interp, wasmi, V8
  - Fuzzing with symbolic translation validation of register allocation

- We do *a lot* to try to ensure correctness
  - Differential fuzzing against: Wasm spec interp, wasmi, V8
  - Fuzzing with symbolic translation validation of register allocation
  - "Chaos testing" in compiler pipeline

- We do a lot to try to ensure correctness
  - Differential fuzzing against: Wasm spec interp, wasmi, V8
  - Fuzzing with symbolic translation validation of register allocation
  - "Chaos testing" in compiler pipeline
  - trial by fire in real world: Cranelift as rustc backend

- We do a lot to try to ensure correctness
  - Differential fuzzing against: Wasm spec interp, wasmi, V8
  - Fuzzing with symbolic translation validation of register allocation
  - "Chaos testing" in compiler pipeline
  - trial by fire in real world: Cranelift as rustc backend
- Somehow these CVEs still happen occasionally (~0.5 per year)

- We do a lot to try to ensure correctness
  - Differential fuzzing against: Wasm spec interp, wasmi, V8
  - Fuzzing with symbolic translation validation of register allocation
  - "Chaos testing" in compiler pipeline
  - trial by fire in real world: Cranelift as rustc backend
- Somehow these CVEs still happen occasionally (~0.5 per year)
- "I would simply prove the compiler correct"

• "I would simply prove the compiler correct"

• "I would simply prove the compiler correct"

## Challenge Accepted

### Anti-Goal

#### THE COMPCERT C COMPILER

#### 📥 Download CompCert C

Read the manual

CompCert C is a compiler for the C programming language. Its intended use is the compilation of life-critical and mission-critical software written in C and meeting high levels of assurance. It accepts most of the ISO C 99 language, with some exceptions and a few extensions. It produces machine code for the PowerPC, ARM, RISC-V and x86 (32 and 64 bits) architectures. Performance of the generated code is decent but not outstanding: on PowerPC, about 90% of the performance of GCC version 4 at optimization level 1.

What sets CompCert C apart from any other production compiler, is that it is *formally verified*, using machine-assisted mathematical proofs, to be exempt from *miscompilation* issues. In other words, the executable code it produces is proved to behave exactly as specified by the semantics of the source C program. This level of confidence in the

### Anti-Goal

#### THE COMPCERT C COMPILER

#### 📥 Download CompCert C

Read the r

CompCert C is a compiler for the C programming la compilation of life-critical and mission-critical soft levels of assurance. It accepts most of the ISO C 9 few extensions. It produces machine code for the bits) architectures. Performance of the generated PowerPC, about 90% of the performance of GCC ve

What sets CompCert C apart from any other produ verified, using machine-assisted mathematical pro issues. In other words, the executable code it proc specified by the semantics of the source C program

#### About

CakeML is a functional programming language and an ecosystem of proofs and tools built around the language. The ecosystem includes a proven-correct compiler that can bootstrap itself.

#### CAKEML **A Verified Implementation of ML**



#### Anti-Goal

#### THE COMPCERT C COMPILER

📥 Download CompCert C

Read the manual

CompCert C is a compiler for the C programming language. Its intended use is the compilation of life-critical and mission-critical software written in C and meeting high levels of assurance. It accepts most of the ISO C 99 language, with some exceptions and a few extensions. It produces machine code for the PowerPC, ARM, RISC-V and x86 (32 and 64 bits) architectures. Performance of the generated code is decent but not outstanding: on PowerPC, about 90% of the performance of GCC version 4 at optimization level 1.

What sets CompCert C apart from any other production compiler, is that it is *formally* verified, using machine-assisted mathematical proofs, to be exempt from miscompilation issues. In other words, the executable code it produces is proved to behave exactly as specified by the semantics of the source C program. This level of confidence in the

- demands on us; can't afford ~engineer-century of work)



#### About

CakeML is a functional programming language and an ecosystem of proofs and tools built around the language. The ecosystem includes a proven-correct compiler that can bootstrap itself.

#### Engineered from scratch for verification (we have ~200KLoC existing code)

Optimizations limited by provability (we don't want to limit perf too much)

Enormous manual effort (we're a tiny team and verification is one of many

### Potential Goals?

thoroughly?

end-to-end?

#### Can we verify a part of our compiler (where bugs are more common) more

Can we verify limited properties of the code (e.g. linear memory sandboxing)

### Potential Goals?

thoroughly?

 $\rightarrow$  SMT on instruction selector rules (ASPLOS'24)

end-to-end?

 $\rightarrow$  Proof-carrying code (ongoing)

Can we verify a part of our compiler (where bugs are more common) more

Can we verify limited properties of the code (e.g. linear memory sandboxing)

#### Outline

- Formal Verification in Instruction Selection
- Proof-Carrying Code for Sandboxing Logic
- Guest-Code Correctness

## Instruction Lowering Verification

#### BA RFC 15: ISLE instruction-selection (pattern-matching) DSL, Aug 2021

# Discussion: Future Implications for Verification

Though we have not yet worked out all the details, we are confident that the translation of rules expressed in the ISLE DSL into some machine-readable form for formal verification efforts should be possible. This is primarily because of the "equivalent-value" semantics that are inherent in a term-rewriting system. The denotational value of a term is the symbolic or concrete value produced by the instruction it represents (depending on the interpretation); so "all" we have to do is to write, e.g., pre/post-conditions for some SMT-solver or theorem-prover that describe the semantics of instruction terms on either side of the translation.

### Instruction Lowering Verification

#### BA RFC 18: Cranelift roadmap for 2022 (Dec 2021)

In the next year we should attempt to find some concrete ways to achieve formal verification of some part of the compiler. The instruction lowerings are the obvious choice, now that we have ISLE.

### Instruction Lowering Verification

 Dec 2021: contact from both Alexa VanHattum and Fraser Brown (+ Alexa's advisor Adrian Sampson and Fraser's student Monica Pardeshi)

Dec 2021: contact from both Alexa VanHattum and Fraser Brown Monica Pardeshi)



## (+ Alexa's advisor Adrian Sampson and Fraser's student

Academic collaboration acquired; let's go!

### Lightweight, Modular Verification for WebAssembly-to-Native Instruction Selection

Alexa VanHattum Wellesley College Wellesley, MA, USA av111@wellesley.edu Monica Pardeshi Carnegie Mellon University Pittsburgh, PA, USA mpardesh@andrew.cmu.edu

Adrian Sampson

Cornell University Ithaca, NY, USA asampson@cs.cornell.edu Chris Fallin Fastly San Francisco, CA, USA cfallin@fastly.com

Fraser Brown Carnegie Mellon University Pittsburgh, PA, USA fraserb@andrew.cmu.edu

### ASPLOS 2024

| 1 | (rule   |
|---|---------|
| 2 | (lower  |
| 3 | (a64_rc |

### (rotr <mark>x y</mark>)) otr I64 <mark>x y</mark>))

| 1 | (rule   |
|---|---------|
| 2 | (lower  |
| 3 | (a64_rc |

### Cranelift IR (CLIF)

rotr (rotate right)

### (rotr <mark>x y</mark>)) otr I64 <mark>x y</mark>))

| 1 | (rule   |
|---|---------|
| 2 | (lower  |
| 3 | (a64_rc |

### Cranelift IR (CLIF)

rotr (rotate right)

### (rotr <mark>x y</mark>)) otr I64 <mark>x y</mark>))

### aarch64 machine code

rotr (rotate right)

| 1 | (rule   |
|---|---------|
| 2 | (lower  |
| 3 | (a64_rc |

### Cranelift IR (CLIF)

### rotr (rotate right)

SMT (theory of bitvectors)

### (rotr <mark>x y</mark>)) otr I64 <mark>x y</mark>))

### aarch64 machine code rotr (rotate right) ↓ SMT (theory of bitvectors)

| 1 | (rule   |
|---|---------|
| 2 | (lower  |
| 3 | (a64_rc |

### Cranelift IR (CLIF)

### rotr (rotate right)

SMT (theory of bitvectors)

### (rotr <mark>x y</mark>)) otr I64 <mark>x y</mark>))

# aarch64 machine code rotr (rotate right) \$\$MT (theory of bitvectors)

| 1 | (rule   |
|---|---------|
| 2 | (lower  |
| 3 | (a64_rc |

### Cranelift IR (CLIF)

### rotr (rotate right)

SMT (theory of bitvectors)

### (rotr <mark>x y</mark>)) otr I64 <mark>x y</mark>))

### aarch64 machine code rotr (rotate right) ↓ SMT (theory of bitvectors)

(or counterexample)

### (lower (has\_type \$I64 (rotr x y)) ...)

lower has\_type value\_def Instruction

### InstructionData.BinaryOp (Op.Rotr)



### lower has\_type value def InstructionData.BinaryOp (Op.Rotr)

(a64 rotr x y)



### lower has\_type value def InstructionData.BinaryOp (Op.Rotr)

### InstResult.Inst Inst.AluRRR (AluOp.Rotr)



### lower has\_type value def InstructionData.BinaryOp (Op.Rotr)

InstResult.Inst Inst.AluRRR (AluOp.Rotr) put\_in\_reg x put in reg y



u64\_from\_imm64 1 InstructionData.BinaryOp (Op.Add) InstructionData.Const value\_def value\_def

> InstructionData.BinaryOp (Op.Mul) value\_def

- has\_type
  - lower
- InstResult.Inst
- Inst.AluRRR (AluOp.Madd)
- put\_in\_reg x put\_in\_reg y

put\_in\_reg z

some node lower some node some node

some node

some node

some node

some node some node

some node some node

some node some node some node some node some node some node some node some node some node some node some node some node some node some node some node some node some node some node some node lower some node some node some node some node some node some node some node

### Rust FFI (instruction emit primitives)

### Rust FFI (IR accessor primitives)

some node



### Rust FFI (IR accessor primitives)

some node some node spec some node some node some node spec some node some node spec some node some node lower spec some node some node spec some node some node

Rust FFI (instruction emit primitives)

| 2321 | (spec (cmp ty x         |
|------|-------------------------|
| 2322 | (provide (= r           |
| 2323 | (require                |
| 2324 | ( <mark>or</mark> (= ty |

y) result (subs ty x y)))

32) (= ty 64))))

cranelift/codegen/src/isa/aarch64/inst.isle



- Lots more to actually make this work!
  - Type-polymorphism in rules —> "instantiate" at concrete widths
  - Type-inference to use narrower bitvectors
  - Full system of specifying "model domain" values for ISLE values
  - Good ergonomics around showing counterexamples

- It finds real bugs
  - Reproduced x86-64 amode bug (CVE-2023-26489)
  - Arithmetic edge cases in divides, count-leading-sign of narrow values, boolean simplification rules, ...
  - Real counterexamples are invaluable
  - Ongoing extension work (especially: tying to real ISA semantics)
  - Ongoing discussions on how to integrate into our workflow to keep verified

But... can we verify something end-to-end? 

Wasm bytecode



### CLIF

# Instruction Selection

### VCode

# Allocation Register

### Machine Code

But... can we verify something end-to-end? 

Wasm bytecode



CLIF

# Instruction Selection

VCode

Allocation Register

Machine Code

Verified (in progress)

But... can we verify something end-to-end? 

Wasm bytecode



Verified (in progress) Translation Validation

But... can we verify something end-to-end? 

Wasm bytecode



(in progress)

Validation

But... can we verify something end-to-end? 

Wasm bytecode



+ integration/glue bugs!

Verified (in progress) Translation Validation

• "Prove the compiler correct" is Hard(tm)

- "Prove the compiler correct" is Hard(tm)
- Can we prove that all memory accesses in machine code access Wasm memories with valid bounds-checking (or other internal VM data)?

- "Prove the compiler correct" is Hard(tm)
- Can we prove that all memory accesses in machine code access Wasm memories with valid bounds-checking (or other internal VM data)?
- Build an *independent checker* that operates on machine code: compiler no longer in the TCB (!)

- "Prove the compiler correct" is Hard(tm)
- Can we prove that all memory accesses in machine code access Wasm memories with valid bounds-checking (or other internal VM data)?
- Build an independent checker that operates on machine code: compiler no longer in the TCB (!) NDSS 2021

Evan Johnson<sup>†</sup> David Thien<sup>†</sup> Yousef Alhessi<sup>†</sup> Shravan Narayan<sup>†</sup> Fraser Brown<sup>\*</sup> Sorin Lerner<sup>†</sup> Tyler McMullen<sup>\$</sup> Stefan Savage<sup>†</sup> Deian Stefan<sup>†</sup> <sup>♦</sup>Fastly Labs <sup>†</sup>UC San Diego \*Stanford

Доверя́й, но проверя́й: SFI safety for native-compiled Wasm

### Доверя́й, но проверя́й: SFI safety for native-compiled Wasm

Evan Johnson<sup>†</sup> David Thien<sup>†</sup> Yousef Alhessi<sup>†</sup> Shravan Narayan<sup>†</sup> Fraser Brown<sup>\*</sup> Sorin Lerner<sup>†</sup> Tyler McMullen<sup>¢</sup> Stefan Savage<sup>†</sup> Deian Stefan<sup>†</sup> <sup>†</sup>UC San Diego <sup>\*</sup>Stanford <sup>¢</sup>Fastly Labs

• It even operates on Cranelift!

### NDSS 2021

### Доверя́й, но проверя́й: SFI safety for native-compiled Wasm

Evan Johnson<sup>†</sup> David Thien<sup>†</sup> Yousef Alhessi<sup>†</sup> Shravan Narayan<sup>†</sup> Fraser Brown<sup>\*</sup> Sorin Lerner<sup>†</sup> Tyler McMullen<sup>¢</sup> Stefan Savage<sup>†</sup> Deian Stefan<sup>†</sup> <sup>†</sup>UC San Diego <sup>\*</sup>Stanford <sup>\*</sup>Fastly Labs

- It even operates on Cranelift!
- ... but not Wasmtime (older Lucet runtime)

### NDSS 2021

### Доверя́й, но проверя́й: SFI safety for native-compiled Wasm

Evan Johnson<sup>†</sup> David Thien<sup>†</sup> Yousef Alhessi<sup>†</sup> Shravan Narayan<sup>†</sup> Fraser Brown<sup>\*</sup> Sorin Lerner<sup>†</sup> Tyler McMullen<sup>¢</sup> Stefan Savage<sup>†</sup> Deian Stefan<sup>†</sup> <sup>†</sup>UC San Diego <sup>\*</sup>Stanford <sup>°</sup>Fastly Labs

- It even operates on Cranelift!
- ... but not Wasmtime (older Lucet runtime)
- ... and on the output of a much older (poorly optimizing) Cranelift

### NDSS 2021

### VeriWasm

stack safety (accesses to stackframes) — focus here on Wasm heap

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

## VerWasm

- stack safety (accesses to stackframes) focus here on Wasm heap

# rdi: heap base add eax, ... # an i32 Wasm address mov rbx, [rdi+rax+0x100]

# Verify memory safety (Wasm heap, funcref table), control-flow safety, and

• Key idea: lattice-based abstract interpretation over machine registers

## VeriWasm

- stack safety (accesses to stackframes) focus here on Wasm heap

# rdi: heap base add eax, ... # an i32 Wasm address mov rbx, [rdi+rax+0x100]

Verify memory safety (Wasm heap, functed table), control-flow safety, and

• Key idea: lattice-based abstract interpretation over machine registers rdi: HeapBase

## VeriWasm

- stack safety (accesses to stackframes) focus here on Wasm heap

rdi: HeapBase # rdi: heap base add eax, ... # an i32 Wasm address rax: Bounded4GB mov rbx, [rdi+rax+0x100]

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

• Key idea: lattice-based abstract interpretation over machine registers

- stack safety (accesses to stackframes) focus here on Wasm heap

# rdi: heap base add eax, ... # an i32 Wasm address mov rbx, [rdi+rax+0x100]

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

• Key idea: lattice-based abstract interpretation over machine registers

rdi: HeapBase rax: Bounded4GB access to

HeapBase + Bounded4GB -> valid heap address



- stack safety (accesses to stackframes) focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

- Verify memory safety (Wasm heap, funcref table), control-flow safety, and stack safety (accesses to stackframes) — focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - Prototyped after 2021 CVE for *limited* domain (one memory, no dynamic bounds checking)
  - 30% compile-time overhead

- Verify memory safety (Wasm heap, funcref table), control-flow safety, and stack safety (accesses to stackframes) — focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - Prototyped after 2021 CVE for *limited* domain (one memory, no dynamic bounds checking)
  - 30% compile-time overhead
  - What about 2023, and full production feature support?

- stack safety (accesses to stackframes) - focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - Multiple memories and tables

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

- stack safety (accesses to stackframes) - focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - Multiple memories and tables

### Lucet

•

r d l

Single heap (4GiB + guard)

vmctx (globals, misc state)

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

- stack safety (accesses to stackframes) - focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - Multiple memories and tables



Verify memory safety (Wasm heap, funcref table), control-flow safety, and

Wasmtime

vmctx (globals, memories, tables)

- stack safety (accesses to stackframes) focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - Multiple memories and tables



# Verify memory safety (Wasm heap, funcref table), control-flow safety, and

### 4GiB + guard

- stack safety (accesses to stackframes) focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - Multiple memories and tables



# Verify memory safety (Wasm heap, funcref table), control-flow safety, and

### 4GiB + guard

- stack safety (accesses to stackframes) focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - shared vs. non-shared, different guard region sizes, ...

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

Multiple memories and tables — import vs. inline, dynamic vs. statlc,

- stack safety (accesses to stackframes) focus here on Wasm heap
- Update to modern Cranelift + Wasmtime?
  - shared vs. non-shared, different guard region sizes, ...
  - Also, the optimizer got better!

Verify memory safety (Wasm heap, funcref table), control-flow safety, and

Multiple memories and tables — import vs. inline, dynamic vs. statlc,

Linear-time and -space verification

Linear-time and -space verification

add eax, ... mov rbx, [r8+rax] add r10, rax cmp rax, r9 cmovae r10, <zero'd reg> mov rcx, [r10] add r12, rax cmp rax, r11 cmovae r12, <zero'd reg> mov rdx, [r12]

Linear-time and -space verification

add eax, ... mov rbx, [r8+rax] add r10, rax cmp rax, r9 cmovae r10, <zero'd reg> mov rcx, [r10] add r12, rax cmp rax, r11 cmovae r12, <zero'd reg> mov rdx, [r12]

### Compute an i32

Linear-time and -space verification

add eax, ... mov rbx, [r8+rax] add r10, rax cmp rax, r9 cmovae r10, <zero'd reg> mov rcx, [r10] add r12, rax cmp rax, r11 cmovae r12, <zero'd reg> mov rdx, [r12]

### Compute an i32 Load from 4GiB-guard mem



Linear-time and -space verification 

> add eax, ... mov rbx, [r8+rax] add r10, rax cmp rax, r9 cmovae r10, <zero'd reg> mov rcx, [r10] add r12, rax cmp rax, r11 cmovae r12, <zero'd reg> mov rdx, [r12]

Compute an i32 Load from 4GiB-guard mem Bounds-check (Spectre) Load from dynamic mem



Linear-time and -space verification 

> add eax, ... mov rbx, [r8+rax] add r10, rax cmp rax, r9 cmovae r10, <zero'd reg> mov rcx, [r10] add r12, rax cmp rax, r11 cmovae r12, <zero'd reg> mov rdx, [r12]

Compute an i32 Load from 4GiB-guard mem Bounds-check (Spectre) Load from dynamic mem

How do we describe rax in the abstract domain??



Linear-time and -space verification 

> add eax, ... mov rbx, [r8+rax] add r10, rax cmp rax, r9 cmovae r10, <zero'd reg> mov rcx, [r10] add r12, rax cmp rax, r11 cmovae r12, <zero'd reg>

mov rdx, [r12] How do we describe rax in the abstract domain?? rax < 4GiB && rax < r9 && rax < r11

Compute an i32 Load from 4GiB-guard mem Bounds-check (Spectre) Load from dynamic mem









Linear-time and -space verification

add eax, ... mov rbx, [r8+rax] add r10, rax

Quadratic behavior!

cmovae r12, <zero'd reg> mov rdx, [r12] How do we describe rax in the abstract domain?? rax < 4GiB && rax < r9 && rax < r11

### Compute an i32 Load from 4GiB-guard mem



heck (Spectre)

dynamic mem









Linear-time and -space verification

add eax, ... mov rbx, [r8+rax] cmp rax, r9 cmovae r10, <zero'd reg>

Two separate parts combined later



Compute an i32







- Linear-time and -space verification
- Portable across ISAs

- Linear-time and -space verification
- Portable across ISAs
  - At least x86-64 and aarch64 (equally important production targets)

- Linear-time and -space verification
- Portable across ISAs
  - At least x86-64 and aarch64 (equally important production targets)
  - With most logic platform-independent

# ally important production targets)

- Linear-time and -space verification
- Portable across ISAs
  - At least x86-64 and aarch64 (equally important production targets)
  - With most logic platform-independent
    - ISA-specific work should encode instruction semantics, but that's it:

- Linear-time and -space verification
- Portable across ISAs
  - At least x86-64 and aarch64 (equally important production targets)
  - With most logic platform-independent
    - ISA-specific work should encode instruction semantics, but that's it:

mov rax,  $[r8 + 8*r9] \rightarrow rax = load(add(r8, scale(r9, 8)))$ 

- Linear-time and -space verification
- Portable across ISAs
  - At least x86-64 and aarch64 (equally important production targets)
  - With most logic platform-independent
    - ISA-specific work should encode instruction semantics, but that's it: mov rax, [r8 + 8\*r9] -> rax = load(add(r8, scale(r9, 8)))
      Idr x20, [x19, w20, uxtw] -> x20 = load(add(x19, uextend(x20, 32, 64)))

- Linear-time and -space verification
- Portable across ISAs
- Easy to keep up-to-date as optimizer is modified

- Linear-time and -space verification
- Portable across ISAs
- Easy to keep up-to-date as optimizer is modified
  - Adding clever rewrites might require more domain knowledge encoded

- Linear-time and -space verification
- Portable across ISAs
- Easy to keep up-to-date as optimizer is modified
  - Adding clever rewrites might require more domain knowledge encoded
  - ... but we must not have to modify individual rules or passes to work with the verifier

- Linear-time and -space verification
- Portable across ISAs
- Easy to keep up-to-date as optimizer is modified
- Prove safety of all memory loads+stores

- Linear-time and -space verification
- Portable across ISAs
- Easy to keep up-to-date as optimizer is modified
- Prove safety of all memory loads+stores
  - Easily delineate our safety condition: "loads and stores occur according to some description/understanding of the runtime's data layout"

- Linear-time and -space verification
- Portable across ISAs
- Easy to keep up-to-date as optimizer is modified
- Prove safety of all memory loads+stores
  - Easily delineate our safety condition: "loads and stores occur according to some description/understanding of the runtime's data layout"
    - This description is in the TCB; and the runtime (e.g. memory.grow) is; but the compiler is not

- Linear-time and -space verification
- Portable across ISAs
- Easy to keep up-to-date as optimizer is modified
- Prove safety of all memory loads+stores
- Fast enough to run in production (translation validation on all compilations)

# Spoiler: Work-in-Progress

- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead

# Spoiler: Work-in-Progress

- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead
  - First: dynamic and static bounds, separately

# Spoiler: Work-in-Progress

- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead
  - First: dynamic and static bounds, separately -> nope, GVN can combine

- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead
  - First: dynamic and static bounds, separately -> nope, GVN can combine
  - Second: lattice that includes both kinds of bounds -> nope, multiple dynamic memories

- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead
  - First: dynamic and static bounds, separately -> nope, GVN can combine
  - Second: lattice that includes both kinds of bounds -> nope, multiple dynamic memories
  - Third: Set-of-upper-and-lower-bounds -> nope, not scalable

- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead
  - First: dynamic and static bounds, separately -> nope, GVN can combine
  - Second: lattice that includes both kinds of bounds -> nope, multiple dynamic memories
  - Third: Set-of-upper-and-lower-bounds -> nope, not scalable
  - Fourth: inequality solver (matrices + Gaussian reduction) -> nope, not scalable



- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead
- I think I have something that will work, with a trick

- I've tried ~4 approaches; each time getting closer(?) on dynamic memories
  - What does work: static memories (like VeriWasm), over new Wasmtime data structures and Cranelift optimizations; 1% compile-time overhead
- I think I have something that will work, with a trick
- This is a workshop talk, after all!

- - George C. Necula
- School of Computer Science
- Carnegie Mellon University
- Pittsburgh, Pennsylvania 15213–3891
  - necula@cs.cmu.edu
- analysis of binary artifact



#### Proof-Carrying Code

#### POPL 1997

#### Key idea: compiler emits proof steps to check — simpler than from-scratch

- - George C. Necula
- School of Computer Science
- Carnegie Mellon University
- Pittsburgh, Pennsylvania 15213–3891
  - necula@cs.cmu.edu
- analysis of binary artifact
- Think of it like "typed assembly" + type-preserving compilation



#### Proof-Carrying Code

#### POPL 1997

#### Key idea: compiler emits proof steps to *check* — simpler than from-scratch

function u0:0(i64 vmctx, i64) fast { gv3 ! mem(mt0, 0x0, 0x0) = vmctx

> mt0 = struct 88 $mt1 = memory 0 \times 180000000$

block7(v0: i64, v1: i64):  $v^2 = ireduce.i^32 v^1$ v3 ! range(64, 0x0, 0xfffffff) = uextend.i64 v2v4 ! mem(mt1,  $0 \times 0$ ,  $0 \times 0$ ) = global value.i64 gv4 v5 ! mem(mt1, 0x0, 0xfffffff) = iadd v4, v3v6 = load.f64 little checked heap v5

- gv4 ! mem(mt1, 0x0, 0x0) = load.i64 notrap aligned readonly checked gv3+80
  - { 80: i64 readonly ! mem(mt1, 0x0, 0x0) }

function u0.0(i64 vmctx, i64) fast {

mt0 = struct 88 $mt1 = memory 0 \times 180000000$ 

```
block7(v0: i64, v1: i64):
    v^2 = ireduce.i^32 v^1
    v3 ! range(64, 0x0, 0xfffffff) = uextend.i64 v2
    v4 ! mem(mt1, 0 \times 0, 0 \times 0) = global value.i64 gv4
    v5 ! mem(mt1, 0x0, 0xfffffff) = iadd v4, v3
    v6 = load.f64 little checked heap v5
```

#### Given fact: first arg is vmctx

- gv3 ! mem(mt0, 0x0, 0x0) = vmctx gv4 ! mem(mt1, 0x0, 0x0) = load.i64 notrap aligned readonly checked gv3+80
  - { 80: i64 readonly ! mem(mt1, 0x0, 0x0) }

function u0:0(i64 vmctx, i64) fast { gv3 ! mem(mt0, 0x0, 0x0) = vmctxgv4 ! mem(mt1, 0x0, 0x0) = load.i64 notrap alignedreadonly checked gy2+80 mtO = struct 88{ 80: i64 readonly ! mem(mt1, 0x0, 0x0) }

 $\mathsf{mll} = \mathsf{memory} \quad \mathsf{wxrowwwww}$ 

block7(v0: i64, v1: i64): v2 = ireduce.i32 v1v3 ! range(64, 0x0, 0xfffffff) = uextend.i64 v2v4 ! mem(mt1,  $0 \times 0$ ,  $0 \times 0$ ) = global value.i64 gv4 v5 ! mem(mt1, 0x0, 0xfffffff) = iadd v4, v3v6 = load.f64 little checked heap v5

#### "memory types" describe layout

function u0:0(i64 vmctx, i64) fast { gv3 ! mem(mt0. 0x0. 0x0) = vmctx

mt0 = struct{ 80: i64 readonly  $mt1 = memory 0 \times 180000000$ 

block7(v0: i64, v1: i64): v2 = ireduce.i32 v1v3 ! range(64, 0x0, 0xffffff) = uextend.i64 v2v4 ! mem(mt1,  $0 \times 0$ ,  $0 \times 0$ ) = global value.i64 gv4 v5 ! mem(mt1, 0x0, 0xfffffff) = iadd v4, v3v6 = load.f64 little checked heap v5



mem(mt1, 0x0, 0x0)

#### facts on fields checked when loads are validated



function u0:0(i64 vmctx, i64) fast { gv3 ! mem(mt0, 0x0, 0x0) = vmctxgv4 ! mem(mt1, 0x0, 0x0) = load.i64 notrap aligned

mt0 = struct 88 $mt1 = memory 0 \times 180000000$ 

block7(v0: i64, v1: i64): iroduco i27 v1 V2 v3 ! range(64, 0x0, 0xfffffff) = uextend.i64 v2 mt1, 0x0, 0x0) – global value.i64 gv4 v5 ! mem(mt1, 0x0, 0xfffffff) = iadd v4, v3v6 = load.f64 little checked heap v5

- readonly checked gv3+80
- { 80: i64 readonly ! mem(mt1, 0x0, 0x0) }

*implicitly-validated* fact based on range



function u0:0(i64 vmctx, i64) fast { gv3 ! mem(mt0, 0x0, 0x0) = vmctx

> mt0 = struct 88 $mt1 = memory 0 \times 180000000$

block7(v0: i64, v1: i64):  $v^2 = ireduce.i^32 v^1$ v3 ! range(64, 0x0, 0xfffffff) = uextend.i64 v2 v4 ! mem(mt1, 0x0, 0x0) - global\_value.i64 gv4
v5 ! mem(mt1, 0x0, 0xffffffff) = iadd v4, v3 LUAU. 164 LILLE CHECKEU HEAP V5 v6

- gv4 ! mem(mt1, 0x0, 0x0) = load.i64 notrap aligned readonly checked gv3+80
  - { 80: i64 readonly ! mem(mt1, 0x0, 0x0) }

abstract-domain add operation

function u0:0(i64 vmctx, i64) fast { gv3 ! mem(mt0, 0x0, 0x0) = vmctx

> mt0 = struct 88 $mt1 = memory 0 \times 180000000$

block7(v0: i64, v1: i64):  $v^2 = ireduce.i^32 v^1$ v3 ! range(64, 0x0, 0xfffffff) = uextend.i64 v2v4 ! mem(mt1  $0 \times 0$   $0 \times 0$ ) = global value.i64 gv4 mem(mt1,  $0 \times 0$ ,  $0 \times fffffff$ ) = iadd v4, v3 V 5 v6 = load.f64 little checked heap v5

#### gv4 ! mem(mt1, 0x0, 0x0) = load.i64 notrap aligned readonly checked gv3+80

#### $\{ 80: i64 readonly ! mem(mt1, 0x0, 0x0) \}$

checked load permitted only when offset in-bounds for memory type (here 4GiB)





```
= global value.i64 gv1
                                            = global value.i64 gv2
                                            = icmp.i\overline{64} uge v2, v4
                                            = iadd.i64 v3, v2
v8 ! dynamic_mem(mt1, 0, gv2-1, nullable) = select_spectre_guard v5, v7, v6
                                             = load.i64 checked v8
```



```
= uextend.i64 v1
                                           = global value.i64 gv1
                                           = global value.i64 gv2
                                           = icmp.i\overline{64} uge v2, v4
                                           = iadd.i64 v3, v2
v8 ! dynamic mem(mt1, 0, gv2-1, nullable) = select spectre guard v5, v7, v6
                                           = load.i64 checked v8
```





Too many pieces to put together: conselect operator

Too many pieces to put together: compare; symbolic addr; symbolic bound;



- select operator
- optimizer

• Too many pieces to put together: compare; symbolic addr; symbolic bound;

Quadratic behavior arises from combination of these pieces when merged by



• Insight: if you can't solve the problem, change the problem

(carry through a "bounds-check" operator in the IR to a pseudo-machine-inst)





Insight: if you can't solve the problem, change the problem

(carry through a "bounds-check" operator in the IR to a pseudo-machine-inst)





- Insight: if you can't solve the problem, change the problem
- validation status

(carry through a "bounds-check" operator in the IR to a pseudo-machine-inst)

• Subtle but important impact: separate value identity for property with separate



- $v^2 = uextend.i64 v^1$
- v3 = global value.i64 gv1
- v4 = global value.i64 gv2
- v5 = dynamic bound.i64 v3, v2, v4

```
v6 = load.i64 checked v5
return v6
```

```
mov rax, ...
mov rsi, [rdi+...]
mov rcx, [rdi+...]
xor r8, r8
add rsi, rax
cmp rax, rcx
cmovae rsi, r8 ;; zero if out-of-bounds
mov rax, [rsi]
```

Emit dynamic bound as "pseudoinstruction" (bundled machine instructions) and check as one unit: can show that combined semantics correspond



- Can we do translation validation on regalloc separately?

We've verified only up to virtual register code (VCode) — regalloc still in TCB

add v0, v1 mov v3, [v2+v0\*8] mov [v4+v5], v3 ret

Register allocation: provide *abstraction* over real machine instructions with *virtual registers* 





add rax, rcx
mov r8, [r9+rax\*8]
mov [r10+r11], r8
ret



- Scan forward through code

add rax, rcx mov r8, [r9+rax\*8] mov [r10+r11], r8 ret

- Track "contents" of each register - Validate each arg gets expected vreg



- Scan forward through code

add rax, rcx mov r8, [r9+rax\*8] mov [r10+r11], r8 ret

input:  $rax = \{v0\}, rcx = \{v1\}, r9 = \{v2\},$  $r10 = \{v4\}, r11 = \{v5\}$  $rax = \{v0 (update)\}$  $r8 = {v3}$ 

- Track "contents" of each register - Validate each arg gets expected vreg



Aside: this is a fantastically effective way to write a new register allocator

- Aside: this is a fantastically effective way to write a new register allocator
  - Production regalloc often involves a lot of heuristics and edge-cases with funny constraints
  - I could not have found and resolved all edge-cases without it

- Aside: this is a fantastically effective way to write a new register allocator
  - Production regalloc often involves a lot of heuristics and edge-cases with funny constraints
  - I could not have found and resolved all edge-cases without it
  - So effective that this is the only test method for regalloc2 (no static suite)
    - We've never found a miscompilation due to RA in production in ~3 years



#### WebAssembly is Secure!





Sandbox boundary:

#### WebAssembly is Secure!



#### Sandbox boundary:

- Formal verification of instruction sel
- Translation validation of key parts (regalloc, memory sandboxing?)

#### WebAssembly is Secure!



#### Sandbox boundary:

- Formal verification of instruction sel
- Translation validation of key parts (regalloc, memory sandboxing?)



#### ... WebAssembly is Secure?



Sandbox boundary:

- Formal verification of instruction sel
- Translation validation of key parts (regalloc, memory sandboxing?)

#### ?? Secure?

#### WebAssembly is Secure?

What is the *threat model*? 



- What is the *threat model*?
  - Code attempting to exceed permissions on system: OK!



- What is the *threat model*?
  - Code attempting to exceed permissions on system: OK!
  - Code attempting to exceed permissions inside guest



- What is the *threat model*?
  - Code attempting to exceed permissions on system: OK!
  - Code attempting to exceed permissions *inside guest* 
    - Exploit runtime / language implementation bugs to...
      - …Observe other requests' data



- What is the *threat model*?
  - Code attempting to exceed permissions on system: OK!
  - Code attempting to exceed permissions *inside guest* 
    - Exploit runtime / language implementation bugs to...
      - ... Observe other requests' data
      - ... Subvert authorization logic
      - ...Inject malicious content
- We still need a correct language implementation for application security



#### Cloudbleed :=

Talk Article

From Wikipedia, the free encyclopedia

**Cloudbleed** was a **Cloudflare** buffer overflow disclosed by Project Zero on February 17, 2017. Cloudflare's code disclosed the contents of memory that contained the private information of other customers, such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.<sup>[1]</sup> As a



#### 5 languages ~ ΣV

Tools 🗸

#### Cloudbleed :=

Talk Article

From Wikipedia, the free encyclopedia

**Cloudbleed** was a **Cloudflare buffer overflow** disclosed by Project Zero on February 17, 2017. Cloudflare's code disclosed the contents of memory that contained the private information of other customers, such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.<sup>[1]</sup> As a

Defense-in-depth: per-request isolation -> even a buggy runtime cannot allow cross-user leakage



#### 5 languages ~

Tools 🗸

#### Cloudbleed :=

Talk Article

From Wikipedia, the free encyclopedia

**Cloudbleed** was a **Cloudflare buffer overflow** disclosed by Project Zero on February 17, 2017. Cloudflare's code disclosed the contents of memory that contained the private information of other customers, such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.<sup>[1]</sup> As a

Defense-in-depth: per-request isolation -> even a buggy runtime cannot allow cross-user leakage AKA: put the Wasm sandbox boundary between requests



#### 5 languages ~

Tools 🗸

• Step 1: use someone else's runtime

- Step 1: use someone else's runtime
  - Our friends at Mozilla do lots of fuzzing, testing, security things; SpiderMonkey is battle-tested

- Step 1: use someone else's runtime
  - Our friends at Mozilla do lots of fuzzing, testing, security things; SpiderMonkey is battle-tested
- Step 2: avoid the JIT compiler of that runtime

- Step 1: use someone else's runtime
  - Our friends at Mozilla do lots of fuzzing, testing, security things; SpiderMonkey is battle-tested
- Step 2: avoid the JIT compiler of that runtime
  - Optimization logic bugs (especially type confusion) account for many CVEs

- Step 1: use someone else's runtime
  - Our friends at Mozilla do lots of fuzzing, testing, security things; SpiderMonkey is battle-tested
- Step 2: avoid the JIT compiler of that runtime
  - Optimization logic bugs (especially type confusion) account for many CVEs
  - But don't you want performance?! (yes... we'll get there)

- Step 1: use someone else's runtime
  - Our friends at Mozilla do lots of fuzzing, testing, security things; SpiderMonkey is battle-tested
- Step 2: avoid the JIT compiler of that runtime
  - Optimization logic bugs (especially type confusion) account for many CVEs
  - But don't you want performance?! (yes... we'll get there)
  - ... and it's all we can run on Wasm anyway, today

- Step 1: use someone else's runtime
  - Our friends at Mozilla do lots of fuzzing, testing, security things; SpiderMonkey is battle-tested
- Step 2: avoid the JIT compiler of that runtime
  - Optimization logic bugs (especially type confusion) account for many CVEs
  - But don't you want performance?! (yes... we'll get there)
  - ... and it's all we can run on Wasm anyway, today
- Step 3: ... fix performance

- Step 1: use someone else's runtime
  - Our friends at Mozilla do lots of fuzzing, testing, security things; SpiderMonkey is battle-tested
- Step 2: avoid the JIT compiler of that runtime
  - Optimization logic bugs (especially type confusion) account for many CVEs
  - But don't you want performance?! (yes... we'll get there)
  - ... and it's all we can run on Wasm anyway, today
- Step 3: ... fix performance (???)

implementation of a language



- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance



- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- An interpreter is *portable* and thus can be developed on native platforms

- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- **Overview**

#### Build and test passing

rr is a lightweight tool for recording, replaying and debugging execution of applications (trees of processes and threads). Debugging extends gdb with very efficient reverse-execution, which in combination with standard gdb/ x86 features like hardware data watchpoints, makes debugging much more fun. More information about the

#### An interpreter is the easiest — and thus most likely to be correctly written —

#### An interpreter is *portable* — and thus can be developed on native platforms

- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- An interpreter is *portable* and thus can be developed on native platforms **Overview** 
  - rr currently requires either:
    - An Intel CPU with Nehalem (2010) or later microarchitecture.
    - Certain AMD Zen or later processors (see https://github.com/rr-debugger/rr/wiki/Zen)



- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- An interpreter is *portable* and thus can be developed on native platforms Native x86\_64-linux rr pert gdb valgrind asan

- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- An interpreter is *portable* and thus can be developed on native platforms Native x86\_64-linux Wasmtime rr gdb kind of works pert gdb valgrind asan

- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- An interpreter is *portable* and thus can be developed on native platforms Native x86\_64-linux Wasmtime rr gdb kind of works perf but steps through runtime too gdb valgrind asan



- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- An interpreter is *portable* and thus can be developed on native platforms Native x86\_64-linux Wasmtime rr gdb kind of works perf but steps through runtime too gdb **DWARF** transform assertion failures valgrind asan



- implementation of a language
- Typical tiering architectures in JITs mean that interpreters are "simple" focus on correctness rather than (too much) performance
- An interpreter is *portable* and thus can be developed on native platforms Native x86\_64-linux Wasmtime rr gdb kind of works perf but steps through runtime too gdb **DWARF** transform assertion failures valgrind asan





# Single Source of Truth

- Interpreter will exist anyway (JIT tiers, fallback); is easier to get right; works fine on Wasm (naturally portable); it's just... slow
- Can we keep the interpreter as the only language implementation, and somehow derive a compiler from it?

#### **Compiler Backend?**

```
switch(*pc++) {
  case ADD:
    auto a = pop();
    auto b = pop();
    push(a + b);
    break;
  case RET:
    return pop();
```

func: ADD RET

#### **Compiler Backend?**

switch(\*pc++) { case ADD: auto a = pop();auto b = pop();push(a + b);break; case RET: return pop();

#### func: ADD RET

#### func() auto a = pop();auto b = pop(); push(a + b);return pop();

#### **Compiler Backend?**



Key insight: Wasm is a small, introspectable, well-behaved IR; *partial evaluation* should be tractable (moreso than on native code)

#### weval: Partial Evaluation of Wasm

to some constant inputs (namely, interpreted bytecode)

• Key idea: produce specializations of functions in a Wasm module with respect



### weval: Partial Evaluation of Wasm

- Key idea: produce *specializations* of functions in a Wasm module with respect to some constant inputs (namely, interpreted bytecode)
  - Very very very important guiding principle: *no magic*, only semanticspreserving transforms; specialized function behaves identically to original



### weval: Partial Evaluation of Wasm

- Key idea: produce *specializations* of functions in a Wasm module with respect to some constant inputs (namely, interpreted bytecode)
  - Very very very important guiding principle: *no magic*, only semanticspreserving transforms; specialized function behaves identically to original
  - Gives us a compiler "for free" once we have an interpreter



## **Specialization Intrinsics**

void interp(bytecode\* pc) {

```
while (true) {
   switch (*pc++) {
    case OP1:
```

```
break;
case OP2:
```

 $\bullet \bullet \bullet$ 

 $\bullet \bullet \bullet$ 

break;

#### **Specialization Intrinsics**

void interp(bytecode\* pc) {

```
while (true) {
  switch (*pc++) {
    case OP1:
```

```
break;
case OP2:
```

 $\bullet \bullet \bullet$ 

 $\bullet \bullet \bullet$ 

break;

```
void interp(bytecode* pc) {
  weval::push context(pc);
  while (true) {
    switch (*pc++) {
      case OP1:
```

weval::update context(pc); break; case OP2:

weval::update context(pc); break;

## **Specialization Intrinsics**

- 1. "No magic": only expand code where interpreter specifies via context mechanism
- 2. Partially evaluate iterations of the interpreter loop in a contextsensitive way, where the context is the bytecode PC
- 3. ... and that's it.

```
void interp(bytecode* pc) {
 weval::push context(pc);
  while (true) {
    switch (*pc++) {
      case OP1:
```

weval::update context(pc); break; case OP2:

weval::update context(pc); break;

This works; shipping in StarlingMonkey runtime; ~2-3x speedups

- This works; shipping in StarlingMonkey runtime; ~2-3x speedups
- We are deriving a JIT from first principles from an interpreter

- This works; shipping in StarlingMonkey runtime; ~2-3x speedups
- We are deriving a JIT from first principles from an interpreter
- We are avoiding doing anything special or language/JIT-engine-specific

- This works; shipping in StarlingMonkey runtime; ~2-3x speedups
- We are deriving a JIT from first principles from an interpreter
- We are avoiding doing anything special or language/JIT-engine-specific
- We think we can get more optimizations by writing semantics-preserving rules
  - E.g., profile-guided speculative inlining + box-unbox elision to get typespecialized unboxing in JS

- Correct software is a never-fully-attained goal (realistically)
- But we can carefully delineate abstraction boundaries and validate them separately



- Observation: limited formal methods can be practical in practice
  - SMT-based checking of compiler lowering rules
  - Symbolic checker of register allocator
  - Maybe? proof-carrying code for sandboxing logic



- Observation: limited formal methods can be practical in practice
  - SMT-based checking of compiler lowering rules
  - Symbolic checker of register allocator
  - Maybe? proof-carrying code for sandboxing logic
- Observation: meta-compilers (deriving compilers from simpler representations) can be practical in practice
  - weval is much smaller than the full compiler-to-Wasm would have been



- Observation: limited formal methods can be practical in practice
  - SMT-based checking of compiler lowering rules
  - Symbolic checker of register allocator
  - Maybe? proof-carrying code for sandboxing logic
- Observation: meta-compilers (deriving compilers from simpler representations) can be practical in practice
  - weval is much smaller than the full compiler-to-Wasm would have been
- Wasm has set an excellent precedent for explicit semantics, static typing, and focus on small clean abstractions



#### Thanks! Questions?