
Stack-switching: Corruption of CallThreadState when host function called on continuation #13322

@alexcrichton

Description


This input:

use wasmtime::*;

fn main() -> Result<()> {
    let mut config = Config::new();
    config.wasm_stack_switching(true);
    config.wasm_function_references(true);
    config.wasm_exceptions(true);
    let engine = Engine::new(&config)?;
    let mut store = Store::new(&engine, ());

    let module = Module::new(
        &engine,
        r#"
        (module
          (type $ft (func))
          (tag $t (type $ft))
          (type $ct (cont $ft))

          (import "host" "h" (func $h_import))

          (func (export "inner"))

          (func $callee
            (call $h_import)
            (suspend $t)
          )
          (elem declare func $callee)

          (func (export "go") (result i32)
            (block $h (result (ref null $ct))
              (resume $ct (on $t $h) (cont.new $ct (ref.func $callee)))
              (return (i32.const 0))
            )
            (drop)
            (i32.const 1)
          )
        )
    "#,
    )?;

    let h = Func::wrap(&mut store, |mut caller: Caller<'_, ()>| -> Result<()> {
        let inner = caller
            .get_export("inner")
            .and_then(|e| e.into_func())
            .expect("inner export");
        let inner = inner.typed::<(), ()>(&caller)?;
        inner.call(&mut caller, ())?;
        Ok(())
    });

    let instance = Instance::new(&mut store, &module, &[h.into()])?;
    let go = instance.get_typed_func::<(), i32>(&mut store, "go")?;

    let result = go.call(&mut store, ())?;
    assert_eq!(result, 1);
    Ok(())
}

yields:

$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/repro`

thread 'main' (1071502) panicked at /home/alex/code/wasmtime2/crates/wasmtime/src/runtime/vm/traphandlers.rs:493:13:
assertion failed: self.unwind.replace(None).is_none()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

LLM-generated description, possibly wrong, is:

EntryStoreContext::enter_wasm saves the new stack_chain, breaking nested embedder→wasm calls during stack switching

Scope: crates/wasmtime/src/runtime/func.rs:1587-1606
(EntryStoreContext::enter_wasm),
crates/wasmtime/src/runtime/func.rs:1614-1629
(EntryStoreContext::exit_wasm),
crates/wasmtime/src/runtime/vm/traphandlers.rs:667-696
(CallThreadState::swap).

Severity: Stack switching is currently 🚧 (work-in-progress) on
x86_64 Cranelift, so this is not a security issue per Wasmtime's tier
policy today. It is, however, a soundness bug that must be fixed
before stack switching can graduate to a stable feature: a guest that
does nothing more than call a normal host import — where the host
happens to call back into the same Store via the embedder API —
silently invalidates the running continuation's cx.stack_chain. The
same bug also affects any future stack-switching-aware fiber-suspension
path through CallThreadState::swap.

Reproducible: Yes. Test program in this folder
(enter-wasm-stack-chain-repro) exercises only public Wasmtime API
(Func::wrap, Caller::get_export, Func::typed, Func::call) plus
the gated wasm_stack_switching(true) config bit. Running with
RECURSIVE=1 produces Ok(0) instead of the expected Ok(1).

Summary

EntryStoreContext::enter_wasm is responsible for saving the
pre-entry values of various VMStoreContext fields so that
exit_wasm (run on Drop) and CallThreadState::swap (run during
async fiber suspension) can put the store back the way it was before
this wasm activation started. Five of the six fields it saves are
read without being written, so the saved value is the pre-entry
value as intended:

last_wasm_exit_pc:        *(*vm_store_context).last_wasm_exit_pc.get(),
last_wasm_exit_trampoline_fp: *(*vm_store_context).last_wasm_exit_trampoline_fp.get(),
last_wasm_entry_fp:       *(*vm_store_context).last_wasm_entry_fp.get(),
last_wasm_entry_sp:       *(*vm_store_context).last_wasm_entry_sp.get(),
last_wasm_entry_trap_handler: *(*vm_store_context).last_wasm_entry_trap_handler.get(),

stack_chain, however, is written before being read:

unsafe {
    let vm_store_context = store.0.vm_store_context();
    let new_stack_chain = VMStackChain::InitialStack(initial_stack_information);
    *vm_store_context.stack_chain.get() = new_stack_chain;            // (1) write

    Self {
        // …
        stack_chain: (*(*vm_store_context).stack_chain.get()).clone(), // (2) read AFTER (1)
        // …
    }
}

So Self.stack_chain ends up holding the new
InitialStack(initial_stack_information) rather than the value
stack_chain had on entry to enter_wasm. Both consumers of
Self.stack_chain (drop-time exit_wasm, async-fiber swap) write
that value back into the live vm_store_context, so any pre-entry
value (a Continuation(...) from being inside a resume body, or
the outer activation's InitialStack(...) for a nested host→wasm
call) is unrecoverable.
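
To make that failure mode concrete, here is a small standalone sketch (plain Rust; the Chain enum and the enter/exit functions are invented stand-ins, not Wasmtime internals) of why saving the value after the write makes the pre-entry chain unrecoverable across a nested activation:

#[derive(Clone, Debug, PartialEq)]
enum Chain {
    InitialStack(u32),
    Continuation(u32),
}

struct Entry {
    saved: Chain,
}

// Models the buggy enter_wasm: install the new chain, then "save" whatever is
// now in the slot, i.e. the new value rather than the previous one.
fn enter(slot: &mut Chain, new: Chain) -> Entry {
    *slot = new;
    Entry { saved: slot.clone() }
}

// Models exit_wasm / swap: write the saved value back on the way out.
fn exit(slot: &mut Chain, e: Entry) {
    *slot = e.saved;
}

fn main() {
    // The store is inside a resume body when the nested host→wasm call happens.
    let mut chain = Chain::Continuation(7);
    let inner = enter(&mut chain, Chain::InitialStack(1));
    exit(&mut chain, inner);
    // The running continuation's chain has been silently replaced.
    assert_eq!(chain, Chain::InitialStack(1));
}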

This is a regression from commit
192f2fcdadfec9d0cf6b58548a85a7307450cbf5 ("Replace setjmp/longjmp
usage in Wasmtime"). Before that commit enter_wasm did the read
first, then the write:

let stack_chain = (*store.0.vm_store_context().stack_chain.get()).clone();   // read FIRST
let new_stack_chain = VMStackChain::InitialStack(initial_stack_information);
*store.0.vm_store_context().stack_chain.get() = new_stack_chain;             // write AFTER

Self {
    // …
    stack_chain,
    // …
}

The setjmp/longjmp removal commit collapsed the load+store into a
single expression but moved the store before the struct literal,
silently flipping the saved value from "previous" to "new".

What the contract is supposed to be

CallThreadState::swap (vm/traphandlers.rs:656-696) is the most
explicit description:

/// Swaps the state in this `CallThreadState`'s `VMStoreContext` with
/// the state in `EntryStoreContext` that was saved when this
/// activation was created.
///
/// This method is used during suspension of a fiber to restore the
/// store back to what it originally was and prepare it to be resumed
/// later on. … That restores a store to just before this activation
/// was called but saves off the fields of this activation to get
/// restored/resumed at a later time.

"Restores a store to just before this activation was called" can only
hold if Self.stack_chain is the value vm_store_context.stack_chain
had before enter_wasm ran. The same is true of exit_wasm,
which mirrors swap for the synchronous path.
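
To make the contract concrete, here is a standalone sketch (plain std::mem::swap on a string, not the actual CallThreadState::swap) of the intended semantics, assuming the saved value really is the pre-entry one:

use std::mem;

fn main() {
    // Live store field vs. the value saved at entry to this activation.
    let mut live_chain = String::from("chain-of-this-activation");
    let mut saved_chain = String::from("pre-entry-chain");

    // Suspension: the store reads as it did just before this activation was
    // called, while the activation's own state is parked in `saved_chain`.
    mem::swap(&mut live_chain, &mut saved_chain);
    assert_eq!(live_chain, "pre-entry-chain");

    // Resumption: swap back and the activation picks up where it left off.
    mem::swap(&mut live_chain, &mut saved_chain);
    assert_eq!(live_chain, "chain-of-this-activation");
}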

The reproducer

src/main.rs here uses only public Wasmtime API. The wasm module
contains:

  • a tag $t, continuation type $ct,
  • an export inner (called by the host via the embedder API),
  • a continuation body $callee which calls a host import $h_import,
    then executes (suspend $t),
  • an export go which runs (resume $ct (on $t $h) (cont.new $ct (ref.func $callee))) and returns 1 if the suspend reaches
    the handler, or 0 if the resume completes "normally".

The host import $h_import is Func::wrap-defined. When
RECURSIVE=1, it looks up the inner export through the
Caller<'_, T> and calls it via the embedder API. That recursive
embedder→wasm call is what triggers the inner
EntryStoreContext::enter_wasm/exit_wasm pair that overwrites the
running continuation's cx.stack_chain.
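
For reference, a minimal sketch of how that host import can be gated on the environment variable (the exact closure body is this report's reconstruction, not the shipped test program; it uses only the public API already shown above):

let recursive = std::env::var_os("RECURSIVE").is_some();
let h = Func::wrap(&mut store, move |mut caller: Caller<'_, ()>| -> Result<()> {
    println!("[host] h called; recursive={recursive}");
    if recursive {
        // The recursive embedder→wasm call that triggers the inner
        // enter_wasm/exit_wasm pair.
        let inner = caller
            .get_export("inner")
            .and_then(|e| e.into_func())
            .expect("inner export")
            .typed::<(), ()>(&caller)?;
        inner.call(&mut caller, ())?;
        println!("[host] inner returned to host");
    }
    Ok(())
});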

Compile with:

cargo build --release

Without recursion (h does nothing):

$ ./target/release/repro
[wasm trace] 10
[wasm trace] 1
[host] h called; recursive=false
[wasm trace] 2
[wasm trace] 12          ← suspend reached the handler
go returned: Ok(1)
OK: suspend correctly delivered to outer handler.

With recursion (h calls inner via embedder API):

$ RECURSIVE=1 ./target/release/repro
[wasm trace] 10
[wasm trace] 1
[host] h called; recursive=true
[wasm trace] 100         ← inner runs
[host] inner returned to host
[wasm trace] 2
[wasm trace] 11          ← resume took the "normal completion" path
go returned: Ok(0)
BUG OBSERVED: go returned Ok(0); …

Trace 12 is unreachable except via the suspend handler; trace 11
is unreachable except via "the resume completed without ever firing
$t". Adding nothing more than a recursive embedder call between the
two flips the outcome from one to the other — exactly the symptom
that follows from cx.stack_chain being clobbered to InitialStack
during inner's enter_wasm/exit_wasm pair, after which
search_handler's is_initial_stack short-circuit returns "no
match" and the (suspend $t) cannot find its outer handler.

(I verified in a temporarily-instrumented build that the inner
enter_wasm reads prev_chain = Continuation(...) and the
matching inner exit_wasm then writes
InitialStack(<inner csi pointer>) back into the live
VMStoreContext. That instrumentation is not part of the test
program shipped here.)

Suggested fix

Restore the pre-regression order and save the previous value of
stack_chain before writing the new one. The minimal patch:

unsafe {
    let vm_store_context = store.0.vm_store_context();
    let prev_stack_chain = mem::replace(
        &mut *vm_store_context.stack_chain.get(),
        VMStackChain::InitialStack(initial_stack_information),
    );

    Self {
        stack_limit,
        last_wasm_exit_pc:        *(*vm_store_context).last_wasm_exit_pc.get(),
        last_wasm_exit_trampoline_fp: *(*vm_store_context).last_wasm_exit_trampoline_fp.get(),
        last_wasm_entry_fp:       *(*vm_store_context).last_wasm_entry_fp.get(),
        last_wasm_entry_sp:       *(*vm_store_context).last_wasm_entry_sp.get(),
        last_wasm_entry_trap_handler: *(*vm_store_context).last_wasm_entry_trap_handler.get(),
        stack_chain:              prev_stack_chain,
        vm_store_context,
    }
}

This mirrors the way stack_limit is already saved earlier in the
same function via mem::replace and matches the pre-regression
semantics.
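
As a minimal standalone demonstration of why the ordering matters (plain Rust, not the actual func.rs code): mem::replace hands back the value the slot held before the write, whereas writing first and reading afterwards can only ever hand back the new value.

use std::mem;

fn main() {
    // Buggy order: write, then "save" — the old value is already gone.
    let mut slot = 1u32;
    slot = 2;
    let saved = slot;          // saves 2, the new value
    slot = saved;              // "restore" leaves the new value in place
    assert_eq!(slot, 2);

    // Fixed order: mem::replace installs the new value and returns the old one.
    let mut slot = 1u32;
    let saved = mem::replace(&mut slot, 2);
    slot = saved;              // restore really does bring back the pre-write value
    assert_eq!(slot, 1);
}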

Why the symptom is "wrong return value" rather than a SIGSEGV here

The dangling-pointer payload of the corrupted InitialStack(...) is
what's theoretically dangerous (Cranelift resume/switch lower
to get_common_stack_information(...).load_state(...), which would
load from the freed stack region). In this test
search_handler's discriminant check sees InitialStack first and
short-circuits to the "tag not handled" path before any
dereference, so we observe the silent control-flow bug rather than a
crash. A reproducer that performs (switch $ct $t) instead of
(suspend $t) after the recursive call would dereference the
dangling pointer (unchecked_get_continuation interprets the
payload as a *mut VMContRef), but at that point we are firmly in
"a guest can crash the host" territory which is not yet a security
issue because stack switching is gated.

Other manifestations

Even with stack switching disabled, the same enter_wasm writes
InitialStack(<this activation's csi>) to stack_chain and saves
that same value as the "previous" one. After a non-stack-switching
nested host→wasm call (outer wasm → host import → inner wasm via the
embedder API), cx.stack_chain is left pointing into the freed
inner invoke_wasm_and_catch_traps stack frame, and the outer
wasm's EntryStoreContext has lost the original pre-entry value.
That dangling pointer is currently only consumed by the backtrace
machinery (vm/traphandlers/backtrace.rs:236) and the iterator
short-circuit for InitialStack happens to avoid loading from it,
so the misbehaviour is silent today — but the saved value is wrong
relative to the documented contract regardless.

For the async path, CallThreadState::swap puts the saved (=new)
value back into cx.stack_chain on every fiber suspension. If a
host then drops the suspended fiber without resuming it, the fiber
stack is freed and cx.stack_chain is left pointing into the freed
fiber stack until the next enter_wasm overwrites it. Same
description as above: silent today, broken contract, will start
crashing as soon as anything between fiber-drop and the next
enter_wasm reads cx.stack_chain non-trivially.
