
Stack-switching: Corruption of CallThreadState when host function called on continuation #13322

@alexcrichton

Description


This input:

use wasmtime::*;

fn main() -> Result<()> {
    let mut config = Config::new();
    config.wasm_stack_switching(true);
    config.wasm_function_references(true);
    config.wasm_exceptions(true);
    let engine = Engine::new(&config)?;
    let mut store = Store::new(&engine, ());

    let module = Module::new(
        &engine,
        r#"
        (module
          (type $ft (func))
          (tag $t (type $ft))
          (type $ct (cont $ft))

          (import "host" "h" (func $h_import))

          (func (export "inner"))

          (func $callee
            (call $h_import)
            (suspend $t)
          )
          (elem declare func $callee)

          (func (export "go") (result i32)
            (block $h (result (ref null $ct))
              (resume $ct (on $t $h) (cont.new $ct (ref.func $callee)))
              (return (i32.const 0))
            )
            (drop)
            (i32.const 1)
          )
        )
    "#,
    )?;

    let h = Func::wrap(&mut store, |mut caller: Caller<'_, ()>| -> Result<()> {
        let inner = caller
            .get_export("inner")
            .and_then(|e| e.into_func())
            .expect("inner export");
        let inner = inner.typed::<(), ()>(&caller)?;
        inner.call(&mut caller, ())?;
        Ok(())
    });

    let instance = Instance::new(&mut store, &module, &[h.into()])?;
    let go = instance.get_typed_func::<(), i32>(&mut store, "go")?;

    let result = go.call(&mut store, ())?;
    assert_eq!(result, 1);
    Ok(())
}

yields:

$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/repro`

thread 'main' (1071502) panicked at /home/alex/code/wasmtime2/crates/wasmtime/src/runtime/vm/traphandlers.rs:493:13:
assertion failed: self.unwind.replace(None).is_none()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

LLM-generated description, possibly wrong, is:

EntryStoreContext::enter_wasm saves the new stack_chain, breaking nested embedder→wasm calls during stack switching

Scope: crates/wasmtime/src/runtime/func.rs:1587-1606
(EntryStoreContext::enter_wasm),
crates/wasmtime/src/runtime/func.rs:1614-1629
(EntryStoreContext::exit_wasm),
crates/wasmtime/src/runtime/vm/traphandlers.rs:667-696
(CallThreadState::swap).

Severity: Stack switching is currently 🚧 (work-in-progress) on
x86_64 Cranelift, so this is not a security issue per Wasmtime's tier
policy today. It is, however, a soundness bug that must be fixed
before stack switching can graduate to a stable feature: a guest that
does nothing more than call a normal host import — where the host
happens to call back into the same Store via the embedder API —
silently invalidates the running continuation's cx.stack_chain. The
same bug also affects any future stack-switching-aware fiber-suspension
path through CallThreadState::swap.

Reproducible: Yes. Test program in this folder
(enter-wasm-stack-chain-repro) exercises only public Wasmtime API
(Func::wrap, Caller::get_export, Func::typed, Func::call) plus
the gated wasm_stack_switching(true) config bit. Running with
RECURSIVE=1 produces Ok(0) instead of the expected Ok(1).

Summary

EntryStoreContext::enter_wasm is responsible for saving the
pre-entry values of various VMStoreContext fields so that
exit_wasm (run on Drop) and CallThreadState::swap (run during
async fiber suspension) can put the store back the way it was before
this wasm activation started. Five of the six fields it saves are
read without being written, so the saved value is the pre-entry
value as intended:

last_wasm_exit_pc:        *(*vm_store_context).last_wasm_exit_pc.get(),
last_wasm_exit_trampoline_fp: *(*vm_store_context).last_wasm_exit_trampoline_fp.get(),
last_wasm_entry_fp:       *(*vm_store_context).last_wasm_entry_fp.get(),
last_wasm_entry_sp:       *(*vm_store_context).last_wasm_entry_sp.get(),
last_wasm_entry_trap_handler: *(*vm_store_context).last_wasm_entry_trap_handler.get(),

stack_chain, however, is written before being read:

unsafe {
    let vm_store_context = store.0.vm_store_context();
    let new_stack_chain = VMStackChain::InitialStack(initial_stack_information);
    *vm_store_context.stack_chain.get() = new_stack_chain;            // (1) write

    Self {
        // …
        stack_chain: (*(*vm_store_context).stack_chain.get()).clone(), // (2) read AFTER (1)
        // …
    }
}

So Self.stack_chain ends up holding the new
InitialStack(initial_stack_information) rather than the value
stack_chain had on entry to enter_wasm. Both consumers of
Self.stack_chain (drop-time exit_wasm, async-fiber swap) write
that value back into the live vm_store_context, so any pre-entry
value (a Continuation(...) from being inside a resume body, or
the outer activation's InitialStack(...) for a nested host→wasm
call) is unrecoverable.
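
To make that failure mode concrete, here is a small standalone sketch (plain Rust; the Chain enum and the enter/exit functions are invented stand-ins, not Wasmtime internals) of why saving the value after the write makes the pre-entry chain unrecoverable across a nested activation:

#[derive(Clone, Debug, PartialEq)]
enum Chain {
    InitialStack(u32),
    Continuation(u32),
}

struct Entry {
    saved: Chain,
}

// Models the buggy enter_wasm: install the new chain, then "save" whatever is
// now in the slot, i.e. the new value rather than the previous one.
fn enter(slot: &mut Chain, new: Chain) -> Entry {
    *slot = new;
    Entry { saved: slot.clone() }
}

// Models exit_wasm / swap: write the saved value back on the way out.
fn exit(slot: &mut Chain, e: Entry) {
    *slot = e.saved;
}

fn main() {
    // The store is inside a resume body when the nested host→wasm call happens.
    let mut chain = Chain::Continuation(7);
    let inner = enter(&mut chain, Chain::InitialStack(1));
    exit(&mut chain, inner);
    // The running continuation's chain has been silently replaced.
    assert_eq!(chain, Chain::InitialStack(1));
}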

This is a regression from commit
192f2fcdadfec9d0cf6b58548a85a7307450cbf5 ("Replace setjmp/longjmp
usage in Wasmtime"). Before that commit enter_wasm did the read
first, then the write:

let stack_chain = (*store.0.vm_store_context().stack_chain.get()).clone();   // read FIRST
let new_stack_chain = VMStackChain::InitialStack(initial_stack_information);
*store.0.vm_store_context().stack_chain.get() = new_stack_chain;             // write AFTER

Self {
    // …
    stack_chain,
    // …
}

The setjmp/longjmp removal commit collapsed the load+store into a
single expression but moved the store before the struct literal,
silently flipping the saved value from "previous" to "new".

What the contract is supposed to be

CallThreadState::swap (vm/traphandlers.rs:656-696) is the most
explicit description:

/// Swaps the state in this `CallThreadState`'s `VMStoreContext` with
/// the state in `EntryStoreContext` that was saved when this
/// activation was created.
///
/// This method is used during suspension of a fiber to restore the
/// store back to what it originally was and prepare it to be resumed
/// later on. … That restores a store to just before this activation
/// was called but saves off the fields of this activation to get
/// restored/resumed at a later time.

"Restores a store to just before this activation was called" can only
hold if Self.stack_chain is the value vm_store_context.stack_chain
had before enter_wasm ran. The same is true of exit_wasm,
which mirrors swap for the synchronous path.
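
To make the contract concrete, here is a standalone sketch (plain std::mem::swap on a string, not the actual CallThreadState::swap) of the intended semantics, assuming the saved value really is the pre-entry one:

use std::mem;

fn main() {
    // Live store field vs. the value saved at entry to this activation.
    let mut live_chain = String::from("chain-of-this-activation");
    let mut saved_chain = String::from("pre-entry-chain");

    // Suspension: the store reads as it did just before this activation was
    // called, while the activation's own state is parked in `saved_chain`.
    mem::swap(&mut live_chain, &mut saved_chain);
    assert_eq!(live_chain, "pre-entry-chain");

    // Resumption: swap back and the activation picks up where it left off.
    mem::swap(&mut live_chain, &mut saved_chain);
    assert_eq!(live_chain, "chain-of-this-activation");
}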

The reproducer

src/main.rs here uses only public Wasmtime API. The wasm module
contains:

  • a tag $t, continuation type $ct,
  • an export inner (called by the host via the embedder API),
  • a continuation body $callee which calls a host import $h_import,
    then executes (suspend $t),
  • an export go which runs (resume $ct (on $t $h) (cont.new $ct (ref.func $callee))) and returns 1 if the suspend reaches
    the handler, or 0 if the resume completes "normally".

The host import $h_import is Func::wrap-defined. When
RECURSIVE=1, it looks up the inner export through the
Caller<'_, T> and calls it via the embedder API. That recursive
embedder→wasm call is what triggers the inner
EntryStoreContext::enter_wasm/exit_wasm pair that overwrites the
running continuation's cx.stack_chain.
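
For reference, a minimal sketch of how that host import can be gated on the environment variable (the exact closure body is this report's reconstruction, not the shipped test program; it uses only the public API already shown above):

let recursive = std::env::var_os("RECURSIVE").is_some();
let h = Func::wrap(&mut store, move |mut caller: Caller<'_, ()>| -> Result<()> {
    println!("[host] h called; recursive={recursive}");
    if recursive {
        // The recursive embedder→wasm call that triggers the inner
        // enter_wasm/exit_wasm pair.
        let inner = caller
            .get_export("inner")
            .and_then(|e| e.into_func())
            .expect("inner export")
            .typed::<(), ()>(&caller)?;
        inner.call(&mut caller, ())?;
        println!("[host] inner returned to host");
    }
    Ok(())
});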

Compile with:

cargo build --release

Without recursion (h does nothing):

$ ./target/release/repro
[wasm trace] 10
[wasm trace] 1
[host] h called; recursive=false
[wasm trace] 2
[wasm trace] 12          ← suspend reached the handler
go returned: Ok(1)
OK: suspend correctly delivered to outer handler.

With recursion (h calls inner via embedder API):

$ RECURSIVE=1 ./target/release/repro
[wasm trace] 10
[wasm trace] 1
[host] h called; recursive=true
[wasm trace] 100         ← inner runs
[host] inner returned to host
[wasm trace] 2
[wasm trace] 11          ← resume took the "normal completion" path
go returned: Ok(0)
BUG OBSERVED: go returned Ok(0); …

Trace 12 is unreachable except via the suspend handler; trace 11
is unreachable except via "the resume completed without ever firing
$t". Adding nothing more than a recursive embedder call between the
two flips the outcome from one to the other — exactly the symptom
that follows from cx.stack_chain being clobbered to InitialStack
during inner's enter_wasm/exit_wasm pair, after which
search_handler's is_initial_stack short-circuit returns "no
match" and the (suspend $t) cannot find its outer handler.

(I verified in a temporarily-instrumented build that the inner
enter_wasm reads prev_chain = Continuation(...) and the
matching inner exit_wasm then writes
InitialStack(<inner csi pointer>) back into the live
VMStoreContext. That instrumentation is not part of the test
program shipped here.)

Suggested fix

Restore the pre-regression order and save the previous value of
stack_chain before writing the new one. The minimal patch:

unsafe {
    let vm_store_context = store.0.vm_store_context();
    let prev_stack_chain = mem::replace(
        &mut *vm_store_context.stack_chain.get(),
        VMStackChain::InitialStack(initial_stack_information),
    );

    Self {
        stack_limit,
        last_wasm_exit_pc:        *(*vm_store_context).last_wasm_exit_pc.get(),
        last_wasm_exit_trampoline_fp: *(*vm_store_context).last_wasm_exit_trampoline_fp.get(),
        last_wasm_entry_fp:       *(*vm_store_context).last_wasm_entry_fp.get(),
        last_wasm_entry_sp:       *(*vm_store_context).last_wasm_entry_sp.get(),
        last_wasm_entry_trap_handler: *(*vm_store_context).last_wasm_entry_trap_handler.get(),
        stack_chain:              prev_stack_chain,
        vm_store_context,
    }
}

This mirrors the way stack_limit is already saved earlier in the
same function via mem::replace and matches the pre-regression
semantics.
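
As a minimal standalone demonstration of why the ordering matters (plain Rust, not the actual func.rs code): mem::replace hands back the value the slot held before the write, whereas writing first and reading afterwards can only ever hand back the new value.

use std::mem;

fn main() {
    // Buggy order: write, then "save" — the old value is already gone.
    let mut slot = 1u32;
    slot = 2;
    let saved = slot;          // saves 2, the new value
    slot = saved;              // "restore" leaves the new value in place
    assert_eq!(slot, 2);

    // Fixed order: mem::replace installs the new value and returns the old one.
    let mut slot = 1u32;
    let saved = mem::replace(&mut slot, 2);
    slot = saved;              // restore really does bring back the pre-write value
    assert_eq!(slot, 1);
}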

Why the symptom is "wrong return value" rather than a SIGSEGV here

The dangling-pointer payload of the corrupted InitialStack(...) is
what's theoretically dangerous (Cranelift resume/switch lower
to get_common_stack_information(...).load_state(...), which would
load from the freed stack region). In this test
search_handler's discriminant check sees InitialStack first and
short-circuits to the "tag not handled" path before any
dereference, so we observe the silent control-flow bug rather than a
crash. A reproducer that performs (switch $ct $t) instead of
(suspend $t) after the recursive call would dereference the
dangling pointer (unchecked_get_continuation interprets the
payload as a *mut VMContRef), but at that point we are firmly in
"a guest can crash the host" territory which is not yet a security
issue because stack switching is gated.

Other manifestations

Even with stack switching disabled, the same enter_wasm writes
InitialStack(<this activation's csi>) to stack_chain and saves
that same value as the "previous" one. After a non-stack-switching
nested host→wasm call (outer wasm → host import → inner wasm via the
embedder API), cx.stack_chain is left pointing into the freed
inner invoke_wasm_and_catch_traps stack frame, and the outer
wasm's EntryStoreContext has lost the original pre-entry value.
That dangling pointer is currently only consumed by the backtrace
machinery (vm/traphandlers/backtrace.rs:236) and the iterator
short-circuit for InitialStack happens to avoid loading from it,
so the misbehaviour is silent today — but the saved value is wrong
relative to the documented contract regardless.

For the async path, CallThreadState::swap puts the saved (=new)
value back into cx.stack_chain on every fiber suspension. If a
host then drops the suspended fiber without resuming it, the fiber
stack is freed and cx.stack_chain is left pointing into the freed
fiber stack until the next enter_wasm overwrites it. Same
description as above: silent today, broken contract, will start
crashing as soon as anything between fiber-drop and the next
enter_wasm reads cx.stack_chain non-trivially.
