cranelift: fold `ctz`/`clz` comparisons against zero into direct LSB / sign-bit tests by ggreif · Pull Request #13332 · bytecodealliance/wasmtime

ggreif · 2026-05-09T15:04:24Z

Four mid-end ISLE rules in `opts/icmp.isle` for the boolean-context cases — when `ctz`/`clz` flows into a comparison against zero (the consumer cares only "is it zero?", not the numeric value):

```
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
```
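As a sanity check on the four equivalences above, here is a minimal Rust sketch that models `ctz`/`clz` with the standard `trailing_zeros`/`leading_zeros` methods on `u32`. The function names are illustrative only and do not appear in the PR; the `!=` forms are the logical negations of the `==` forms, so checking the two `==` identities covers all four rules.

```rust
/// Left-hand sides: the comparison as written on the bit count.
fn ctz_is_zero(x: u32) -> bool {
    x.trailing_zeros() == 0
}

fn clz_is_zero(x: u32) -> bool {
    x.leading_zeros() == 0
}

/// Right-hand sides: the direct bit tests the rules rewrite to.
fn lsb_set(x: u32) -> bool {
    (x & 1) != 0
}

fn sign_set(x: u32) -> bool {
    (x as i32) < 0
}

/// True when both `==` identities (and hence their `!=` negations) hold for `x`.
fn rules_hold(x: u32) -> bool {
    ctz_is_zero(x) == lsb_set(x) && clz_is_zero(x) == sign_set(x)
}
```

Note that `x == 0` is not a special case: both counts equal the bit width there, so both are nonzero, matching "LSB clear" and "non-negative".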

The bit-counting instruction is DCE'd; the backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of `TZCNT/BSF/LZCNT/BSR` + `TEST` + `JCC`. That saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT have 3-cycle latency with a false GPR dependency), and proportionally more on JIT-less backends, whose bit-counting paths are typically loops.

Why this matters in practice

The poster-child workload is the Motoko runtime's discriminator test on every `Nat`/`Int` operation:

Compact (scalar) integers: low bit clear — fast path is plain Wasm i32/i64 arithmetic.
Heap-allocated big integers (via libtommath): low bit set (skew tag).

Every arithmetic op begins with this LSB test. The Motoko codegen (`src/codegen/instrList.ml:97-100`) already emits the LSB-test-of-AND-1 pattern as `(ctz X)` — unconditionally, no flag gate — so every moc-compiled wasm running on wasmtime today does TZCNT + TEST + JCC on the hot path of every numeric op. The Rust RTS / GC paths that work on the same tagged pointer scheme see the same pattern.

With these rules in place, cranelift collapses the comparison back to a single `test r, 1` — restoring the original cost of the discriminator and unlocking measurable speed-ups for every Motoko canister on a wasmtime-based IC subnet (and any other wasm that produces this shape).

The clz / sign-bit half exists for the same reason on the rare paths that test sign before dispatching; structurally parallel rewrite, ships in the same patch.

The converse fold on the wasm-byte-savings side is in WebAssembly/binaryen#8562 (LSB→ctz under `-Os`); landing it there together with this in cranelift gives byte savings without cycle cost.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (`ctz(X) == 4` must not trigger — that's a numeric-value test on the count, a different rewrite family).
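To illustrate why the negative case must not fire, the sketch below shows that `ctz(X) == 4` depends on five bits of `X` at once, so no single-bit test can decide it. The mask-and-compare form is only my illustration of what that "different rewrite family" might look like; it is not something this PR implements, and the function names are hypothetical.

```rust
/// The comparison the filetest's negative case uses.
fn ctz_eq_4(x: u32) -> bool {
    x.trailing_zeros() == 4
}

/// An equivalent multi-bit test: bits 0..=3 clear and bit 4 set.
fn low_five_bits_are_10000(x: u32) -> bool {
    (x & 0x1F) == 0x10
}

/// The single-bit test the zero-comparison rules would produce.
fn lsb_clear(x: u32) -> bool {
    (x & 1) == 0
}
```

For example, `0x10` and `0x20` both have a clear LSB, yet only the first has exactly four trailing zeros.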

…gn-bit tests

When the result of a count-trailing/leading-zeros instruction is fed into a comparison against zero (the only thing the consumer cares about is whether the count is zero, not its numeric value), rewrite to test the corresponding bit of X directly:

    ctz(X) == 0 iff LSB of X is set iff (X & 1) != 0
    clz(X) == 0 iff MSB of X is set iff X is signed-negative

The bit-counting instruction can then be DCE'd. Backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC, which saves ~3 cycles of latency on Intel x86_64 per occurrence and removes the false GPR dependency. JIT-less backends benefit even more: their bit-counting paths are typically loops.

Motivated by the converse wasm-side peephole in WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings). With these mid-end rules in place, that fold is cycle-neutral on cranelift JITs even when fed unconditionally.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (ctz(X) == 4 must NOT trigger; that's a numeric-value test on the count, a different rewrite family).

ggreif · 2026-05-09T16:47:11Z

Sketched an extension to also catch the wasm-emitted shape `brif (ireduce.i32 (ctz.i64 X))` (and `clz`), which is what frontends like Motoko's `moc` produce. Wasm's `if` takes an i32 condition, so the i64 LSB test always flows through `ireduce` and `brif` directly, with no `icmp` interposed for the egraph rules in this PR to match.
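The reason the `ireduce` can be looked through is that a 64-bit count lies in `0..=64`, so truncating it to 32 bits never changes whether it is zero. A minimal Rust sketch of that argument, with `as u32` standing in for `ireduce.i32` and hypothetical function names:

```rust
/// Models `brif (ireduce.i32 (ctz.i64 X))`: branch taken when the narrowed count is nonzero.
fn brif_ireduce_ctz(x: u64) -> bool {
    let count: u64 = x.trailing_zeros() as u64; // ctz.i64, always in 0..=64
    let narrowed: u32 = count as u32; // ireduce.i32 keeps the low 32 bits
    narrowed != 0
}

/// Models `brif (ireduce.i32 (clz.i64 X))`.
fn brif_ireduce_clz(x: u64) -> bool {
    let count: u64 = x.leading_zeros() as u64; // clz.i64, always in 0..=64
    let narrowed: u32 = count as u32;
    narrowed != 0
}

/// The direct tests the extension would branch on instead.
fn lsb_clear(x: u64) -> bool {
    (x & 1) == 0
}

fn sign_clear(x: u64) -> bool {
    (x as i64) >= 0
}
```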

Scope creep: the natural place is each backend's `is_nonzero` helper (x64 `inst.isle:3806-3826`, aarch64 `inst.isle:4659-4670`, plus riscv64 and s390x), where rules like

(rule (is_nonzero (ctz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val (RegMemImm.Imm 1)) (CC.Z)))

(rule (is_nonzero (clz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val val) (CC.NS)))

would lower `brif (ctz X)` to `test X, 1; jz` and `brif (clz X)` to `test X, X; jns` directly, plus an `(ireduce _ (ctz/clz _))` variant for the wasm path.
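The condition codes in the two sketched rules can be checked against a small model of x86 `test`, which ANDs its operands and sets ZF when the result is zero and SF to the result's top bit. The `Flags` struct and function names below are hypothetical helpers for illustration, not Cranelift APIs.

```rust
/// The two flags of interest after an x86-64 `test a, b`.
struct Flags {
    zf: bool, // result == 0
    sf: bool, // top bit of result
}

fn test64(a: u64, b: u64) -> Flags {
    let r = a & b;
    Flags { zf: r == 0, sf: (r >> 63) != 0 }
}

/// `brif (ctz X)`: taken when the count is nonzero.
fn brif_ctz(x: u64) -> bool {
    x.trailing_zeros() != 0
}

/// Sketched lowering: `test X, 1; jz` (taken when ZF is set).
fn lowered_ctz(x: u64) -> bool {
    test64(x, 1).zf
}

/// `brif (clz X)`: taken when the count is nonzero.
fn brif_clz(x: u64) -> bool {
    x.leading_zeros() != 0
}

/// Sketched lowering: `test X, X; jns` (taken when SF is clear).
fn lowered_clz(x: u64) -> bool {
    !test64(x, x).sf
}
```

This matches the `CC.Z` and `CC.NS` choices in the rules above: a nonzero `ctz` means the LSB is clear, and a nonzero `clz` means the sign bit is clear.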

That's 4× backend files + filetests, different reviewers per arch, and a different review audience from this egraph PR. Punting on the amendment and filing a separate follow-up instead.

ggreif · 2026-05-09T16:50:16Z

Concrete real-world workload for the clz boolean-context fold: Motoko's classical-persistence backend already emits the bare `i32.clz; if` shape for Int <-> 0 sign tests (e.g. `Prim.abs`). The compiler-side peepholes in `motoko/src/codegen/instrList.ml` (`shrU 31; if` -> `clz; if(swap)` and `and 0x80000000; if` -> `clz; if(swap)`) generate this directly because their target is i32 wasm without a wrap.
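A small Rust sketch of why those two peepholes preserve behavior, reading `if(swap)` as "the same `if` with its two arms exchanged" (that reading is my assumption; the Motoko source is not quoted here). `Block::A` and `Block::B` are hypothetical labels for the original then-arm and else-arm.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Block {
    A, // original then-arm (sign bit set)
    B, // original else-arm (sign bit clear)
}

/// Original shape 1: `shrU 31; if A else B`.
fn original_shr(x: u32) -> Block {
    if (x >> 31) != 0 { Block::A } else { Block::B }
}

/// Original shape 2: `and 0x80000000; if A else B`.
fn original_and(x: u32) -> Block {
    if (x & 0x8000_0000) != 0 { Block::A } else { Block::B }
}

/// Rewritten shape: `clz; if B else A` (arms swapped, since clz is nonzero when the sign bit is clear).
fn rewritten_clz(x: u32) -> Block {
    if x.leading_zeros() != 0 { Block::B } else { Block::A }
}
```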

So the JIT-side fold here is the natural meeting point for classical-Motoko output: `clz` directly into `brif`, no `icmp`, no `ireduce`. The rules in this PR don't yet catch that shape (it's `brif (clz X)`, not `icmp eq (clz X) 0`), but the equivalent backend lowering (per the previous comment) would close that gap end-to-end.

ggreif requested a review from a team as a code owner May 9, 2026 15:04

ggreif requested review from cfallin and removed request for a team May 9, 2026 15:04

ggreif marked this pull request as draft May 9, 2026 15:08

ggreif force-pushed the gabor/clz-ctz-bool-fold branch from 1734a52 to 30531ed May 9, 2026 15:14

ggreif changed the title ~~cranelift: fold ctz/clz comparisons against saturation values into direct LSB / null tests~~ cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests May 9, 2026

ggreif marked this pull request as ready for review May 9, 2026 16:48

ggreif changed the title ~~cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests~~ cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests May 9, 2026

ggreif wants to merge 1 commit into bytecodealliance:main from ggreif:gabor/clz-ctz-bool-fold
