cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests #13332

Open
ggreif wants to merge 1 commit into bytecodealliance:main from ggreif:gabor/clz-ctz-bool-fold

ggreif (Contributor) commented May 9, 2026

This PR adds four mid-end ISLE rules in `opts/icmp.isle` for the boolean-context cases, where the result of `ctz`/`clz` flows into a comparison against zero (the consumer cares only whether the count is zero, not its numeric value):

```
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
```
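
As a quick sanity check of the four equivalences above, here is a small Rust sketch that compares both sides on boundary bit patterns. It is not part of the patch; the function names are made up for illustration, and it relies on Rust's `trailing_zeros`/`leading_zeros` returning the bit width for a zero input, which is also how Wasm and CLIF define `ctz`/`clz`.

```rust
/// Both sides of the four rewrites for a 32-bit value; true when every pair agrees.
fn folds_agree_u32(x: u32) -> bool {
    let ctz = x.trailing_zeros(); // 32 when x == 0
    let clz = x.leading_zeros(); // 32 when x == 0
    let signed = x as i32;
    (ctz == 0) == ((x & 1) != 0)
        && (ctz != 0) == ((x & 1) == 0)
        && (clz == 0) == (signed < 0)
        && (clz != 0) == (signed >= 0)
}

/// Same check for a 64-bit value.
fn folds_agree_u64(x: u64) -> bool {
    let ctz = x.trailing_zeros(); // 64 when x == 0
    let clz = x.leading_zeros(); // 64 when x == 0
    let signed = x as i64;
    (ctz == 0) == ((x & 1) != 0)
        && (ctz != 0) == ((x & 1) == 0)
        && (clz == 0) == (signed < 0)
        && (clz != 0) == (signed >= 0)
}

/// Checks zero, all-ones, and patterns around every single-bit value.
fn check_folds() -> bool {
    let mut ok = folds_agree_u32(0)
        && folds_agree_u64(0)
        && folds_agree_u32(u32::MAX)
        && folds_agree_u64(u64::MAX);
    for bit in 0..64u32 {
        let b = 1u64 << bit;
        for x in [b, b.wrapping_sub(1), !b, b | 1] {
            ok &= folds_agree_u64(x);
            ok &= folds_agree_u32(x as u32); // low 32 bits as an i32-width sample
        }
    }
    ok
}
```

The zero input is the case worth looking at: the count is the full bit width there, so `ctz(0) != 0` and `clz(0) != 0` both hold, matching "LSB clear" and "non-negative".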

The bit-counting instruction is then dead and gets DCE'd; the backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of `TZCNT/BSF/LZCNT/BSR` + `TEST` + `JCC`. This saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT have 3-cycle latency with a false GPR dependency), and proportionally more on JIT-less backends, whose bit-counting paths are typically loops.

Why this matters in practice

The poster-child workload is the Motoko runtime's discriminator test on every `Nat`/`Int` operation:

  • Compact (scalar) integers: low bit clear — fast path is plain Wasm i32/i64 arithmetic.
  • Heap-allocated big integers (via libtommath): low bit set (skew tag).

Every arithmetic op begins with this LSB test. The Motoko codegen (`src/codegen/instrList.ml:97-100`) already emits the LSB-test-of-AND-1 pattern as `(ctz X)` — unconditionally, no flag gate — so every moc-compiled wasm running on wasmtime today does TZCNT + TEST + JCC on the hot path of every numeric op. The Rust RTS / GC paths that work on the same tagged pointer scheme see the same pattern.
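
To make the discriminator concrete, the following Rust sketch models the tag check in both formulations. It assumes only what is stated above (low bit clear means compact, low bit set means heap-allocated); `Repr`, `classify_via_ctz`, and `classify_via_lsb` are hypothetical names, not Motoko RTS or `moc` identifiers.

```rust
/// Hypothetical model of the two representations described above.
#[derive(Debug, PartialEq)]
enum Repr {
    Compact, // low bit clear: scalar value
    Heap,    // low bit set: skewed pointer to a big integer
}

/// The shape the PR description says moc emits: test the trailing-zero count.
fn classify_via_ctz(word: u32) -> Repr {
    if word.trailing_zeros() != 0 {
        Repr::Compact
    } else {
        Repr::Heap
    }
}

/// The shape the new rules fold it back into: test the low bit directly.
fn classify_via_lsb(word: u32) -> Repr {
    if (word & 1) == 0 {
        Repr::Compact
    } else {
        Repr::Heap
    }
}
```

Both functions return the same answer for every word, including zero, which is why the fold is safe to apply unconditionally.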

With these rules in place, cranelift collapses the comparison back to a single `test r, 1` — restoring the original cost of the discriminator and unlocking measurable speed-ups for every Motoko canister on a wasmtime-based IC subnet (and any other wasm that produces this shape).

The clz / sign-bit half exists for the same reason on the rare paths that test sign before dispatching; it is a structurally parallel rewrite, so it ships in the same patch.

The converse fold on the wasm-byte-savings side is in WebAssembly/binaryen#8562 (LSB→ctz under `-Os`); landing it there together with this in cranelift gives byte savings without cycle cost.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (`ctz(X) == 4` must not trigger — that's a numeric-value test on the count, a different rewrite family).
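
To illustrate why the negative case is a different rewrite family, here is a Rust sketch of what a comparison against a non-zero count would need: `ctz(X) == k` depends on bits 0 through k, not on the LSB alone. The helper names are hypothetical, and nothing like this is part of the patch.

```rust
/// ctz(x) == k holds exactly when bit k is set and every lower bit is clear.
/// Requires k < 32.
fn ctz_equals(x: u32, k: u32) -> bool {
    // Mask covering bits 0..=k; built in u64 so that k == 31 does not overflow.
    let mask = ((1u64 << (k + 1)) - 1) as u32;
    (x & mask) == (1u32 << k)
}

/// Cross-checks the mask form against `trailing_zeros` for all x in 0..=limit.
fn mask_form_matches_ctz(limit: u32) -> bool {
    (0..=limit).all(|x| (0..32).all(|k| ctz_equals(x, k) == (x.trailing_zeros() == k)))
}
```

For example, 16 and 2 both have a clear LSB, yet only 16 satisfies `ctz(X) == 4`, so a single-bit test cannot stand in for that comparison.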

@ggreif ggreif requested a review from a team as a code owner May 9, 2026 15:04
@ggreif ggreif requested review from cfallin and removed request for a team May 9, 2026 15:04
@ggreif ggreif marked this pull request as draft May 9, 2026 15:08
…gn-bit tests

When the result of a count-trailing/leading-zeros instruction is fed
into a comparison against zero (the only thing the consumer cares
about is whether the count is zero, not its numeric value), rewrite
to test the corresponding bit of X directly:

  ctz(X) == 0   iff  LSB of X is set     iff  (X & 1) != 0
  clz(X) == 0   iff  MSB of X is set     iff  X is signed-negative

The bit-counting instruction can then be DCE'd. Backend emits a
single-cycle `test reg, imm` (LSB case) or `test reg, reg; js`
(sign case) instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC — saves
~3 cycles of latency on Intel x86_64 per occurrence and removes
the false GPR dependency. JIT-less backends benefit even more:
their bit-counting paths are typically loops.

Motivated by the converse wasm-side peephole in
WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings).
With these mid-end rules in place, that fold is cycle-neutral on
cranelift JITs even when fed unconditionally.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a
negative case (ctz(X) == 4 must NOT trigger — that's a numeric-value
test on the count, a different rewrite family).
@ggreif ggreif force-pushed the gabor/clz-ctz-bool-fold branch from 1734a52 to 30531ed Compare May 9, 2026 15:14
@ggreif ggreif changed the title from "cranelift: fold ctz/clz comparisons against saturation values into direct LSB / null tests" to "cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests" May 9, 2026
ggreif (Contributor, Author) commented May 9, 2026

Sketched an extension to also catch the wasm-emitted shape `brif (ireduce.i32 (ctz.i64 X))` (and `clz`), which is what frontends like Motoko's `moc` produce: wasm's `if` takes an i32 condition, so the i64 LSB test always flows through `ireduce` and into `brif` directly, with no `icmp` interposed for the egraph rules in this PR to match.

Scope creep: the natural place is each backend's `is_nonzero` helper (x64 `inst.isle:3806-3826`, aarch64 `inst.isle:4659-4670`, plus riscv64 and s390x), where rules like

```
(rule (is_nonzero (ctz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val (RegMemImm.Imm 1)) (CC.Z)))

(rule (is_nonzero (clz val @ (value_type (ty_32_or_64 ty))))
  (CondResult.CC (x64_test ty val val) (CC.NS)))
```

would lower `brif (ctz X)` to `test X, 1; jz` and `brif (clz X)` to `test X, X; jns` directly, plus an `(ireduce _ (ctz/clz _))` variant for the wasm path.
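
The condition codes in the sketched rules, and the soundness of the `ireduce` variant, can be checked with a small Rust model. It assumes `brif` takes its branch on a non-zero value and that `ireduce.i32` keeps the low 32 bits; the function names are illustrative only.

```rust
/// Model of `brif (ireduce.i32 (ctz.i64 x))`: is the branch taken?
fn brif_ireduce_ctz64(x: u64) -> bool {
    let count = x.trailing_zeros() as u64; // ctz.i64 result, in 0..=64
    let reduced = count as u32; // ireduce.i32: keep the low 32 bits
    reduced != 0 // brif branches on non-zero
}

/// Model of `brif (ireduce.i32 (clz.i64 x))`.
fn brif_ireduce_clz64(x: u64) -> bool {
    let count = x.leading_zeros() as u64; // clz.i64 result, in 0..=64
    let reduced = count as u32;
    reduced != 0
}

/// What `test x, 1; jz` decides: branch when the LSB is clear.
fn lsb_clear(x: u64) -> bool {
    (x & 1) == 0
}

/// What `test x, x; jns` decides: branch when the sign bit is clear.
fn sign_clear(x: u64) -> bool {
    (x as i64) >= 0
}

/// Compares the models on patterns around every single-bit value.
fn lowering_agrees() -> bool {
    let mut ok = true;
    for bit in 0..64u32 {
        let b = 1u64 << bit;
        for x in [0, u64::MAX, b, b.wrapping_sub(1), !b, b | 1] {
            ok &= brif_ireduce_ctz64(x) == lsb_clear(x);
            ok &= brif_ireduce_clz64(x) == sign_clear(x);
        }
    }
    ok
}
```

Because the count never exceeds 64, truncating it to 32 bits cannot turn a non-zero count into zero, so the `ireduce` can be looked through without changing which way the branch goes.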

That's 4× backend files + filetests, different reviewers per arch, and a different review audience from this egraph PR. Punting on the amendment and filing a separate follow-up instead.

@ggreif ggreif marked this pull request as ready for review May 9, 2026 16:48
ggreif (Contributor, Author) commented May 9, 2026

Concrete real-world workload for the clz boolean-context fold: Motoko's classical-persistence backend already emits the bare `i32.clz; if` shape for `Int` <-> 0 sign tests (e.g. `Prim.abs`). The compiler-side peephole (`shrU 31; if` -> `clz; if(swap)` and `and 0x80000000; if` -> `clz; if(swap)` in `motoko/src/codegen/instrList.ml`) generates this directly because its target is i32 wasm without a wrap.

So the JIT-side fold here is the natural meeting point for classical-Motoko output: `clz` directly into `brif`, no `icmp`, no `ireduce`. The rules in this PR don't yet catch that shape (it's `brif (clz X)`, not `icmp eq (clz X) 0`), but the equivalent backend lowering (per the previous comment) would close that gap end-to-end.
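
For reference, here is a Rust model of the two Motoko-side peepholes mentioned above. It reads `if(swap)` as "the same `if` with its two arms exchanged", which is an interpretation of the shorthand rather than something taken from `instrList.ml`; all names are illustrative.

```rust
/// `shrU 31; if A else B`: A runs when the sign bit is set.
fn via_shr<T>(x: u32, a: T, b: T) -> T {
    if (x >> 31) != 0 { a } else { b }
}

/// `and 0x80000000; if A else B`: same condition, expressed with a mask.
fn via_and<T>(x: u32, a: T, b: T) -> T {
    if (x & 0x8000_0000) != 0 { a } else { b }
}

/// `clz; if B else A`: the arms are swapped because clz is non-zero
/// exactly when the sign bit is clear.
fn via_clz_swapped<T>(x: u32, a: T, b: T) -> T {
    if x.leading_zeros() != 0 { b } else { a }
}
```

All three pick the same arm for every input, so the `clz; if` form is a pure sign test and needs no actual bit count at run time.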

@ggreif ggreif changed the title from "cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests" to "cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests" May 9, 2026