cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests #13332
ggreif wants to merge 1 commit into bytecodealliance:main
…gn-bit tests

When the result of a count-trailing/leading-zeros instruction is fed into a comparison against zero (the only thing the consumer cares about is whether the count is zero, not its numeric value), rewrite to test the corresponding bit of X directly:

    ctz(X) == 0  iff  LSB of X is set  iff  (X & 1) != 0
    clz(X) == 0  iff  MSB of X is set  iff  X is signed-negative

The bit-counting instruction can then be DCE'd. Backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC. This saves ~3 cycles of latency on Intel x86_64 per occurrence and removes the false GPR dependency. JIT-less backends benefit even more: their bit-counting paths are typically loops.

Motivated by the converse wasm-side peephole in WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings). With these mid-end rules in place, that fold is cycle-neutral on cranelift JITs even when fed unconditionally.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (ctz(X) == 4 must NOT trigger: that's a numeric-value test on the count, a different rewrite family).
1734a52 to 30531ed
Sketched an extension to also catch the wasm-emitted shape

Scope creep: the natural place is each backend's would lower

That's 4× backend files + filetests, different reviewers per arch, and a different review audience from this egraph PR. Punting on the amendment and filing a separate follow-up instead.
Concrete real-world workload for the So the JIT-side fold here is the natural meeting point for classical-Motoko output: clz directly into brif, no icmp, no ireduce. The rules in this PR don't yet catch that shape (it's
Four mid-end ISLE rules in `opts/icmp.isle` for the boolean-context cases — when `ctz`/`clz` flows into a comparison against zero (the consumer cares only "is it zero?", not the numeric value):
```
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
```
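The four identities above can be sanity-checked outside the compiler. The following is a minimal sketch in plain Rust, not code from this PR: the function names are hypothetical, and `u32::trailing_zeros` / `u32::leading_zeros` stand in for the CLIF `ctz` / `clz` instructions (both return the bit width, 32, for an input of zero).

```rust
/// Bit-counting form: ctz(x) == 0.
fn ctz_is_zero(x: u32) -> bool {
    x.trailing_zeros() == 0
}

/// Folded form: the LSB of x is set.
fn lsb_set(x: u32) -> bool {
    (x & 1) != 0
}

/// Bit-counting form: clz(x) == 0.
fn clz_is_zero(x: u32) -> bool {
    x.leading_zeros() == 0
}

/// Folded form: x, reinterpreted as signed, is negative (MSB set).
fn sign_set(x: u32) -> bool {
    (x as i32) < 0
}

/// Checks all four rewrites (eq and ne forms of ctz and clz) for one input.
fn identities_hold(x: u32) -> bool {
    ctz_is_zero(x) == lsb_set(x)
        && (x.trailing_zeros() != 0) == ((x & 1) == 0)
        && clz_is_zero(x) == sign_set(x)
        && (x.leading_zeros() != 0) == ((x as i32) >= 0)
}
```

Calling `identities_hold` on edge inputs such as `0`, `1`, `0x7FFF_FFFF`, `0x8000_0000`, and `0xFFFF_FFFF` exercises the zero input and both values of the LSB and the sign bit.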
Once the comparison is rewritten, the bit-counting instruction is dead and gets DCE'd. The backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of `TZCNT/BSF/LZCNT/BSR` + `TEST` + `JCC`. That saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT have 3-cycle latency and a false GPR dependency), and proportionally more on JIT-less backends, whose bit-counting paths are typically loops.
Why this matters in practice
The poster-child workload is the Motoko runtime's discriminator test on every `Nat`/`Int` operation:
Every arithmetic op begins with this LSB test. The Motoko codegen (`src/codegen/instrList.ml:97-100`) already emits the LSB-test-of-AND-1 pattern as `(ctz X)` — unconditionally, no flag gate — so every moc-compiled wasm running on wasmtime today does TZCNT + TEST + JCC on the hot path of every numeric op. The Rust RTS / GC paths that work on the same tagged pointer scheme see the same pattern.
With these rules in place, cranelift collapses the comparison back to a single `test r, 1` — restoring the original cost of the discriminator and unlocking measurable speed-ups for every Motoko canister on a wasmtime-based IC subnet (and any other wasm that produces this shape).
The clz / sign-bit half exists for the same reason, on the rarer paths that test the sign before dispatching. It is a structurally parallel rewrite and ships in the same patch.
The converse fold on the wasm-byte-savings side is in WebAssembly/binaryen#8562 (LSB→ctz under `-Os`); landing it there together with this in cranelift gives byte savings without cycle cost.
Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (`ctz(X) == 4` must not trigger — that's a numeric-value test on the count, a different rewrite family).
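To illustrate why the negative case must not fold to a single-bit test, here is a small Rust sketch. It is an illustration only, not part of the filetest, and the function names are hypothetical. Comparing the count against 4 constrains five low bits of X (bits 0 to 3 clear, bit 4 set), so an LSB test alone cannot decide it.

```rust
/// The single-bit test that the rules in this PR produce for `ctz(X) != 0` style checks.
fn lsb_set(x: u32) -> bool {
    (x & 1) != 0
}

/// Numeric-value test on the count: ctz(x) == 4.
fn ctz_is_four(x: u32) -> bool {
    x.trailing_zeros() == 4
}

/// Equivalent mask form: bits 0..=3 are clear and bit 4 is set.
fn low_bits_match(x: u32) -> bool {
    (x & 0x1F) == 0x10
}
```

Both `16` and `32` have a clear LSB, yet only `16` satisfies `ctz_is_four`, so any rewrite of this comparison has to look at a mask of low bits, not one bit.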