Skip to content

cranelift(x64): lower bare ctz/clz boolean tests via test+CC#13334

Closed
ggreif wants to merge 1 commit into
bytecodealliance:mainfrom
ggreif:gabor/ctz-clz-brif-lowering
Closed

cranelift(x64): lower bare ctz/clz boolean tests via test+CC#13334
ggreif wants to merge 1 commit into
bytecodealliance:mainfrom
ggreif:gabor/ctz-clz-brif-lowering

Conversation

@ggreif
Copy link
Copy Markdown
Contributor

@ggreif ggreif commented May 11, 2026

Summary

Follow-up to #13332. That PR added egraph rules collapsing (eq (ctz X) 0) / (ne (ctz X) 0) / (eq (clz X) 0) / (ne (clz X) 0) to direct LSB / sign-bit tests — but only when the comparison is mediated by an explicit icmp. The wasm front-end translates wasm if (ctz X) to brif (ireduce.i32 (ctz.i64 X)) directly (no icmp), so the egraph rules don't fire on the wasm-natural shape.

This PR closes the gap by specialising is_nonzero in the x64 backend — the helper that all brif/select/trapif lowerings funnel through.

Rules

In cranelift/codegen/src/isa/x64/inst.isle:

(rule 3 (is_nonzero (ctz (ty_32_or_64 ty) val))
      (CondResult.CC (x64_test ty val (RegMemImm.Imm 1)) (CC.Z)))
(rule 3 (is_nonzero (ireduce _ (ctz (ty_32_or_64 ty) val)))
      (CondResult.CC (x64_test ty val (RegMemImm.Imm 1)) (CC.Z)))
(rule 3 (is_nonzero (clz (ty_32_or_64 ty) val))
      (let ((gpr Gpr val)) (CondResult.CC (x64_test ty gpr gpr) (CC.NS))))
(rule 3 (is_nonzero (ireduce _ (clz (ty_32_or_64 ty) val)))
      (let ((gpr Gpr val)) (CondResult.CC (x64_test ty gpr gpr) (CC.NS))))

The ireduce variant catches the wasm front-end's i32.wrap_i64 over a 64-bit ctz/clz — a no-op on values in [0, bitwidth].

Test deltas (tests/disas/ctz-clz-bool-condition.wat)

consumer before after
if_ctz_bare_i32 5 insns (bsfl + cmovel + test + jne) 2 (testl $1, %edx; je)
if_ctz_bare_i64 5 insns (bsfq + cmovq + test + jne) 2 (testq $1, %rdx; je)
if_clz_bare_i32 7 insns (bsr + cmov + sub + test + jne) 2 (testl + jns)

The icmp-mediated cases (collapsed by #13332's egraph rules) are unchanged. The numeric-comparison negative test ((ctz X) == 4) stays untouched.

Motivation

Motoko's moc codegen emits i64.ctz X; i32.wrap_i64; if for compactness/sign tests in the EOP backend (see caffeinelabs/motoko#6103). Before this PR, that lowers to 5 native instructions per dispatch; after, 2.

A concrete idiomatic example: in Motoko, the let-else pattern over Result

let #ok payload = queryProp(...) else return defaultValue;

desugars to a 2-arm refutable variant match (#ok vs #err). The variant-tag hashes are hash("ok") = 0x611C (LSB 0) and hash("err") = 0x4D0765 (LSB 1) — they differ exactly at the LSB. The planned variant-switch BitTest dispatch (caffeinelabs/motoko's gabor/variant-switch) recognizes this and emits a single LSB-test for the dispatch; combined with this PR, the entire let-else lowers to load hash; testq $1, ...; jcc on x64 — three instructions for a pattern match. Every Result-returning API + every let-else-style early return collapses to this shape.

Aggregated across hot paths (variant-switch dispatch, GC compact/heap discriminator, sign tests, …) this is meaningful.

Follow-ups (not in this PR)

  • aarch64, riscv64, s390x analogues — separate PRs once x64 reviewer feedback lands.
  • select-consumer variant — select already routes through is_nonzero_cmpis_nonzero, so this PR's rules cover it too without extra work.

Follow-up to bytecodealliance#13332. That PR added egraph rules collapsing
`(eq (ctz X) 0)` / `(ne (ctz X) 0)` / clz analogues to direct
LSB / sign-bit tests — but only when the comparison is mediated by an
explicit `icmp`. The wasm front-end translates `wasm if (ctz X)` to
`brif (ireduce.i32 (ctz.i64 X))` directly (no `icmp`), so the egraph
rules don't fire on the wasm-natural shape.

This commit closes the gap by specialising `is_nonzero` in the x64
backend — the helper that all `brif`/`select`/`trapif` lowerings
funnel through. Four rules: `ctz`/`clz` × bare/`ireduce`-wrapped.

The `ireduce` variant catches the wasm front-end's `i32.wrap_i64`
over a 64-bit `ctz`/`clz` — a no-op on values in [0, bitwidth].

Test deltas (tests/disas/ctz-clz-bool-condition.wat):

  if_ctz_bare_i32:   5 insns -> 2 (testl $1, %edx; je)
  if_ctz_bare_i64:   5 insns -> 2 (testq $1, %rdx; je)
  if_clz_bare_i32:   7 insns -> 2 (testl %edx, %edx; jns)

The icmp-mediated cases (collapsed by bytecodealliance#13332's egraph rules) are
unchanged. The numeric-comparison negative test stays untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif ggreif changed the title cranelift(x64): lower bare ctz/clz boolean tests via test+CC cranelift(x64): lower bare ctz/clz boolean tests via test+CC May 11, 2026
@ggreif ggreif marked this pull request as ready for review May 11, 2026 16:17
@ggreif ggreif requested review from a team as code owners May 11, 2026 16:17
@ggreif ggreif requested review from pchickey and uweigand and removed request for a team May 11, 2026 16:17
@ggreif ggreif changed the title cranelift(x64): lower bare ctz/clz boolean tests via test+CC cranelift(x64): lower bare ctz/clz boolean tests via test+CC May 11, 2026
@github-actions github-actions Bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen labels May 11, 2026
@alexcrichton alexcrichton removed the request for review from uweigand May 12, 2026 01:49
@alexcrichton
Copy link
Copy Markdown
Member

@cfallin or @fitzgen do y'all have any ideas about how to sort of deduplicate this with the optimization rules landed in #13332? It feels a bit unfortunate that we need basically the same rules twice, once for general expressions and once because cond != 0 is implicit in all conditional-y locations (brif, select, trapnz, etc). Without redefining how those instructions work I'm not sure how to avoid the duplication myself, but figured I'd ask if y'all had ideas. Another option might be to only have these rules in backends as opposed to also the mid-end, but that also doesn't feel great, the duplication probably isn't so bad.

@cfallin
Copy link
Copy Markdown
Member

cfallin commented May 12, 2026

Left a comment over here; putting these rules in the backend is definitely the wrong place IMHO, as aside from the software-engineering aspects (repetition) we want these simplifications to compose with other mid-end opts when possible.

I agree it'd be great to find a way to factor out simplifications for all "non-zero test" cases in the mid-end. I suppose we could define a helper simplify_nonzero and then invoke that with toplevel simplify_skeleton rules on brif, trapnz, etc...

@ggreif
Copy link
Copy Markdown
Contributor Author

ggreif commented May 12, 2026

Following on from @cfallin's (brif (clz x))(brif (band x (iconst 1))) sketch in #13336: I think the LSB extract pairs with ctz, not clz. There are actually four cases — two per producer — and the bit-extract direction is determined by the producer:

pattern tests mid-end simplification
(brif (eqz (ctz x))) LSB set (brif (band x 1)) or (brif (shl x N-1))
(brif (ctz x)) LSB clear inverse of either
(brif (eqz (clz x))) HSB set (brif (ushr x N-1))
(brif (clz x)) HSB clear (brif (eqz (ushr x N-1)))

For the LSB case band x 1 and shl x (N-1) have the same wasm byte cost (small-const + op). For the HSB case ushr x (N-1) is strictly cheaper than the equivalent band with 1 << (N-1): small consts like 31/63 are 1 byte LEB128, but 0x80000000 is 5 bytes — so prefer the shift form for HSB.

I'll try the simplify_skeleton-on-brif route locally with these rules and report back. If it lets me drop the backend duplication in this PR (and #13336), I'll push; otherwise I'll close and rethink.

@ggreif
Copy link
Copy Markdown
Contributor Author

ggreif commented May 12, 2026

Update — the simplify_skeleton extension does work, and it lets us retire both this PR and #13336.

Approach (~50 lines total):

  1. New SkeletonInstSimplification::ReplaceBranchCond(Value) variant — narrow rewrite that swaps argument 0 of an existing brif without touching its opcode or successors. CFG-safe by construction.
  2. Driver patch in cranelift/codegen/src/egraph/mod.rs: allow Opcode::Brif through the previously-blanket is_branch() skip; apply ReplaceBranchCond by writing through inst_args_mut.
  3. prelude_opt.isle adds a replace_branch_cond constructor.
  4. Two new rules in opts/icmp.isle:
    (rule (simplify_skeleton (brif (ctz x_ty X) _ _))
          (replace_branch_cond
            (eq $I8 (band x_ty X (iconst_u x_ty 1)) (iconst_u x_ty 0))))
    (rule (simplify_skeleton (brif (clz x_ty X) _ _))
          (replace_branch_cond (sge $I8 X (iconst_u x_ty 0))))
    

Results on the 2-op brif (ctz x) / brif (clz x) patterns this PR targets:

platform producer mid-end alone (no backend rules)
x86_64 brif (ctz v0) testl $1, %edi; je ✓ (matches what this PR's x64 rules produce)
x86_64 brif (clz v0) testl %edi, %edi; jge ✓ (matches #13336's intent)
aarch64 brif (ctz v0) tbz w0, #0 — single-instruction test-and-branch, tighter than #13336's tst+cmp+b.cc

All 70 cranelift egraph filetests pass. The replace_branch_cond mechanism composes with future similar brif-cond simplifications (any value-level rewrite whose result is "0/non-0-equivalent to the original" qualifies).

I'll push the mid-end branch as a new PR, with revert commits for this PR and #13336. Closing once that lands cleanly.

@ggreif
Copy link
Copy Markdown
Contributor Author

ggreif commented May 12, 2026

Closing in favor of #13343 — the mid-end simplify_skeleton-on-brif extension subsumes this PR's x64 backend rules, the aarch64 rules in #13336, and covers riscv64 and s390x at the same time without any per-backend rule. Cross-backend comparison table is in the PR comment.

@ggreif ggreif closed this May 12, 2026
ggreif added a commit to ggreif/wasmtime that referenced this pull request May 12, 2026
…keleton`

The mid-end rules added in bytecodealliance#13332 hinge on an `icmp eq/ne (ctz/clz X) 0`
shape — i.e. the wasm 3-op pattern `i32.ctz; i32.eqz; br_if`. Frontends
that emit the 2-op form `i32.ctz; br_if` (e.g. Motoko's `moc` after its
`and 1; eqz; br_if` → `ctz; br_if` byte-size peephole) feed `(brif (ctz X))`
into cranelift with no `icmp` for the existing rules to match.

This commit extends `simplify_skeleton` to rewrite the *condition operand*
of an existing `brif` in place, without touching its opcode or successor
blocks (CFG-preserving by construction). A new `SkeletonInstSimplification`
variant `ReplaceBranchCond(Value)` carries the new condition; the egraph
driver applies it by writing through `inst_args_mut`. Two ISLE rules in
`opts/icmp.isle` rewrite `(brif (ctz X) bt be)` and `(brif (clz X) bt be)`
to brifs over the equivalent bit-extract form:

  brif (ctz X) bt be   →   brif (eq (band X 1) 0) bt be
  brif (clz X) bt be   →   brif (sge X 0)         bt be

End-to-end lowering on the resulting brif then composes with existing
backend `icmp+brif` fusion to produce:

  x86_64  brif (ctz X):   `testl $1, %edi; je`
  x86_64  brif (clz X):   `testl %edi, %edi; jge`
  aarch64 brif (ctz X):   `tbz w0, #0` — single-instruction test-and-branch

This subsumes the backend-side x64 rules added in bytecodealliance#13334 and the aarch64
rules in bytecodealliance#13336 (and yields tighter aarch64 code than bytecodealliance#13336 did).

The driver still rejects non-`brif` branches and rejects non-`ReplaceBranchCond`
simplification variants on `brif` (a `Replace inst` of a brif would risk
changing successor block IDs and is left to a future, broader extension).

Filetest `egraph/brif-cnt-cond.clif` covers ctz/clz over i32/i64 in the
2-op form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pull Bot pushed a commit to eduardomourar/wasmtime that referenced this pull request May 13, 2026
…keleton` (bytecodealliance#13343)

* cranelift: fold `ctz`/`clz` directly into `brif` cond via `simplify_skeleton`

The mid-end rules added in bytecodealliance#13332 hinge on an `icmp eq/ne (ctz/clz X) 0`
shape — i.e. the wasm 3-op pattern `i32.ctz; i32.eqz; br_if`. Frontends
that emit the 2-op form `i32.ctz; br_if` (e.g. Motoko's `moc` after its
`and 1; eqz; br_if` → `ctz; br_if` byte-size peephole) feed `(brif (ctz X))`
into cranelift with no `icmp` for the existing rules to match.

This commit extends `simplify_skeleton` to rewrite the *condition operand*
of an existing `brif` in place, without touching its opcode or successor
blocks (CFG-preserving by construction). A new `SkeletonInstSimplification`
variant `ReplaceBranchCond(Value)` carries the new condition; the egraph
driver applies it by writing through `inst_args_mut`. Two ISLE rules in
`opts/icmp.isle` rewrite `(brif (ctz X) bt be)` and `(brif (clz X) bt be)`
to brifs over the equivalent bit-extract form:

  brif (ctz X) bt be   →   brif (eq (band X 1) 0) bt be
  brif (clz X) bt be   →   brif (sge X 0)         bt be

End-to-end lowering on the resulting brif then composes with existing
backend `icmp+brif` fusion to produce:

  x86_64  brif (ctz X):   `testl $1, %edi; je`
  x86_64  brif (clz X):   `testl %edi, %edi; jge`
  aarch64 brif (ctz X):   `tbz w0, #0` — single-instruction test-and-branch

This subsumes the backend-side x64 rules added in bytecodealliance#13334 and the aarch64
rules in bytecodealliance#13336 (and yields tighter aarch64 code than bytecodealliance#13336 did).

The driver still rejects non-`brif` branches and rejects non-`ReplaceBranchCond`
simplification variants on `brif` (a `Replace inst` of a brif would risk
changing successor block IDs and is left to a future, broader extension).

Filetest `egraph/brif-cnt-cond.clif` covers ctz/clz over i32/i64 in the
2-op form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* rustfmt: collapse `is_branch` && opcode-guard onto one line

* tests/disas: re-bless ctz/clz-bool-condition for new mid-end fold

The new `simplify_skeleton`-on-`brif` rule rewrites the 2-op
`if (ctz/clz x)` cases that bytecodealliance#13332's commentary noted were the
non-icmp-mediated holdouts. Bare-form lowering shrinks from
~9 instructions (bsf/bsr + cmov + test + jne + …) to
`testl $1, %edx; je` (ctz) and `testl %edx, %edx; jge` (clz).

Offsets on the subsequent non-bare functions shift down to match.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants