step==1 contiguous fast path in portable compute_slice by pssrawat · Pull Request #19606 · pytorch/executorch

pssrawat · 2026-05-14T22:31:15Z

Summary:
When step == 1 (the common case: tensor.narrow, x[a:b], KV cache reads, etc.),
the per-row slice is a single contiguous block of length*length_per_step bytes.
Replace the inner loop of length separate memcpy(length_per_step) calls with a
single bulk memcpy.

For length=1 slices: equivalent (1 memcpy either way).
For length>1: ~2-10x speedup of the slice itself (fewer function calls, better
cache prefetch, SIMD-friendly bulk copy).
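
A minimal sketch of the idea, not the actual diff: `slice_copy_sketch` and its parameter layout are hypothetical names chosen for illustration, assuming the input is viewed as `[leading_dims, dim_length, length_per_step bytes]`. When `step == 1` the selected elements of a row are adjacent in memory, so one `memcpy` of `length * length_per_step` bytes replaces `length` small copies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a portable slice helper (assumed signature,
// not the ExecuTorch API). For each of `leading_dims` rows, copies `length`
// elements of `length_per_step` bytes, starting at index `start` and
// advancing by `step` along the sliced dimension of size `dim_length`.
inline void slice_copy_sketch(
    const char* in,
    char* out,
    std::size_t leading_dims,
    std::size_t dim_length,
    std::size_t start,
    std::size_t step,
    std::size_t length,
    std::size_t length_per_step) {
  assert(step >= 1 && "step must be positive");
  const char* src = in + start * length_per_step;
  char* dst = out;
  for (std::size_t i = 0; i < leading_dims; ++i) {
    if (step == 1) {
      // Fast path: the slice of this row is one contiguous block.
      std::memcpy(dst, src, length * length_per_step);
      dst += length * length_per_step;
    } else {
      // General path: one copy per selected element.
      const char* s = src;
      for (std::size_t j = 0; j < length; ++j) {
        std::memcpy(dst, s, length_per_step);
        s += step * length_per_step;
        dst += length_per_step;
      }
    }
    // Advance to the same start offset in the next row.
    src += dim_length * length_per_step;
  }
}
```

Both branches write the same bytes when `step == 1`; the fast path only changes how many `memcpy` calls are issued per row.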

Llama4 speech encoder hot path: mask.narrow (12x/chunk), freqs_cos/sin.narrow
(2x/chunk), KV reads (~5x/layer/chunk). 62 slice_copy/chunk * 72 chunks =
4464 (~4500) slices per audio prefill.

Differential Revision: D105241644


pytorch-bot · 2026-05-14T22:31:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19606

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Run pull request jobs on OSDC runners in shadow mode

✅ No Failures

As of commit 81e9bb9 with merge base 4c474af:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-05-14T22:31:23Z

@pssrawat has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105241644.

pssrawat requested a review from manuelcandales as a code owner May 14, 2026 22:31

meta-cla Bot added the CLA Signed label May 14, 2026

meta-codesync Bot added fb-exported meta-exported labels May 14, 2026

pssrawat added the release notes: none label May 15, 2026

pssrawat wants to merge 1 commit into pytorch:main from pssrawat:export-D105241644
