[ExecuTorch][WebGPU] Coalesce SDPA AV V-cache reads along contiguous head-dim by JulianCloudNTH · Pull Request #20459 · pytorch/executorch

JulianCloudNTH · 2026-06-23T20:21:54Z

Stack from ghstack (oldest at bottom):

~19% faster SDPA attention-output (AV) stage — 393→317 µs on llama3 prefill (Chrome Canary / M4 Pro).

Problem: V-cache reads load 4 strided context rows × 1 head-dim lane, missing coalescing.

Solution: Flip access pattern to read 4 contiguous head-dim lanes per context row:

- Before: `load_v_vec4(d, kvh, c4)` → 4 strided rows, `dot()` along D
- After: `load_v_d4(c, kvh, d0)` → 4 contiguous D-lanes (16-byte texel), scalar broadcast
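
To make the difference between the two access patterns concrete, here is a minimal Python sketch of the flat element offsets each one touches in a row-major `[C, Hkv, D]` buffer. The helper names (`v_offset`, `offsets_before`, `offsets_after`), the sizes, and the reading of `c4` as a group-of-four index are illustrative assumptions, not code from the PR.

```python
def v_offset(c, kvh, d, num_kv_heads, head_dim):
    """Flat element offset into a row-major [C, Hkv, D] V-cache."""
    return (c * num_kv_heads + kvh) * head_dim + d


def offsets_before(d, kvh, c4, num_kv_heads, head_dim):
    # Old pattern: four consecutive context rows, one head-dim lane each.
    # Assumption: c4 indexes a group of four context rows.
    return [v_offset(4 * c4 + i, kvh, d, num_kv_heads, head_dim) for i in range(4)]


def offsets_after(c, kvh, d0, num_kv_heads, head_dim):
    # New pattern: one context row, four consecutive head-dim lanes.
    return [v_offset(c, kvh, d0 + i, num_kv_heads, head_dim) for i in range(4)]


HKV, D = 8, 128  # illustrative sizes, not taken from the PR
before = offsets_before(d=0, kvh=0, c4=0, num_kv_heads=HKV, head_dim=D)
after = offsets_after(c=0, kvh=0, d0=0, num_kv_heads=HKV, head_dim=D)
print(before)  # [0, 1024, 2048, 3072]
print(after)   # [0, 1, 2, 3]
```

In the old pattern each of the four elements sits `Hkv * D` elements apart, so they land in four separate memory regions. In the new pattern they are adjacent, which (assuming 4-byte elements) is a single 16-byte span.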

Implementation:

- Reindex `load_v` helper to read contiguous head-dim
- Replace `dot(A, V)` with `acc += A[c] * V_vec4(d0:d0+3)`
- Mirrors Vulkan `load_v_cache_d4` coalescing pattern
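
The change of accumulation order can be sketched in plain Python. This is an illustrative model, not the WGSL kernel: `av_dot` assumes the old form dots four attention weights against four context rows for one output lane, `av_broadcast` assumes the new form broadcasts one scalar weight over four contiguous head-dim lanes, and both assume `C` and `D` are multiples of 4. All names and sizes are hypothetical.

```python
import random


def dot4(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def av_dot(A, V, head_dim):
    # Dot-product form: per output lane d, gather four context rows
    # (strided in memory) and dot them with four attention weights.
    out = [0.0] * head_dim
    for d in range(head_dim):
        for c0 in range(0, len(A), 4):
            v4 = [V[c0 + i][d] for i in range(4)]
            out[d] += dot4(A[c0:c0 + 4], v4)
    return out


def av_broadcast(A, V, head_dim):
    # Broadcast form: per context row c, read four contiguous head-dim
    # lanes and scale them by the single weight A[c].
    out = [0.0] * head_dim
    for d0 in range(0, head_dim, 4):
        acc = [0.0, 0.0, 0.0, 0.0]
        for c in range(len(A)):
            v4 = V[c][d0:d0 + 4]
            for i in range(4):
                acc[i] += A[c] * v4[i]
        out[d0:d0 + 4] = acc
    return out


random.seed(0)
C, D = 16, 8  # illustrative sizes only
raw = [random.random() for _ in range(C)]
A = [x / sum(raw) for x in raw]  # stand-in for softmax weights
V = [[random.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(C)]

ref = av_dot(A, V, D)
new = av_broadcast(A, V, D)
max_abs_diff = max(abs(r - n) for r, n in zip(ref, new))
print(max_abs_diff)
```

Both functions compute the same sums but group the additions differently, so any difference between them comes only from floating-point reassociation.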

Constraints:

- No KV-cache layout change (still `[C, Hkv, D]`)
- Output unchanged up to floating-point reassociation (max abs diff 1.43e-6 vs torch)

@exported-using-ghexport

Differential Revision: D109339276

pytorch-bot · 2026-06-23T20:21:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20459

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure, 2 Unrelated Failures

As of commit 25551a5 with merge base e03f777:

NEW FAILURE - The following job has failed:

pull / unittest-buck / linux / linux-job (gh)
RuntimeError: Command docker exec -t a2fb19a2b9ed0f275fa0b09bb2430472bff849159b4923813cf3834265a14296 /exec failed with exit code 3

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest-buck / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 3

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-23T20:22:57Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…head-dim

Pull Request resolved: #20459

**~19% faster SDPA attention-output (AV) stage**: 393→317 µs on llama3 prefill (Chrome Canary / M4 Pro).

**Problem**: V-cache reads load 4 strided context rows × 1 head-dim lane, missing coalescing.

**Solution**: Flip access pattern to read 4 contiguous head-dim lanes per context row:

- **Before**: `load_v_vec4(d, kvh, c4)` → 4 strided rows, `dot()` along D
- **After**: `load_v_d4(c, kvh, d0)` → 4 contiguous D-lanes (16-byte texel), scalar broadcast

**Implementation**:

- Reindex `load_v` helper to read contiguous head-dim
- Replace `dot(A, V)` with `acc += A[c] * V_vec4(d0:d0+3)`
- Mirrors Vulkan `load_v_cache_d4` coalescing pattern

**Constraints**:

- No KV-cache layout change (still `[C, Hkv, D]`)
- Output numerically identical (FP-reassociated, max abs diff 1.43e-6 vs torch)

ghstack-source-id: 395771238
@exported-using-ghexport

Differential Revision: [D109339276](https://our.internmc.facebook.com/intern/diff/D109339276/)

SS-JIA

Review automatically exported from Phabricator review in Meta.

Update

8e68294

meta-cla Bot added the CLA Signed label Jun 23, 2026

JulianCloudNTH mentioned this pull request Jun 23, 2026

[ExecuTorch][WebGPU] Register-tile the SDPA QK/AV kernels #20405

Open

JulianCloudNTH temporarily deployed to cadence June 23, 2026 20:47 — with GitHub Actions Inactive

Update

233231a

This was referenced Jun 24, 2026

[ExecuTorch][WebGPU] SDPA: skip QK contraction for fully-masked causal tiles #20492

Open

[ExecuTorch][WebGPU] SDPA: branchless aligned/tail loads in the QK/AV kernels #20493

Open

JulianCloudNTH temporarily deployed to cadence June 24, 2026 19:54 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 24, 2026

SS-JIA approved these changes Jun 24, 2026

Update

25551a5

JulianCloudNTH temporarily deployed to cadence June 25, 2026 02:35 — with GitHub Actions Inactive

JulianCloudNTH wants to merge 3 commits into gh/JulianCloudNTH/54/base from gh/JulianCloudNTH/54/head
