[ExecuTorch][WebGPU] Coalesce SDPA AV V-cache reads along contiguous head-dim #20459
JulianCloudNTH wants to merge 3 commits into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20459
Note: Links to docs will display an error until the docs builds have been completed.
❗ 2 Active SEVs
There are 2 currently active SEVs. If your PR is affected, please view them below:
❌ 1 New Failure, 2 Unrelated Failures
As of commit 25551a5 with merge base e03f777:
NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…head-dim

Pull Request resolved: #20459

**~19% faster SDPA attention-output (AV) stage**: 393→317 µs on llama3 prefill (Chrome Canary / M4 Pro).

**Problem**: V-cache reads load 4 strided context rows × 1 head-dim lane, missing coalescing.

**Solution**: Flip access pattern to read 4 contiguous head-dim lanes per context row:
- **Before**: `load_v_vec4(d, kvh, c4)` → 4 strided rows, `dot()` along D
- **After**: `load_v_d4(c, kvh, d0)` → 4 contiguous D-lanes (16-byte texel), scalar broadcast

**Implementation**:
- Reindex `load_v` helper to read contiguous head-dim
- Replace `dot(A, V)` with `acc += A[c] * V_vec4(d0:d0+3)`
- Mirrors Vulkan `load_v_cache_d4` coalescing pattern

**Constraints**:
- No KV-cache layout change (still `[C, Hkv, D]`)
- Output numerically identical (FP-reassociated, max abs diff 1.43e-6 vs torch)

ghstack-source-id: 395771238
@exported-using-ghexport

Differential Revision: [D109339276](https://our.internmc.facebook.com/intern/diff/D109339276/)
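To see why the reindexed load coalesces, it can help to write out the flat element offsets each pattern touches. The sketch below is illustrative only and is not the shader code from this PR: it assumes a row-major `[C, Hkv, D]` buffer, assumes `c4` indexes a group of four context rows and `d0` is the first of four head-dim lanes, and uses hypothetical sizes and helper names (`flat_offset`, `offsets_before`, `offsets_after`, `strides`).

```python
def flat_offset(c, kvh, d, n_kv_heads, head_dim):
    """Element offset into a row-major [C, Hkv, D] buffer (assumed layout)."""
    return (c * n_kv_heads + kvh) * head_dim + d


def offsets_before(d, kvh, c4, n_kv_heads, head_dim):
    """Old pattern: 4 consecutive context rows, a single head-dim lane d."""
    return [flat_offset(4 * c4 + i, kvh, d, n_kv_heads, head_dim) for i in range(4)]


def offsets_after(c, kvh, d0, n_kv_heads, head_dim):
    """New pattern: one context row c, 4 consecutive head-dim lanes from d0."""
    return [flat_offset(c, kvh, d0 + i, n_kv_heads, head_dim) for i in range(4)]


def strides(offsets):
    """Distance in elements between successive reads."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


HKV, D = 8, 128  # hypothetical sizes, not taken from the PR

before = offsets_before(d=0, kvh=0, c4=0, n_kv_heads=HKV, head_dim=D)
after = offsets_after(c=0, kvh=0, d0=0, n_kv_heads=HKV, head_dim=D)

print(strides(before))  # each step jumps Hkv * D elements
print(strides(after))   # each step is 1 element, so the 4 lanes are adjacent
```

Under these assumptions the old pattern touches four addresses that are `Hkv * D` elements apart, while the new pattern touches four adjacent elements, which is what lets them be fetched as one 16-byte read when each element is 4 bytes.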
SS-JIA left a comment:
Review automatically exported from Phabricator review in Meta.
Stack from ghstack (oldest at bottom):
~19% faster SDPA attention-output (AV) stage — 393→317 µs on llama3 prefill (Chrome Canary / M4 Pro).
Problem: V-cache reads load 4 strided context rows × 1 head-dim lane, missing coalescing.
Solution: Flip access pattern to read 4 contiguous head-dim lanes per context row:
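- Before: `load_v_vec4(d, kvh, c4)` → 4 strided rows, `dot()` along D
- After: `load_v_d4(c, kvh, d0)` → 4 contiguous D-lanes (16-byte texel), scalar broadcast

Implementation:
- Reindex `load_v` helper to read contiguous head-dim
- Replace `dot(A, V)` with `acc += A[c] * V_vec4(d0:d0+3)`
- Mirrors Vulkan `load_v_cache_d4` coalescing pattern

Constraints:
- No KV-cache layout change (still `[C, Hkv, D]`)
- Output numerically identical (FP-reassociated, max abs diff 1.43e-6 vs torch)

@exported-using-ghexport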
Differential Revision: D109339276
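The switch from `dot(A, V)` to a scalar-broadcast accumulate changes only the order in which the products `A[c] * V[c, d]` are summed, which is why the result differs from the reference by floating-point reassociation alone. A minimal sketch of the two loop orders follows, assuming a single KV head, a context length and head-dim that are multiples of 4, and hypothetical function names; it runs in Python's float64, so the difference it prints will be far smaller than the 1.43e-6 reported for the shader.

```python
import random


def av_dot_over_context(attn, v, head_dim):
    """Old order: per lane d, dot 4 attention weights with 4 context rows."""
    n_ctx = len(attn)
    out = [0.0] * head_dim
    for d in range(head_dim):
        for c0 in range(0, n_ctx, 4):
            out[d] += sum(attn[c0 + i] * v[c0 + i][d] for i in range(4))
    return out


def av_scalar_broadcast(attn, v, head_dim):
    """New order: per context row c, broadcast A[c] over 4 contiguous lanes."""
    out = [0.0] * head_dim
    for c, a in enumerate(attn):
        for d0 in range(0, head_dim, 4):
            for i in range(4):
                out[d0 + i] += a * v[c][d0 + i]
    return out


random.seed(0)
N_CTX, HEAD_DIM = 16, 8  # hypothetical sizes, both multiples of 4
attn = [random.random() for _ in range(N_CTX)]
v = [[random.uniform(-1.0, 1.0) for _ in range(HEAD_DIM)] for _ in range(N_CTX)]

old = av_dot_over_context(attn, v, HEAD_DIM)
new = av_scalar_broadcast(attn, v, HEAD_DIM)
max_abs_diff = max(abs(a - b) for a, b in zip(old, new))
print(max_abs_diff)
```

Both functions compute `out[d] = sum over c of A[c] * V[c, d]`; only the grouping of the additions differs, so any nonzero `max_abs_diff` comes from rounding.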