[ExecuTorch][WebGPU] Register-tile the q4gsw quantized-linear kernel by pytorchbot · Pull Request #20500 · pytorch/executorch

pytorchbot · 2026-06-25T00:18:19Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20456 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/51/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/51/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/51/orig

@diff-train-skip-merge

Pull Request resolved: #20456

**Register-tile the `et_vk.linear_q4gsw` GEMM: up to 3.4x faster prefill (M4 Pro, M=128).**

**Problem:** `et_vk.linear_q4gsw` (4-bit weight-only, W4A16) computes `out[m,n] = bias[n] + sum_k input[m,k] * (nibble(weight,n,k)-8) * scale[k/group_size, n]` in a single dispatch over a raw `[N, K/2]` 4-bit weight (2 nibbles/byte, +8-shifted symmetric, groupwise scales). The shipped kernel was naive: one workgroup per output row `m`, threads striding `N`, a scalar K-loop per `(m,n)`. For an M-row (prefill) input it re-extracts every dequantized weight `M` times (once per row) and re-reads each input value once per output column. This redundant memory traffic dominates the prefill GEMM.

**Solution:** a register-tiled GEMM where each thread owns a `TM x TN = 4x4` output tile, so both weights and inputs are loaded once per tile instead of once per element.

- Before: weight `(n,k)` dequantized once per `(m,n)` (extracted `M`x for prefill); `input[m,k]` re-read once per output column `n`.
- After: weight `(n,k)` dequantized ONCE and reused across the `TM` rows of the tile (weight reads drop ~`TM`x); each `input[m,k]` loaded once per `k` into a register and reused across the `TN` columns (input reads drop ~`TN`x).

**Implementation:**

- New loop nest in `q4gsw_linear.wgsl`: per `k`, hoist the `TM` input values into registers, then for each of the `TN` columns dequantize the weight once and accumulate into the `4x4` register tile.
- Host dispatch changes from `M` workgroups to `ceil(M/TM)*ceil(N/TN)` tiles over `wg_size` threads; `wg_size` is computed before the count so the dispatch is still validated against device limits before any allocation.
- Tile-edge lanes (`n0+nl >= N` or `m0+ml >= M`) clamp their weight/scale/input index to the last valid element (the never-stored overhang is harmless), since WGSL out-of-bounds reads are implementation-defined. Mirrors the Vulkan tiled GEMM `q4gsw_linear_gemm__w_4x8.glsl`'s `min(..., N-1)` clamp.
- Deliberate deviations from the Vulkan kernel (recorded in DESIGN_DECISIONS): a `4x4` tile (vs Vulkan `4M x 8N`) for a conservative register budget; the RAW `[N,K/2]` layout with scalar nibble unpack and NO `W_4X8` prepack / NO wide `vec4<u32>` loads (prior on-device measurement found wide loads regress on this GPU); a 1D-flattened tile index (the backend is 1D-dispatch only).

**Constraints:** bindings, `Params`, the weight layout, and the single-dispatch structure are unchanged; the dequant index math is copied verbatim from the naive kernel, so the result is a floating-point accumulation reorder equal to the naive output to fp-rounding.

The `M=1` decode GEMV path and host M-based routing are a separate follow-up.

Authored with assistance from Claude Code.

ghstack-source-id: 396677641
@exported-using-ghexport

Differential Revision: [D109250327](https://our.internmc.facebook.com/intern/diff/D109250327/)
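The loop nest, tile count, and edge clamp described above can be modeled in plain Python. This is a reference sketch of the math, not the WGSL kernel: the helper names (`nibble`, `dequant`, `linear_naive`, `linear_tiled`) are hypothetical, and two details are assumptions the description does not state: that even `k` sits in the low nibble of each byte, and that the flattened tile index is row-major over `(m-tile, n-tile)`.

```python
# Reference model only: pure Python, no GPU. Names and nibble order are assumptions.
from math import ceil

TM, TN = 4, 4  # tile shape stated in the PR description


def nibble(weight, n, k):
    """Unpack the 4-bit value for (n, k) from a raw [N, K/2] byte matrix.

    ASSUMPTION: even k lives in the low nibble, odd k in the high nibble.
    """
    byte = weight[n][k // 2]
    return (byte >> 4) & 0xF if k % 2 else byte & 0xF


def dequant(weight, scale, group_size, n, k):
    # (nibble - 8) * scale[k / group_size, n], as in the PR formula.
    return (nibble(weight, n, k) - 8) * scale[k // group_size][n]


def linear_naive(x, weight, scale, bias, group_size):
    """One scalar K-loop per (m, n): every weight is dequantized M times."""
    M, K, N = len(x), len(x[0]), len(weight)
    return [[bias[n] + sum(x[m][k] * dequant(weight, scale, group_size, n, k)
                           for k in range(K))
             for n in range(N)] for m in range(M)]


def linear_tiled(x, weight, scale, bias, group_size):
    """Each 'thread' owns a TM x TN tile and reuses loads across it."""
    M, K, N = len(x), len(x[0]), len(weight)
    out = [[0.0] * N for _ in range(M)]
    tiles_n = ceil(N / TN)
    num_tiles = ceil(M / TM) * tiles_n  # 1D-flattened tile index
    for tile in range(num_tiles):
        m0, n0 = (tile // tiles_n) * TM, (tile % tiles_n) * TN
        # Edge lanes clamp their read index to the last valid row/column.
        rows = [min(m0 + ml, M - 1) for ml in range(TM)]
        cols = [min(n0 + nl, N - 1) for nl in range(TN)]
        acc = [[0.0] * TN for _ in range(TM)]
        for k in range(K):
            xs = [x[r][k] for r in rows]  # TM inputs hoisted once per k
            for nl, c in enumerate(cols):
                w = dequant(weight, scale, group_size, c, k)  # once per (n, k)
                for ml in range(TM):
                    acc[ml][nl] += xs[ml] * w
        for ml in range(TM):
            for nl in range(TN):
                m, n = m0 + ml, n0 + nl
                if m < M and n < N:  # the overhang is computed but never stored
                    out[m][n] = bias[n] + acc[ml][nl]
    return out
```

For `M=5, N=6` the model dispatches `ceil(5/4)*ceil(6/4) = 4` tiles, and the clamped overhang lanes in the last row and column of tiles do redundant work whose results are discarded, which is the behavior the tile-edge bullet describes.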

pytorch-bot · 2026-06-25T00:18:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20500

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI runner label rename: rebase PRs using old linux.rocm.gpu.gfx950.* labels

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-25T00:19:26Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #20457

Add optimized GEMV kernel for M==1 decode path in q4gsw quantized-linear.

**Problem**: The register-tiled GEMM (from D109250327) wastes 75% of each 4×N tile when M=1, as only 1 of 4 rows is used.

**Solution**: Add a cooperative GEMV kernel that routes M==1 decode to a more efficient path:

- **GEMV**: 64 lanes per workgroup cooperate over K-dimension, each lane loads u32 words (8 K-values), reduces via shared memory
- **GEMM**: M>1 prefill continues using the tiled GEMM

**Routing Logic** (build-time selection, M is static per graph):

- Use GEMV when: M==1 && K%8==0 && group_size%8==0
- Otherwise: Fall back to tiled GEMM

**Constraints**:

- K%8==0: Kernel loads 8 K-values per u32 word
- group_size%8==0: Ensures no quantization-group boundary splits a word (validated via CPU cross-check)
- Llama models (group_size=32/64) satisfy both constraints

**Implementation**:

- New kernel: q4gsw_linear_coop4.wgsl (fixed 64-lane workgroup)
- New utility: clamp_workgroup_count() for grid-stride dispatch (vs compute_1d_workgroup_count which throws)
- Shared infrastructure: Same bind layout, Params, weight format

**Performance**: Keeps decode at measured bandwidth plateau, avoids M=1 tile waste. GEMV uses different reduction order (agrees to fp-rounding, not bit-exact).

ghstack-source-id: 396677650
@exported-using-ghexport

Differential Revision: [D109250570](https://our.internmc.facebook.com/intern/diff/D109250570/)
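The routing predicate and the cooperative reduction can be sketched in Python for a single output column. This is a rough model under stated assumptions, not the contents of `q4gsw_linear_coop4.wgsl`: the names `use_gemv`, `unpack_word`, and `gemv_column` are hypothetical, the bit position of each 4-bit value inside a u32 word is assumed (value `i` in bits `4*i..4*i+3`), and the halving tree stands in for whatever shared-memory reduction the real kernel performs.

```python
# Sketch only: models 64 lanes cooperating over K for one output column.
LANES = 64            # fixed workgroup width stated in the commit message
VALUES_PER_WORD = 8   # eight 4-bit values per u32 word


def use_gemv(M, K, group_size):
    """Routing rule from the commit message; M is static per graph."""
    return M == 1 and K % 8 == 0 and group_size % 8 == 0


def unpack_word(word):
    """Return the eight (nibble - 8) values held in one u32.

    ASSUMPTION: value i occupies bits [4*i, 4*i + 3].
    """
    return [((word >> (4 * i)) & 0xF) - 8 for i in range(VALUES_PER_WORD)]


def gemv_column(x, words, scales, group_size):
    """Dot product of x (length K) with one weight row packed as K/8 words."""
    partial = [0.0] * LANES
    for lane in range(LANES):
        # Each lane strides over the words of the row.
        for wi in range(lane, len(words), LANES):
            k0 = wi * VALUES_PER_WORD
            # group_size % 8 == 0 means a word never straddles two groups,
            # so one scale lookup covers all eight values.
            s = scales[k0 // group_size]
            q = unpack_word(words[wi])
            partial[lane] += s * sum(x[k0 + i] * q[i]
                                     for i in range(VALUES_PER_WORD))
    # Halving tree, standing in for the shared-memory reduction (assumed shape).
    stride = LANES // 2
    while stride:
        for lane in range(stride):
            partial[lane] += partial[lane + stride]
        stride //= 2
    return partial[0]
```

Because the per-lane partial sums are combined in tree order rather than in increasing `k`, the result can differ from a sequential sum in the last bits, which matches the note that the GEMV agrees to fp-rounding but is not bit-exact.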

…dths

Pull Request resolved: #20458

Add optimized vec4 kernel for bandwidth-bound rms_norm on Llama decode.

**Problem**: Scalar kernel loads one element per lane per iteration, which is bandwidth-limited on Llama decode.

**Solution**: Add vec4 kernel that loads/stores four contiguous elements as `vec4<f32>` and squares them with `dot(v, v)`, cutting loop iterations 4× and widening memory transactions.

**Routing Logic**:

- Use vec4 when: row_width % 4 == 0
- Otherwise: Fall back to scalar kernel

**Constraints**:

- row_width % 4 == 0: vec4 kernel has no partial-texel tail handling
- Llama models (all hidden sizes 4-aligned) satisfy constraint

**Implementation**:

- New kernel: rms_norm_vec4.wgsl (same 64-lane workgroup)
- Shared infrastructure: Same bind layout, Params, dispatch
- Numerical: Float reassociation differs, not bit-identical to scalar

**Performance**: ~33% faster on Apple M4 Pro / Metal across benchmark shapes (largest on decode, smallest on long prefill where already bandwidth-bound).

This change was authored with assistance from Claude.

ghstack-source-id: 396677654
@exported-using-ghexport

Differential Revision: [D109333390](https://our.internmc.facebook.com/intern/diff/D109333390/)
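The scalar-versus-vec4 split and its routing rule can be illustrated with a small Python model. It is a sketch under assumptions, not the shader: the function names are hypothetical, and the normalization formula (`y = x * w / sqrt(mean(x^2) + eps)`) and the `eps` default are the conventional RMSNorm definition, which the commit message does not spell out.

```python
# Sketch only: models the sum-of-squares loop of a scalar and a 4-wide kernel.
from math import sqrt


def sum_sq_scalar(row):
    """One element per iteration, as in the scalar kernel."""
    total = 0.0
    for v in row:
        total += v * v
    return total


def sum_sq_vec4(row):
    """Four contiguous elements per iteration; requires len(row) % 4 == 0."""
    assert len(row) % 4 == 0, "no tail handling, mirroring the stated constraint"
    total = 0.0
    for i in range(0, len(row), 4):
        a, b, c, d = row[i:i + 4]
        total += a * a + b * b + c * c + d * d  # stands in for dot(v, v)
    return total


def rms_norm(row, weight, eps=1e-6):
    """ASSUMPTION: conventional RMSNorm; eps default is illustrative."""
    # Routing rule from the commit message: vec4 only for 4-aligned rows.
    if len(row) % 4 == 0:
        sum_sq = sum_sq_vec4(row)
    else:
        sum_sq = sum_sq_scalar(row)
    inv_rms = 1.0 / sqrt(sum_sq / len(row) + eps)
    return [v * inv_rms * w for v, w in zip(row, weight)]
```

The vec4 variant runs a quarter of the loop iterations, and because it groups four squares before adding them to the running total, its floating-point result can differ from the scalar one in the last bits, consistent with the "not bit-identical" note.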

pytorchbot had a problem deploying to cadence June 25, 2026 00:18 — with GitHub Actions Error

meta-cla Bot added the CLA Signed label Jun 25, 2026 (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)

JulianCloudNTH added 3 commits June 24, 2026 17:25

Merge branch 'main' into gh/JulianCloudNTH/51/orig

fdff0a6

JulianCloudNTH temporarily deployed to cadence June 25, 2026 00:26 — with GitHub Actions Inactive

JulianCloudNTH self-requested a review June 25, 2026 00:29

JulianCloudNTH approved these changes Jun 25, 2026


JulianCloudNTH merged commit 80b6c34 into main Jun 25, 2026
173 of 176 checks passed

JulianCloudNTH deleted the gh/JulianCloudNTH/51/orig branch June 25, 2026 00:35

JulianCloudNTH merged 4 commits into main from gh/JulianCloudNTH/51/orig

2 participants
