Wrap timed custom_kernel launches in a cudaProfiler capture range (qr_v2, eigh_py)#157

Open
robobryce wants to merge 1 commit into gpu-mode:main from robobryce:profiler-capture-range

@robobryce robobryce commented Jun 26, 2026

Contributor

What

Wrap the timed custom_kernel launches in eval.py's benchmark loop in a torch.cuda.profiler.profile() context manager, for the two linalg problems that ship a problem-local eval.py with this timing loop: qr_v2 and eigh_py.

start_event.record()
with torch.cuda.profiler.profile():
    outputs = [custom_kernel(data) for data in data_list]
end_event.record()
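
A minimal, self-contained sketch of the same pattern, assuming nothing about eval.py beyond the snippet above. `capture_range` and `timed_launches` are hypothetical helper names, and the fallback to `contextlib.nullcontext()` on machines without PyTorch or a CUDA device is an addition for illustration only, not part of this PR:

```python
import contextlib


def capture_range():
    """Return a context manager that marks a cudaProfilerStart/Stop range.

    Falls back to a do-nothing context when PyTorch or a CUDA device is
    unavailable. That fallback is an illustrative assumption; the PR itself
    uses torch.cuda.profiler.profile() directly.
    """
    try:
        import torch
        import torch.cuda.profiler
    except ImportError:
        return contextlib.nullcontext()
    if not torch.cuda.is_available():
        return contextlib.nullcontext()
    return torch.cuda.profiler.profile()


def timed_launches(custom_kernel, data_list):
    """Run every launch of one timed iteration inside the capture range."""
    with capture_range():
        return [custom_kernel(data) for data in data_list]


if __name__ == "__main__":
    # Stand-in "kernel": any callable is enough for the sketch.
    print(timed_launches(lambda x: x * 2, [1, 2, 3]))
```

Only the launches sit inside the `with` block; the event records, the L2 flush, and the correctness check stay outside it, which is what keeps them out of the captured range.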

Why

Per repeat, the benchmark timing loop runs an L2-cache flush (clear_l2_cache()) and, in recheck/leaderboard modes, a reference correctness check around the timed custom_kernel launches; one-time input generation and warmup run above the loop. Profiling eval.py directly therefore captures all of the following mixed in with the submission's own kernels:

  • one-time input generation (for several cases this runs torch.linalg.qr to build the random orthogonal basis — i.e. cuSOLVER work),
  • the warmup launches + initial correctness check,
  • a per-repeat clear_l2_cache(), and
  • the per-repeat reference correctness checker (a batch of FP64 residual matmuls, run on every repeat in recheck/leaderboard mode) — the dominant chunk of extra GPU work.

That makes nsys / ncu output hard to attribute. It is especially costly for ncu, which replays every kernel it sees with tens of instrumentation passes: without a capture range it spends most of its time on the reference checker and the input-gen QR rather than the kernel under test.

Wrapping only the timed launches lets a profiler started with the matching capture-range flag record exactly those launches and nothing else. The context manager is a no-op when no profiler is attached, so test, benchmark, and leaderboard runs, and the timings they report, are unaffected. Verified on B200: with the range in place, nsys captures one timed iteration's custom_kernel launches; without a profiler attached, benchmark timing is unchanged.
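
The effect of the capture range can be illustrated with a toy model in plain Python. Nothing below touches CUDA or a real profiler: `ToyProfiler`, `benchmark`, and the phase names are hypothetical stand-ins for the phases listed above, and the model does not reproduce `--capture-range-end=stop` (it keeps recording on every repeat).

```python
import contextlib


class ToyProfiler:
    """Toy stand-in for a profiler that may honor a capture range."""

    def __init__(self, use_capture_range):
        self.use_capture_range = use_capture_range
        # Without a capture range, collection starts at process start.
        self.recording = not use_capture_range
        self.recorded = []

    def launch(self, name):
        """Pretend to launch GPU work; record it only while collecting."""
        if self.recording:
            self.recorded.append(name)

    @contextlib.contextmanager
    def capture_range(self):
        """Mimic the start/stop pair issued around the timed launches."""
        previous = self.recording
        if self.use_capture_range:
            self.recording = True
        try:
            yield
        finally:
            self.recording = previous


def benchmark(profiler, repeats=2):
    """Walk through the phases of the timing loop in order."""
    profiler.launch("input_gen_qr")      # one-time input generation
    profiler.launch("warmup")            # warmup + initial correctness check
    for _ in range(repeats):
        profiler.launch("clear_l2_cache")
        with profiler.capture_range():
            profiler.launch("custom_kernel")
        profiler.launch("reference_check")
    return profiler.recorded


if __name__ == "__main__":
    print(benchmark(ToyProfiler(use_capture_range=True)))
    print(benchmark(ToyProfiler(use_capture_range=False)))
```

With the range honored, only the `custom_kernel` entries are recorded; without it, every phase lands in the trace.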

How to enable it

The wrapper is inert on its own — it only takes effect when the profiler is launched with the matching capture-range flag. Run the problem's eval.py in benchmark mode over a one-line shape spec, the same way the repo already runs it (POPCORN_FD set, the spec file as argv[2]), wrapped in the profiler. The relevant flags:

Nsight Systems (timeline) — --capture-range=cudaProfilerApi restricts the trace to the wrapped launches; --capture-range-end=stop ends collection after the first timed iteration, so the report is one clean iteration instead of the whole repeat loop:

printf 'batch: 640; n: 512; cond: 2; seed: 1029\n' > shape.txt

POPCORN_FD=3 nsys profile \
    --capture-range=cudaProfilerApi --capture-range-end=stop \
    --trace=cuda,nvtx -o prof \
    python problems/linalg/eigh_py/eval.py benchmark shape.txt 3>/dev/null

nsys stats --report cuda_gpu_kern_sum prof.nsys-rep

Nsight Compute (per-kernel) — --profile-from-start off makes ncu instrument only the kernels inside the range, skipping the input generation, warmup, L2 flush, and the per-repeat reference checker instead of replaying all of them:

POPCORN_FD=3 ncu --profile-from-start off --set full -o prof \
    python problems/linalg/eigh_py/eval.py benchmark shape.txt 3>/dev/null

Without either flag the wrapper is a no-op and the run behaves exactly as before. Swap the eigh_py path/shape for qr_v2 to profile that problem.
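
For scripted runs, the two invocations above could be assembled from Python. This is a sketch under stated assumptions: the helper names (`shape_spec_line`, `eval_argv`, `nsys_argv`, `ncu_argv`, `run_with_popcorn_fd`) are hypothetical and not part of the repo, the flags and paths are copied verbatim from the commands above, and `run_with_popcorn_fd` relies on `pass_fds`, which is POSIX-only.

```python
import os
import subprocess

# Path and mode copied from the commands above; swap in qr_v2 as needed.
EVAL_PATH = "problems/linalg/eigh_py/eval.py"


def shape_spec_line(batch, n, cond, seed):
    """Format one shape spec line in the form used by the printf above."""
    return f"batch: {batch}; n: {n}; cond: {cond}; seed: {seed}\n"


def eval_argv(eval_path=EVAL_PATH, spec="shape.txt"):
    """Command that runs eval.py in benchmark mode over a spec file."""
    return ["python", eval_path, "benchmark", spec]


def nsys_argv(output="prof", eval_path=EVAL_PATH, spec="shape.txt"):
    """Nsight Systems command restricted to the cudaProfilerApi range."""
    return [
        "nsys", "profile",
        "--capture-range=cudaProfilerApi", "--capture-range-end=stop",
        "--trace=cuda,nvtx", "-o", output,
        *eval_argv(eval_path, spec),
    ]


def ncu_argv(output="prof", eval_path=EVAL_PATH, spec="shape.txt"):
    """Nsight Compute command that waits for the range before profiling."""
    return [
        "ncu", "--profile-from-start", "off", "--set", "full", "-o", output,
        *eval_argv(eval_path, spec),
    ]


def run_with_popcorn_fd(argv):
    """Run argv with POPCORN_FD pointing at an inherited /dev/null descriptor.

    Plays the role of `POPCORN_FD=3 ... 3>/dev/null` in the shell commands,
    except the descriptor number is whatever the OS hands out, not 3.
    """
    with open(os.devnull, "w") as sink:
        env = dict(os.environ, POPCORN_FD=str(sink.fileno()))
        return subprocess.run(argv, env=env, pass_fds=(sink.fileno(),), check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_with_popcorn_fd(...) to execute one.
    print(shape_spec_line(640, 512, 2, 1029), end="")
    print(" ".join(nsys_argv()))
    print(" ".join(ncu_argv()))
```

Keeping the profiler flags in one place makes it harder to launch a run without the capture-range flag, in which case the wrapper would silently do nothing.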

Scope

Two files, +16/−2 — only the profiler context manager:

  • problems/linalg/qr_v2/eval.py
  • problems/linalg/eigh_py/eval.py

The benchmark timing loop in eval.py interleaves, per repeat, an L2-cache flush
and (in recheck/leaderboard modes) a cuSOLVER/cuBLAS reference correctness check
around the timed custom_kernel launches. Profiling eval.py directly therefore
captures all of that — warmup, the L2 flush, and the reference solver — not just
the submission's kernels, making nsys/ncu output hard to attribute.

Wrap only the timed `custom_kernel` launches in `torch.cuda.profiler.profile()`.
A profiler started with `nsys --capture-range=cudaProfilerApi` or
`ncu --profile-from-start off` then records exactly those launches and nothing
else. The context manager is a no-op when no profiler is attached, so test,
benchmark, and leaderboard runs and their reported timings are unchanged.

Applied to both linalg problems that ship a problem-local eval.py with this
timing loop: qr_v2 and eigh_py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>