Wrap timed custom_kernel launches in a cudaProfiler capture range (qr_v2, eigh_py)#157
Open
robobryce wants to merge 1 commit into
The benchmark timing loop in eval.py interleaves, per repeat, an L2-cache flush and (in recheck/leaderboard modes) a cuSOLVER/cuBLAS reference correctness check around the timed custom_kernel launches. Profiling eval.py directly therefore captures all of that (warmup, the L2 flush, and the reference solver), not just the submission's kernels, making nsys/ncu output hard to attribute.

Wrap only the timed `custom_kernel` launches in `torch.cuda.profiler.profile()`. A profiler started with `nsys --capture-range=cudaProfilerApi` or `ncu --profile-from-start off` then records exactly those launches and nothing else. The context manager is a no-op when no profiler is attached, so test, benchmark, and leaderboard runs and their reported timings are unchanged.

Applied to both linalg problems that ship a problem-local eval.py with this timing loop: qr_v2 and eigh_py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Wrap the timed `custom_kernel` launches in `eval.py`'s benchmark loop in a `torch.cuda.profiler.profile()` context manager, for the two `linalg` problems that ship a problem-local `eval.py` with this timing loop: `qr_v2` and `eigh_py`.

Why
The benchmark timing loop interleaves, per repeat, an L2-cache flush (`clear_l2_cache()`) and, in recheck/leaderboard modes, a reference correctness check around the timed `custom_kernel` launches, with one-time input generation and warmup above the loop. Profiling `eval.py` directly therefore captures all of that mixed in with the submission's own kernels: input generation (`torch.linalg.qr` to build the random orthogonal basis, i.e. cuSOLVER work), warmup, `clear_l2_cache()`, and the reference correctness check.

That makes `nsys`/`ncu` output hard to attribute, and it is especially costly for `ncu`, which replays every kernel it sees with tens of instrumentation passes. Without a capture range, it spends most of its time on the reference checker and the input-gen QR rather than the kernel under test.

Wrapping only the timed launches lets a profiler started with the matching capture-range flag record exactly those launches and nothing else. The context manager is a no-op when no profiler is attached, so `test`, `benchmark`, and `leaderboard` runs, and the timings they report, are unaffected. Verified on B200: with the range in place, `nsys` captures one timed iteration's `custom_kernel` launches; without a profiler attached, benchmark timing is unchanged.

How to enable it
The wrapper is inert on its own: it only takes effect when the profiler is launched with the matching capture-range flag. Run the problem's `eval.py` in `benchmark` mode over a one-line shape spec, the same way the repo already runs it (`POPCORN_FD` set, the spec file as argv[2]), wrapped in the profiler. The relevant flags:

Nsight Systems (timeline): `--capture-range=cudaProfilerApi` restricts the trace to the wrapped launches; `--capture-range-end=stop` ends collection after the first timed iteration, so the report is one clean iteration instead of the whole repeat loop.

Nsight Compute (per-kernel): `--profile-from-start off` makes `ncu` instrument only the kernels inside the range, skipping the input generation, warmup, L2 flush, and the per-repeat reference checker instead of replaying all of them:

    POPCORN_FD=3 ncu --profile-from-start off --set full -o prof \
        python problems/linalg/eigh_py/eval.py benchmark shape.txt 3>/dev/null

Without either flag the wrapper is a no-op and the run behaves exactly as before. Swap the `eigh_py` path/shape for `qr_v2` to profile that problem.

Scope
Two files, +16/−2, containing only the profiler context manager:
- `problems/linalg/qr_v2/eval.py`
- `problems/linalg/eigh_py/eval.py`
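
For orientation, the pattern this PR describes (a profiler range around only the timed launches) can be sketched as below. This is a minimal illustration, not the actual diff: `profiler_range` and `run_benchmark` are hypothetical names, the `clear_l2_cache` argument is a stand-in for the helper of the same name in `eval.py`, and `time.perf_counter` is used as a placeholder for whatever timing mechanism the real loop uses. The fallback to `contextlib.nullcontext()` exists only so the sketch runs on a machine without PyTorch or a GPU.

```python
import contextlib
import time


def profiler_range():
    """Return torch.cuda.profiler.profile() when CUDA is usable.

    Falls back to a do-nothing context so this sketch also runs
    without PyTorch or a GPU (an assumption of the sketch only).
    """
    try:
        import torch
    except ImportError:
        return contextlib.nullcontext()
    if not torch.cuda.is_available():
        return contextlib.nullcontext()
    return torch.cuda.profiler.profile()


def run_benchmark(custom_kernel, data, repeats=3, clear_l2_cache=lambda: None):
    """Hypothetical stand-in for the per-repeat timing loop."""
    durations = []
    for _ in range(repeats):
        clear_l2_cache()  # L2 flush stays outside the capture range

        # Only the timed launch sits inside the range, so a profiler
        # started with a capture-range flag records just this call.
        with profiler_range():
            start = time.perf_counter()
            custom_kernel(data)
            durations.append(time.perf_counter() - start)

        # A reference correctness check would run here, outside the range.
    return durations
```

Calling `run_benchmark(my_kernel, my_input)` with no profiler attached behaves like an ordinary timing loop. Under `nsys --capture-range=cudaProfilerApi` or `ncu --profile-from-start off`, only the calls inside the `with` block would be recorded.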