
Test using pytest-run-parallel and related fixups in the tests #2194

Draft
seberg wants to merge 17 commits into NVIDIA:main from seberg:ft-testing

Conversation

seberg commented Jun 10, 2026

Description

This is the complete follow-up to gh-2162, posted in full to give the whole picture. I tried to split the commits up roughly by topic and could split them into individual PRs as well.

I am planning to take another look through it myself (to see if I can think of a nicer pattern than the current mini-plugins).

The buffer closing/sync problem is, I think, an upstream issue; I have opened a bug for it.

Another reason to split things up and get started: I rebased, and of course there are new issues :).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot Bot commented Jun 10, 2026
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added the cuda.bindings, cuda.core, and cuda.pathfinder labels Jun 10, 2026
seberg changed the title from "Ft testing" to "Test using pytest-run-parallel and related fixups in the tests" Jun 10, 2026
leofang self-requested a review June 11, 2026 02:55
seberg commented Jun 11, 2026
Author

/ok to test ed40f60


seberg added 16 commits June 11, 2026 16:05
This fixes a few threading issues, but we may still want to discuss some
details.
* The GraphNode cleanup order is an important fix. Another thread may
  end up with the same pointer (but a new object) as soon as we clean it
  up, so we have to remove it from the cache before cleaning it up.
* Use of atomics: I think this is needed; for this one place
  an atomic seemed the more reasonable choice.  (However, it is hard to
  test and, IIUC, it can only fail on ARM.)
* The critical sections should be pretty safe.  I am not sure they
  all guarantee that the same object (by _identity_) is always returned,
  but I am pretty sure they protect against worse races.
  (Testing did find this for MemPool.attributes, not others yet.
  Testing with thread-sanitizer might flush out more...)
* The split mutex: this is thread-unsafe, but I am honestly not
  sure whether that is just expected, or whether the mutex is good
  but it should also be safe from within CUDA.
* The `setdefault` caching pattern is largely just normalization.  Without
  the `return dict.setdefault`, a different instance may be returned on
  different threads (or a cache entry may be replaced).
  For `cyGraphMemoryResource` this triggered a test failure with
  pytest-run-parallel, although that doesn't mean it is problematic as such.
  `cuda-pathfinder` uses functools.cache, but usually for strings;
  the one we may want to look at is `load_nvidia_dynamic_lib`.

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
… 3.15t

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
E.g. CUDA needs to be initialized in each thread, but fixtures
run before pytest-run-parallel launches the threads.
So we create a mini-plugin to deal with this.  We could also solve
this with decorators in many cases, but that would require adding
a lot of decorators...
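
A rough sketch of the underlying problem, not of the mini-plugin itself: setup that has to happen in each thread cannot live in code that runs once on the main thread before the workers start. One generic way to handle this is a lazy, thread-local guard. All names below (`ensure_initialized`, `_per_thread_setup`, `run_parallel`) are hypothetical, and the setup body is only a placeholder for whatever per-thread CUDA initialization is needed.

```python
import threading

_thread_state = threading.local()
_init_lock = threading.Lock()
init_calls = []  # one entry per thread that actually ran the setup


def _per_thread_setup():
    # Placeholder for real per-thread work (hypothetical), e.g. making a
    # device or context current for the calling thread.
    with _init_lock:
        init_calls.append(threading.get_ident())


def ensure_initialized():
    """Run the per-thread setup at most once in the calling thread."""
    if not getattr(_thread_state, "ready", False):
        _per_thread_setup()
        _thread_state.ready = True


def run_parallel(test_body, n_threads=4):
    """Mimic a parallel runner: call test_body in n_threads threads."""
    barrier = threading.Barrier(n_threads)

    def worker():
        barrier.wait()        # start all workers together
        ensure_initialized()  # setup happens inside the worker thread
        test_body()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `ensure_initialized()` again from the test body is a no-op in that thread, so the guard can be placed wherever it is convenient.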

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
- thread_unsafe: nvml init ref-count, graphMem attr, mock-based tests,
  OpenGL, peer-access pool state, multiprocessing warning, program-cache
  race reproduction, and functools.cache mutation tests
- parallel_threads_limit: IPC / worker-pool tests that spawn subprocesses
  or open file descriptors (limit 4), example tests (limit 8), and the
  event-registration test whose timeouts are slow

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
…unsafe always

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
…empool

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
After my first AI try was a crazy mess, the second run actually found
a neat solution...
These objects can be created in the main thread, but we can't create
them on the fly in many threads as was done before...

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
For some reason the latch kernel helper test started failing now
(it did not before my update from CUDA 13.2 to 13.3?).

The reason isn't that it is not thread-safe, but that something
(presumably module loading/unloading) causes synchronizations, which
in turn force threads to wait for their LatchKernel to finish.

And of course the test itself really needs that not to happen.
Making sure there is only one LatchKernel, compiled and loaded exactly
once, seems to avoid this problem.

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
…cal_section

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
github-actions Bot added the CI/CD label Jun 11, 2026
seberg commented Jun 11, 2026
Author

Fun, the refactor made the cufile xfail-strict tests pass on CI, but I didn't set up the parallel run correctly... one more try:

/ok to test eb6a2ff

seberg commented Jun 11, 2026
Author

/ok to test eb6a2ff

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
seberg commented Jun 11, 2026
Author

/ok to test 7b59bff


Labels

  • CI/CD: CI/CD infrastructure
  • cuda.bindings: Everything related to the cuda.bindings module
  • cuda.core: Everything related to the cuda.core module
  • cuda.pathfinder: Everything related to the cuda.pathfinder module
