[MINOR][CI] Fix leaked threads hanging Java test forks#2488
Open
Baunsgaard wants to merge 5 commits into
Open
Conversation
The component.c** suite cannot finish within the 30-minute GitHub Actions job cap. The cap cancels the whole job before surefire's per-fork timeout (test-forkedProcessTimeout, 600s) can kill hung forks and run the remaining tests in the suite. Raise the job cap to 60 minutes so surefire's kill-and-continue has room to complete the suite instead of GitHub Actions cancelling the job mid-run.
In-JVM federated workers (used by the lineage federated tests) ran on non-daemon threads, and their Netty boss/worker event loops were also non-daemon. When a test failed before calling shutdownThreads(...), the leaked worker kept the surefire fork JVM alive until the CI job timeout cancelled the whole job. - FederatedWorker: create the Netty event-loop groups with daemon threads. A standalone worker process keeps the JVM alive via its main thread, so this only affects in-JVM (test) workers, where it lets a leaked worker no longer block JVM exit. - AutomatedTestBase: mark the in-JVM worker wrapper threads as daemon. - TestUtils.shutdownThread: bound the join() so cleanup cannot block indefinitely if a worker ignores the interrupt. - LineageFedReuseAlg, FedFullReuseTest, FedUDFReuseTest: shut down workers in a finally block so they are reaped on the failure path.
CommonThreadPool.get(k) returns a ForkJoinPool (daemon threads) for main/PARFOR/FedExec threads, but falls back to a plain Executors.newFixedThreadPool(k) for any other caller thread, and getDynamicPool() uses Executors.newCachedThreadPool(); both default to non-daemon threads. Under parallel test execution the test thread is not the master thread, so component tests that compress directly hit the fallback path. A pool left unshut (e.g. stored as a field or skipped on an exception path) then keeps its non-daemon threads alive and blocks the surefire fork JVM from exiting, stalling the suite until the CI job timeout. Make both fallback pools use daemon threads, matching the ForkJoinPool paths. Daemon-ness only affects JVM shutdown: pending work keeps the submitting thread alive, so this cannot terminate in-flight tasks early.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2488 +/- ##
============================================
+ Coverage 71.41% 71.42% +0.01%
- Complexity 48817 48844 +27
============================================
Files 1572 1572
Lines 189089 189115 +26
Branches 37101 37106 +5
============================================
+ Hits 135030 135077 +47
+ Misses 43596 43572 -24
- Partials 10463 10466 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Restore the 30-minute job cap now that the underlying thread-leak hangs are fixed (daemon Netty event loops, daemon CommonThreadPool fallback pools, and bounded test worker cleanup). The 600s per-fork surefire timeout still operates well within this cap, so a genuinely hung fork is killed by surefire before GitHub Actions cancels the job, restoring fast feedback on real hangs.
Contributor
Author
|
@janniklinde thanks to your points in the previous commit, I found the source of the issue for timeouts in many tests. It basically is because our parallel threads were not deamon threads. Would you like to review? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Prevent leaked threads from hanging Java test forks in CI
Several Java test suites (most visibly
**.component.c**anddata.misc/lineage) intermittently ran until the GitHub Actions job timeout even though the tests themselves had completed. The cause was leaked non-daemon threads keeping the surefire fork JVM alive, so the fork never exited and the job stalled until cancelled. There were two sources: in-JVM federated workers (FederatedWorker's Netty event loops and the test-side worker wrapper threads were non-daemon), andCommonThreadPool's fallback pools — when called off the main thread, it returnedExecutors.newFixedThreadPool/newCachedThreadPool, which default to non-daemon threads, while only the ForkJoinPool-backed variants were already daemon.This PR makes those threads daemon at the source:
FederatedWorkernow creates its Netty event-loop groups with a daemon thread factory, andCommonThreadPoolroutes its fixed/cached fallbacks through one too, so daemon behavior is uniform across all pool variants. On the test side,AutomatedTestBasemarks spawned worker threads as daemon,TestUtils.shutdownThreadbounds itsjoin(30s, warns on stragglers, restores the interrupt flag), and the lineage tests (LineageFedReuseAlg,FedFullReuseTest,FedUDFReuseTest) now shut workers down in afinallyblock so failures no longer leak workers (the large line counts there are just reindentation from thetry/finallywrap). ThejavaTests.ymljob cap stays at 30 minutes, with a comment documenting why it sits above the 600s per-fork surefire timeout, which remains the backstop for genuine hangs.