Skip to content

[MINOR][CI] Fix leaked threads hanging Java test forks#2488

Open
Baunsgaard wants to merge 5 commits into
apache:mainfrom
Baunsgaard:ci/increase-java-test-timeout
Open

[MINOR][CI] Fix leaked threads hanging Java test forks#2488
Baunsgaard wants to merge 5 commits into
apache:mainfrom
Baunsgaard:ci/increase-java-test-timeout

Conversation

@Baunsgaard

@Baunsgaard Baunsgaard commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Prevent leaked threads from hanging Java test forks in CI

Several Java test suites (most visibly **.component.c** and data.misc/lineage) intermittently ran until the GitHub Actions job timeout even though the tests themselves had completed. The cause was leaked non-daemon threads keeping the surefire fork JVM alive, so the fork never exited and the job stalled until cancelled. There were two sources: in-JVM federated workers (FederatedWorker's Netty event loops and the test-side worker wrapper threads were non-daemon), and CommonThreadPool's fallback pools — when called off the main thread, it returned Executors.newFixedThreadPool/newCachedThreadPool, which default to non-daemon threads, while only the ForkJoinPool-backed variants were already daemon.

This PR makes those threads daemon at the source: FederatedWorker now creates its Netty event-loop groups with a daemon thread factory, and CommonThreadPool routes its fixed/cached fallbacks through one too, so daemon behavior is uniform across all pool variants. On the test side, AutomatedTestBase marks spawned worker threads as daemon, TestUtils.shutdownThread bounds its join (30s, warns on stragglers, restores the interrupt flag), and the lineage tests (LineageFedReuseAlg, FedFullReuseTest, FedUDFReuseTest) now shut workers down in a finally block so failures no longer leak workers (the large line counts there are just reindentation from the try/finally wrap). The javaTests.yml job cap stays at 30 minutes, with a comment documenting why it sits above the 600s per-fork surefire timeout, which remains the backstop for genuine hangs.

The component.c** suite cannot finish within the 30-minute GitHub
Actions job cap. The cap cancels the whole job before surefire's
per-fork timeout (test-forkedProcessTimeout, 600s) can kill hung forks
and run the remaining tests in the suite.

Raise the job cap to 60 minutes so surefire's kill-and-continue has
room to complete the suite instead of GitHub Actions cancelling the job
mid-run.
@github-project-automation github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 11, 2026
@Baunsgaard Baunsgaard changed the title Raise Java test job timeout to 60 minutes [MINOR][CI] Raise Java test job timeout to 60 minutes Jun 11, 2026
In-JVM federated workers (used by the lineage federated tests) ran on
non-daemon threads, and their Netty boss/worker event loops were also
non-daemon. When a test failed before calling shutdownThreads(...), the
leaked worker kept the surefire fork JVM alive until the CI job timeout
cancelled the whole job.

- FederatedWorker: create the Netty event-loop groups with daemon
  threads. A standalone worker process keeps the JVM alive via its main
  thread, so this only affects in-JVM (test) workers, where it lets a
  leaked worker no longer block JVM exit.
- AutomatedTestBase: mark the in-JVM worker wrapper threads as daemon.
- TestUtils.shutdownThread: bound the join() so cleanup cannot block
  indefinitely if a worker ignores the interrupt.
- LineageFedReuseAlg, FedFullReuseTest, FedUDFReuseTest: shut down
  workers in a finally block so they are reaped on the failure path.
CommonThreadPool.get(k) returns a ForkJoinPool (daemon threads) for
main/PARFOR/FedExec threads, but falls back to a plain
Executors.newFixedThreadPool(k) for any other caller thread, and
getDynamicPool() uses Executors.newCachedThreadPool(); both default to
non-daemon threads.

Under parallel test execution the test thread is not the master thread,
so component tests that compress directly hit the fallback path. A pool
left unshut (e.g. stored as a field or skipped on an exception path)
then keeps its non-daemon threads alive and blocks the surefire fork
JVM from exiting, stalling the suite until the CI job timeout.

Make both fallback pools use daemon threads, matching the ForkJoinPool
paths. Daemon-ness only affects JVM shutdown: pending work keeps the
submitting thread alive, so this cannot terminate in-flight tasks early.
@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.42%. Comparing base (17803b1) to head (ae14a16).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2488      +/-   ##
============================================
+ Coverage     71.41%   71.42%   +0.01%     
- Complexity    48817    48844      +27     
============================================
  Files          1572     1572              
  Lines        189089   189115      +26     
  Branches      37101    37106       +5     
============================================
+ Hits         135030   135077      +47     
+ Misses        43596    43572      -24     
- Partials      10463    10466       +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Restore the 30-minute job cap now that the underlying thread-leak hangs
are fixed (daemon Netty event loops, daemon CommonThreadPool fallback
pools, and bounded test worker cleanup). The 600s per-fork surefire
timeout still operates well within this cap, so a genuinely hung fork is
killed by surefire before GitHub Actions cancels the job, restoring fast
feedback on real hangs.
@Baunsgaard Baunsgaard changed the title [MINOR][CI] Raise Java test job timeout to 60 minutes [MINOR][CI] Fix leaked threads hanging Java test forks Jun 15, 2026
@Baunsgaard Baunsgaard requested a review from janniklinde June 15, 2026 14:52
@Baunsgaard

Copy link
Copy Markdown
Contributor Author

@janniklinde thanks to your points in the previous commit, I found the source of the issue for timeouts in many tests. It basically is because our parallel threads were not deamon threads.

Would you like to review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

1 participant