Disconnect non-consuming IPC subscribers to fix publisher memory leak (#68114)#69436
Disconnect non-consuming IPC subscribers to fix publisher memory leak (#68114)#69436dwoz wants to merge 3 commits into
Conversation
|
Context on prior work in this area (none of it surfaced in PR-list / timeline searches because it landed as direct-to-branch commits rather than as PRs):
This PR is a clean re-implementation of 4c0e552, without the merge-conflict damage, plus:
Default timeout bumped from 15s (original) to 30s — more headroom for legitimately slow consumers under load — but is configurable and |
IPCMessagePublisher.publish() spawned an unbounded number of stream.write() coroutines per subscriber per message. If a subscriber stopped consuming, every following publish queued another _write coroutine that awaited a stream.write() that would never complete; the pending writes (and their referenced payloads) accumulated in the publisher's ioloop and inflated the master EventPublisher's RSS without bound -- the "non consuming IPC clients" leak. Wrap _write() with the new ipc_write_timeout opt (default 30s, 0 disables) via salt.ext.tornado.gen.with_timeout. On timeout, close the stream and discard it from self.streams so the publisher reclaims memory and stops spawning writers for that client. publish() now also skips streams that still have a pending write buffer, which prevents piling up more coroutines on a slow subscriber before the timeout fires. Documented on master and minion config pages and added a changelog fragment. Fixes saltstack#68114
A stale `.. conf_minion:: ipc_write_timeout` block (Default: 15, versionadded:: 3006.11) was inherited from origin/3006.x as a carryover from the reverted 4c0e552 attempt. It collides with this PR's new `.. conf_minion:: ipc_write_timeout` entry in minion.rst, causing Sphinx -W to abort with a duplicate description warning and cascading failures across the docs build and 28+ dependent CI jobs. The correct master-side entry already exists at master.rst:1077 (`.. conf_master:: ipc_write_timeout`, Default: 30) matching salt.defaults.IPC_WRITE_TIMEOUT. Remove only the stale minion directive block; no other changes.
…any buffered byte The first cut of the saltstack#68114 leak fix skipped publish() for any stream with _write_buffer_size > 0. That predicate fires on a consumer that is only momentarily behind during an event burst, so tests/pytests/unit/utils/event/ test_event.py::test_event_many_backlog (500-event burst) lost event saltstack#279 deterministically on Photon 5 Arm64 FIPS and Rocky 8 Arm64. Replace the buffer-byte skip with a per-stream counter of in-flight _write coroutines, capped by a new ipc_publisher_pending_writes opt (default 10000). Each pending write holds a payload reference; bounding the count is what actually prevents the leak. Consumers that are momentarily behind get the full burst, while a truly non-consuming subscriber stops accumulating new writes once the cap is reached -- the oldest pending write then trips ipc_write_timeout and the subscriber is dropped (existing behavior). Add a regression test that publishes a heavy burst at a stuck subscriber and asserts the per-stream pending count never exceeds the configured cap. The existing saltstack#68114 leak-drop test still passes, confirming the memory fix's intent is preserved. Refs saltstack#68114
|
Superseded by #69227 (merged as fa7e51f). #69227 converted That structural change removes the surface this PR was patching:
Verified |
What does this PR do?
IPCMessagePublishernow disconnects a subscriber that does not consumemessages within a configurable timeout (
ipc_write_timeout, default 30s).This stops the master event publisher from accumulating pending
stream.write()coroutines (and their referenced payloads) on a stalledIPC client, which was a real source of unbounded RSS growth.
What issues does this PR fix or reference?
Fixes #68114
Previous Behavior
IPCMessagePublisher.publish()didself.io_loop.spawn_callback(self._write, stream, pack)for everysubscriber on every published message.
_writethen awaitedstream.write(pack)with no timeout. A subscriber that stopped readingcaused every following publish to queue another
_writecoroutine thatawaited a write that never completed; the pending writes and their
payloads piled up in the ioloop and the publisher's RSS grew without
bound.
New Behavior
_writewrapsstream.write(pack)insalt.ext.tornado.gen.with_timeout(...)using the newipc_write_timeoutopt (seconds; default 30,
0disables the timeout for legacy behavior).When a single write does not complete in time, the publisher logs a
warning, closes the offending stream, and removes it from
self.streams.publish()also skips streams that still have a pending write buffer sowe no longer pile up more
_writecoroutines on a slow client while theexisting one is waiting to time out.
The opt is documented on the master and minion configuration pages and a
changelog fragment was added.
Merge requirements satisfied?
test:full)tests/pytests/unit/transport/test_ipc.py::test_ipc_publisher_drops_non_consuming_client_68114)changelog/68114.fixed.md)Commits signed with GPG?
Yes