Skip to content

SOLR-18284: Load-average and memory circuit breakers no longer stampede or trip on transient pre-GC heap peaks#4514

Open
markrmiller wants to merge 1 commit into
apache:mainfrom
markrmiller:cb-load-heap-improvements
Open

SOLR-18284: Load-average and memory circuit breakers no longer stampede or trip on transient pre-GC heap peaks#4514
markrmiller wants to merge 1 commit into
apache:mainfrom
markrmiller:cb-load-heap-improvements

Conversation

@markrmiller

Copy link
Copy Markdown
Member

Load-average and memory circuit breakers were doing more harm than good under high concurrency.

LoadAverageCircuitBreaker called OperatingSystemMXBean.getSystemLoadAverage() on every request. The OS load average is a one-minute moving average, so re-polling it per-request is wasted work - and when many requests arrived concurrently, the syscall stampede hammered the CPU, which is the condition this breaker exists to prevent rather than cause.

MemoryCircuitBreaker sampled MemoryMXBean.getHeapMemoryUsage().getUsed() on a 30-second moving average. With a generational collector that signal climbs toward max between collections during normal operation; the breaker would trip on transient pre-GC peaks that GC was about to reclaim.

https://issues.apache.org/jira/browse/SOLR-18284

…de or trip on transient pre-GC heap peaks

Load-average and memory circuit breakers were doing more harm than good under high concurrency.

LoadAverageCircuitBreaker called OperatingSystemMXBean.getSystemLoadAverage() on every request. The OS load average is a one-minute moving average, so re-polling it per-request is wasted work - and when many requests arrived concurrently, the syscall stampede hammered the CPU, which is the condition this breaker exists to prevent rather than cause.

MemoryCircuitBreaker sampled MemoryMXBean.getHeapMemoryUsage().getUsed() on a 30-second moving average. With a generational collector that signal climbs toward max between collections during normal operation; the breaker would trip on transient pre-GC peaks that GC was about to reclaim.
@github-actions github-actions Bot added documentation Improvements or additions to documentation tests labels Jun 9, 2026

@gerlowskija gerlowskija left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks pretty close IMO. I have two remaining big questions:

  1. Do we really need to write TtlSampledMetric ourselves, or is there something off the shelf (e.g. Caffeine?) that we could use instead?
  2. Why the MemoryCB change to start considering various memory "pools"? If this is a bug, it'd be great to that acknowledged and a sentence or two on it somewhere for posterity to understand this part of the change down the road...

Note that those are "just" questions though; they may not even require changes so much as clarification 🤷

Looks great Mark!

@@ -0,0 +1,8 @@
# See https://github.com/apache/solr/blob/main/dev-docs/changelog.adoc
title: Load-average and memory circuit breakers no longer stampede or trip on transient pre-GC heap peaks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[0] "stampede"? "trip"? I understand what this is pointing to, but IMO there's not quite enough information here for a user reading our changelog to judge whether they're impacted or whether they need to read more.

IMO we'd be better off breaking this into two changelog entries (i.e. totally separate files) that give a tad more detail on the negative behavior users will no longer see.

e.g.

LoadAverageCircuitBreaker now uses cached metric values to avoid consuming unnecessary CPU at high RPS.

and...

MemoryCircuitBreaker now looks at post-GC heap utilization to avoid false-positives when the heap is filled with collectible garbage.

/**
* Single-flight, time-bounded cache around an expensive metric sample.
*
* <ul>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] What is this list representing and does it belong in Javadocs? AFAICT this class doesn't expose the fresh/stale/cold distinction in any way, so it's largely an implementation detail?

* the supplier. The {@code refreshing} flag is always released, so a thrown sampler does not wedge
* the single-flight latch.
*/
final class TtlSampledMetric<T> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] The implementation here is great and I applaud your concurrency code. But I'm a little surprised that we'd have to implement this functionality ourselves and that there's nothing "off the shelf" that we could use instead. Surely we're not the first folks out there to calculate primitives with a TTL.

Did you consider (and discard) the idea of using Caffeine or something similar here instead? Or might that be worth considering?

* Tracks the current JVM heap usage and triggers if a moving heap usage average over 30 seconds
* exceeds the defined percentage of the maximum heap size allocated to the JVM. Once the average
* memory usage goes below the threshold, it will start allowing queries again.
* Trips when post-collection live data in the JVM heap exceeds a configured percentage of the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[0] "collection" is a pretty overloaded term in Solr; consider rephrasing to indicate garbage-collection more explicitly

* pool. This is the only memory reading that distinguishes "live data" from "garbage waiting to be
* collected."
*
* <p>Earlier versions of this breaker sampled {@link MemoryMXBean#getHeapMemoryUsage()} on a

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Is there a reason that a javadocs reader would care what the behavior used to be? This feels like LLM output that could be probably be trimmed back

* pool; in that case the breaker reports {@code 0} live bytes and will not trip until the JVM has
* performed at least one collection on a heap pool.
*
* <p>The threshold semantics are unchanged: configure a percentage of the maximum heap size, and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[0] "semantics are unchanged" - Similar to my comment on L44 above, references to how things used to behave or past history are comprehensible in this PR but won't make much sense six months from now. Change-history is best left out of Javadocs IMO

* <p>Pool selection by collector:
*
* <ul>
* <li><b>G1 / Parallel / Serial / generational ZGC:</b> uses the pool whose name matches the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Can you provide a bit more context on this and the stuff at L75-L112 below? Specifically: what value does this differentiation serve? MemoryCircuitBreaker was agnostic of pool and GC-algorithm prior to this PR and there's nothing in SOLR-18284 that points to that being an issue.

Is this a fix for an additional bug that you found while working on this PR? If so, is that additional bug documented somewhere?

* synthetic value typically override this method directly (which bypasses the cache); tests that
* want to feed a synthetic sample <em>through</em> the cache should override {@link
* #samplePostGcLiveBytes()} instead. The implementation no longer averages anything.
*/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[0] Can you expand a bit more on the rationale for retaining the now-incorrect name? I understand the bit about having tests that override this method, but seemingly those tests could be updated if we decided to change the name of this method?

assertEquals(
"samplePostGcLiveBytes() must be invoked at most once per TTL window", 1, calls.get());
} finally {
System.setProperty(CircuitBreakerRegistry.SYSPROP_SAMPLE_TTL_MS, "0");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[0] This can probably be dropped given the @After method above that clears this between test methods (and the test-rule in SolrTestCase that clears it between test-classes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants