utils: Make RedisSemaphore.acquire atomic#2419
Open
ideaship wants to merge 1 commit into
Open
Conversation
RedisSemaphore.acquire() checked capacity and took a slot in two
separate Redis round-trips:
if self.redis.zcard(self.key) < self.maxsize: # CHECK
self.redis.zadd(self.key, {identifier: now}) # USE
This is a time-of-check-to-time-of-use race. Two clients can both run
ZCARD, both observe a free slot, and both proceed to ZADD, leaving the
sorted set with more than maxsize members. The semaphore that bounds
concurrent NetBox connections (create_netbox_semaphore, maxsize =
NETBOX_MAX_CONNECTIONS) therefore over-admits under contention -- a
local 50-client harness peaked at 22 holders against maxsize 5.
Replace the check-and-acquire with a single server-side Lua EVAL that
performs the expired-holder cleanup, the capacity check, and the slot
reservation in one atomic step, so no other client can interleave
between the check and the reservation.
The same change fixes a second, opposite defect in the method: the
cleanup boundary (now - 60) was computed once before the retry loop, so
it never advanced while a caller waited and holders that expired during
the wait were never reclaimed, causing spurious timeouts. now is now
recomputed on every iteration and passed to the script, so the reclaim
boundary advances with wall-clock time. The 60-second window is named
HOLDER_EXPIRY.
The race itself requires genuinely concurrent clients against a real
Redis and is not reproducible with a mock; it was validated locally
with a threaded real-Redis harness. The committed regression tests run
the production Lua script against in-memory fakeredis (added as a dev
dependency, with lupa for Lua support) and pin the two invariants the
fix guarantees: the holder set never exceeds maxsize, and the reclaim
boundary advances across retries so a slot freed mid-wait is granted.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Roger Luethi <luethi@osism.tech>
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Consider using a monotonic clock (e.g.
time.monotonic()) for the retry loop andnowpassed into the Lua script to avoid issues if the system clock is adjusted during acquisition retries. - If this Lua script will be invoked frequently, you may want to load it and use
EVALSHAinstead ofevalto reduce script parsing overhead on Redis and make it easier to reuse acrossRedisSemaphoreinstances.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider using a monotonic clock (e.g. `time.monotonic()`) for the retry loop and `now` passed into the Lua script to avoid issues if the system clock is adjusted during acquisition retries.
- If this Lua script will be invoked frequently, you may want to load it and use `EVALSHA` instead of `eval` to reduce script parsing overhead on Redis and make it easier to reuse across `RedisSemaphore` instances.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
RedisSemaphore.acquire()checked capacity and took a slot in twoseparate Redis round-trips:
This is a time-of-check-to-time-of-use race: two clients can both run
ZCARD, both observe a free slot, and bothZADD, leaving the sorted setwith more than
maxsizemembers. The semaphore that bounds concurrentNetBox connections (
create_netbox_semaphore,maxsize = NETBOX_MAX_CONNECTIONS, default 5) therefore over-admits under contention— exactly when the cap matters most. A local 50-client harness peaked at
22 holders against maxsize 5.
A second, opposite defect lived in the same method: the cleanup boundary
now - 60was computed once before the retry loop, so it never advancedwhile a caller waited. Holders that expired during the wait were never
reclaimed, causing spurious timeouts when capacity had actually freed.
Fix
Replace the check-and-acquire with a single server-side Lua
EVALthatperforms expired-holder cleanup, the capacity check, and the slot
reservation in one atomic step, so no other client can interleave between
the check and the reservation. Because
nowis recomputed each iterationand passed into the script, the reclaim boundary advances with wall-clock
time — fixing the second defect in the same change. The 60-second window
is named
HOLDER_EXPIRY.Testing
The race needs genuinely concurrent clients against a real Redis and is
not reproducible with a mock. It was validated locally with a threaded
real-Redis harness (over-admit on the old code, never on the fixed code).
The committed regression tests run the production Lua script against
in-memory
fakeredis(added as a dev dependency, withlupafor Luasupport) and pin the two invariants: the holder set never exceeds
maxsize, and the reclaim boundary advances across retries so a slotfreed mid-wait is granted.
Notes
production defect in code that PR does not touch, so it is fixed
separately:
cleanup boundary); whichever lands second will need those assertions
reconciled against this change.
🤖 Generated with Claude Code