Skip to content

utils: Make RedisSemaphore.acquire atomic#2419

Open
ideaship wants to merge 1 commit into
mainfrom
fix/redis-semaphore-acquire-atomic
Open

utils: Make RedisSemaphore.acquire atomic#2419
ideaship wants to merge 1 commit into
mainfrom
fix/redis-semaphore-acquire-atomic

Conversation

@ideaship

Copy link
Copy Markdown
Contributor

Problem

RedisSemaphore.acquire() checked capacity and took a slot in two
separate Redis round-trips:

if self.redis.zcard(self.key) < self.maxsize:   # CHECK
    self.redis.zadd(self.key, {identifier: now}) # USE

This is a time-of-check-to-time-of-use race: two clients can both run
ZCARD, both observe a free slot, and both ZADD, leaving the sorted set
with more than maxsize members. The semaphore that bounds concurrent
NetBox connections (create_netbox_semaphore, maxsize = NETBOX_MAX_CONNECTIONS, default 5) therefore over-admits under contention
— exactly when the cap matters most. A local 50-client harness peaked at
22 holders against maxsize 5.

A second, opposite defect lived in the same method: the cleanup boundary
now - 60 was computed once before the retry loop, so it never advanced
while a caller waited. Holders that expired during the wait were never
reclaimed, causing spurious timeouts when capacity had actually freed.

Fix

Replace the check-and-acquire with a single server-side Lua EVAL that
performs expired-holder cleanup, the capacity check, and the slot
reservation in one atomic step, so no other client can interleave between
the check and the reservation. Because now is recomputed each iteration
and passed into the script, the reclaim boundary advances with wall-clock
time — fixing the second defect in the same change. The 60-second window
is named HOLDER_EXPIRY.

Testing

The race needs genuinely concurrent clients against a real Redis and is
not reproducible with a mock. It was validated locally with a threaded
real-Redis harness (over-admit on the old code, never on the fixed code).
The committed regression tests run the production Lua script against
in-memory fakeredis (added as a dev dependency, with lupa for Lua
support) and pin the two invariants: the holder set never exceeds
maxsize, and the reclaim boundary advances across retries so a slot
freed mid-wait is granted.

Notes

  • Surfaced while reviewing the test-only PR below; this is a pre-existing
    production defect in code that PR does not touch, so it is fixed
    separately:
  • That PR adds tests pinning the previous behavior (including the frozen
    cleanup boundary); whichever lands second will need those assertions
    reconciled against this change.

🤖 Generated with Claude Code

RedisSemaphore.acquire() checked capacity and took a slot in two
separate Redis round-trips:

    if self.redis.zcard(self.key) < self.maxsize:   # CHECK
        self.redis.zadd(self.key, {identifier: now}) # USE

This is a time-of-check-to-time-of-use race. Two clients can both run
ZCARD, both observe a free slot, and both proceed to ZADD, leaving the
sorted set with more than maxsize members. The semaphore that bounds
concurrent NetBox connections (create_netbox_semaphore, maxsize =
NETBOX_MAX_CONNECTIONS) therefore over-admits under contention -- a
local 50-client harness peaked at 22 holders against maxsize 5.

Replace the check-and-acquire with a single server-side Lua EVAL that
performs the expired-holder cleanup, the capacity check, and the slot
reservation in one atomic step, so no other client can interleave
between the check and the reservation.

The same change fixes a second, opposite defect in the method: the
cleanup boundary (now - 60) was computed once before the retry loop, so
it never advanced while a caller waited and holders that expired during
the wait were never reclaimed, causing spurious timeouts. now is now
recomputed on every iteration and passed to the script, so the reclaim
boundary advances with wall-clock time. The 60-second window is named
HOLDER_EXPIRY.

The race itself requires genuinely concurrent clients against a real
Redis and is not reproducible with a mock; it was validated locally
with a threaded real-Redis harness. The committed regression tests run
the production Lua script against in-memory fakeredis (added as a dev
dependency, with lupa for Lua support) and pin the two invariants the
fix guarantees: the holder set never exceeds maxsize, and the reclaim
boundary advances across retries so a slot freed mid-wait is granted.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Roger Luethi <luethi@osism.tech>

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • Consider using a monotonic clock (e.g. time.monotonic()) for the retry loop and now passed into the Lua script to avoid issues if the system clock is adjusted during acquisition retries.
  • If this Lua script will be invoked frequently, you may want to load it and use EVALSHA instead of eval to reduce script parsing overhead on Redis and make it easier to reuse across RedisSemaphore instances.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider using a monotonic clock (e.g. `time.monotonic()`) for the retry loop and `now` passed into the Lua script to avoid issues if the system clock is adjusted during acquisition retries.
- If this Lua script will be invoked frequently, you may want to load it and use `EVALSHA` instead of `eval` to reduce script parsing overhead on Redis and make it easier to reuse across `RedisSemaphore` instances.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Ready

Development

Successfully merging this pull request may close these issues.

2 participants