Skip to content

fix(keys): avoid duplicate metrics from KeyIndex desync#14

Merged
AlinsRan merged 3 commits into
api7:mainfrom
AlinsRan:fix/keyindex-duplicate-metrics
Jun 23, 2026
Merged

fix(keys): avoid duplicate metrics from KeyIndex desync#14
AlinsRan merged 3 commits into
api7:mainfrom
AlinsRan:fix/keyindex-duplicate-metrics

Conversation

@AlinsRan

@AlinsRan AlinsRan commented Jun 22, 2026

Copy link
Copy Markdown

Problem

KeyIndex could emit the same key more than once, which surfaces downstream as duplicate Prometheus metrics. This fixes the upstream report apache/apisix#11934 ("apisix reports duplicate metrics"), whose root cause lives in this fork rather than in apisix itself.

Root cause

Two issues combine:

  1. KeyIndex:list() iterated the stored key values rather than the index. It looped over self.keys (keyed by slot number) and copied every value. self.keys can transiently hold the same key value at two different slots — e.g. when an expired metric is re-added at a new slot before the old slot is reclaimed — so the duplicated value was emitted twice.

    for _, v in pairs(self.keys) do
      copy[i] = v
      i = i + 1
    end
  2. delete_count was not bumped when an expired metric was re-added. In add(), when a key whose shared-dict entry had already expired is re-added, the code cleared the local index/keys/expire_keys but did not clear the dict slot or bump delete_count:

    if not ok then
      local idx = self.index[key]
      self.index[key] = nil
      self.keys[idx] = nil
      self.expire_keys[idx] = nil
      expired = true
    end

    Because delete_count was unchanged, other workers never triggered a full sync (KeyIndex:sync only does a full re-sync when self.deleted ~= delete_count). They kept the stale slot in their local self.keys while the metric was re-added at a new slot — desynchronizing the index and causing the duplicate emission described above.

Fix

  1. list() now walks the slots but only emits a key at the slot the index currently points at (self.index[key] == idx), so each key is listed exactly once while preserving ordering.
  2. add() now mirrors remove() on the expired-re-add path: it clears the dict slot, increments the local deleted counter and bumps delete_count, so other workers do a full sync and reclaim the stale slot.

Public API and observable behavior are unchanged.

Test

Adds TestKeyIndex:testExpiredReAddNoDuplicate, which:

  • adds a key with an exptime and syncs,
  • lets it expire in the shared dict without going through remove() (so key_count/delete_count are untouched and the next sync() is a no-op, leaving the index pointing at the vanished slot),
  • re-adds the key (hitting the expired branch),
  • asserts delete_count was bumped and that list() returns the key exactly once.

Verified fail-before / pass-after:

  • Unpatched: delete_count == nil and list() returns ["expkey", "expkey"] (test fails).
  • Patched: delete_count == 1 and list() returns ["expkey"] (test passes).

Full unit suite and luacheck (per CI config) pass; the only remaining failure is the pre-existing, unrelated TestPrometheus.testPrintfTable (a luaunit table-formatting difference present on main before this change).

Follow-up

Once a release is cut, an apisix-side rockspec bump will follow to pull in the fixed version.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed a case where re-adding an already-expired key could lead to duplicate metric/list emissions.
    • Improved key enumeration to avoid duplicates when stale slot data is present, ensuring each active key is returned exactly once per listing and stays consistent across workers.
  • Tests
    • Added regression coverage for re-adding keys after expiration (without an explicit remove), including multi-worker deduplication verification.
    • Added a direct test to confirm listings de-duplicate when the same key temporarily appears in multiple slots.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 241feed3-8d5f-48bd-8f1f-5e12eab2be86

📥 Commits

Reviewing files that changed from the base of the PR and between 5d44240 and 1d01057.

📒 Files selected for processing (2)
  • prometheus_keys.lua
  • prometheus_test.lua

📝 Walkthrough

Walkthrough

KeyIndex:list() is changed from iterating self.keys via pairs() to walking numeric slots 0..self.last and emitting only on canonical index matches. KeyIndex:add() now clears the stale shared-dict slot and increments delete_count when expire() fails on an expired key. Two regression tests cover both the full expiry-and-re-add scenario and direct list deduplication.

Changes

Duplicate Key Emission Fix

Layer / File(s) Summary
KeyIndex deduplication logic
prometheus_keys.lua
list() traverses slots 0..self.last and checks self.index[key] == idx to emit each key exactly once. add() on the expired-key path clears the stale slot, increments delete_count, and allocates a new slot to force cross-worker resync.
Test helper and regression tests
prometheus_test.lua
SimpleDict:expire returns true. testExpiredReAddNoDuplicate validates the full expired re-add scenario with new slot, incremented delete_count, and de-duplicated output. testListDeduplicatesStaleSlot directly asserts list deduplication when slots contain duplicates.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
E2e Test Quality Review ⚠️ Warning Missing E2E test (Blocking). Tests verify KeyIndex.list() deduplication at unit level but not the actual duplicate metrics symptom in Prometheus.collect() output. Other criteria (readability, error... Add E2E test: create/increment Prometheus metric with exptime, let it expire, re-add it, call collect(), verify metric appears exactly once in output (not duplicated).
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(keys): avoid duplicate metrics from KeyIndex desync' directly and clearly summarizes the main change: fixing a bug that caused duplicate Prometheus metrics due to KeyIndex desynchronization.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Check ✅ Passed No security vulnerabilities detected. PR fixes a shared resource synchronization bug; contains no credential/secret leakage, DB operations, auth bypasses, TLS misconfigurations, or secret reference...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

KeyIndex could emit the same key more than once, surfacing as duplicate
metrics (apache/apisix#11934). Two issues combine to cause this:

1. KeyIndex:list() iterated the stored key values (self.keys), which can
   transiently hold the same key value at two different slots when an
   expired metric is re-added at a new slot before the old slot is
   reclaimed. The duplicated value was then emitted twice. list() now
   walks the slots but only emits a key at the slot the index points at
   (self.index[key] == idx), so each key is listed exactly once while
   preserving ordering.

2. When add() re-adds a key whose shared-dict entry had already expired,
   it cleared the local index/keys but did not bump delete_count or clear
   the dict slot. Other workers therefore never did a full sync and kept
   the stale slot in their local self.keys, desynchronizing the index.
   add() now mirrors remove(): it clears the dict slot, increments the
   local deleted counter and bumps delete_count.

Adds a regression test (TestKeyIndex:testExpiredReAddNoDuplicate) that
reproduces the duplicate emission before the fix and passes after.
@AlinsRan AlinsRan force-pushed the fix/keyindex-duplicate-metrics branch from 5cbbc18 to 57ec06c Compare June 22, 2026 03:02
@AlinsRan AlinsRan marked this pull request as ready for review June 22, 2026 05:33

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@prometheus_test.lua`:
- Line 817: The KeyIndex:add() method call on self.key_index is not validating
the return value directly. Instead of relying on checking ngx.logs which is
global state and can be stale, capture the return value from
self.key_index:add("expkey", "eviction_err", 1) and assert it directly to verify
the operation succeeded or failed as expected. Apply the same fix to the similar
calls around lines 831-832.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c3b895e6-734f-4c98-bdd8-150cbbf6348d

📥 Commits

Reviewing files that changed from the base of the PR and between f86aa3e and 57ec06c.

📒 Files selected for processing (2)
  • prometheus_keys.lua
  • prometheus_test.lua

Comment thread prometheus_test.lua Outdated
AlinsRan added 2 commits June 22, 2026 13:51
Address review: KeyIndex:add() reports failures via its return value, so
assert it directly instead of checking the global ngx.logs which can be
stale/unrelated.
…t rc

Address review feedback on the duplicate-metrics fix:

- list(): iterate self.keys via pairs() instead of scanning 0..self.last.
  self.last grows monotonically with every add and is never reclaimed, so
  the slot-range scan was O(total slots ever created) on long-lived,
  high-churn workers. pairs() keeps it O(live keys) with the same
  index-based de-duplication. It also avoids a latent miss: sync() can
  reset self.last to key_count, below the highest populated local slot.

- add() expired path: only treat the key as expired when expire() returns
  err == "not found". A non-"not found" failure no longer deletes a slot
  that may still be live (which would now also drop the metric); it is
  logged and the existing slot is kept to avoid creating a duplicate.

- add() expired path: check the delete_count incr() return value and
  propagate eviction/forcible like remove() does, per the shared-dict
  error-handling convention. Drop the redundant dict:set(slot, nil): when
  expire() reports "not found" the slot is already gone.

- tests: fix SimpleDict:expire to return true on success (matches
  ngx.shared.DICT); add a second-worker convergence assertion to the
  regression test; add testListDeduplicatesStaleSlot to isolate the
  list() guard.
@AlinsRan AlinsRan requested review from membphis and nic-6443 June 22, 2026 09:28

@membphis membphis left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. No blocking issues found.

@AlinsRan AlinsRan merged commit 845084f into api7:main Jun 23, 2026
4 checks passed
@AlinsRan AlinsRan deleted the fix/keyindex-duplicate-metrics branch June 23, 2026 06:22
AlinsRan added a commit to AlinsRan/apisix that referenced this pull request Jun 24, 2026
… duplicate metrics

nginx-lua-prometheus-api7 0.20260623 fixes duplicate metric emission caused by
KeyIndex desync (api7/nginx-lua-prometheus#14): when a TTL'd metric expires in
the shared dict and is re-added at a new slot, a stale slot lingering in another
worker's local key index makes the same series be listed twice -- the repeated
apisix_http_status{...} lines reported in apache#11934.

The desync is cross-worker (it surfaces with multiple workers/pods, as in the
report) and is covered by the library's own unit test; it cannot be reproduced
by a single-worker APISIX integration test, so no t/ case is added here.

Fixes apache#11934
AlinsRan added a commit to AlinsRan/apisix that referenced this pull request Jun 24, 2026
… duplicate metrics

nginx-lua-prometheus-api7 0.20260623 fixes duplicate metric emission caused by
KeyIndex desync (api7/nginx-lua-prometheus#14): when a TTL'd metric expires in
the shared dict and is re-added at a new slot, a stale slot lingering in a
worker's local key index makes the same series be listed twice -- the repeated
apisix_http_status{...} lines reported in apache#11934.

The desync is normally cross-worker, but it reproduces deterministically in a
single Lua VM. t/plugin/prometheus-metric-expire.t drives prometheus_keys into
the two-slots-one-key state and asserts KeyIndex:list() emits the key exactly
once: it returns 2 against 0.20250302 (the bug) and 1 against 0.20260623.

Fixes apache#11934
AlinsRan added a commit to AlinsRan/apisix that referenced this pull request Jun 24, 2026
… duplicate metrics

nginx-lua-prometheus-api7 0.20260623 fixes duplicate metric emission caused by
a KeyIndex desync (api7/nginx-lua-prometheus#14): when a TTL'd metric expires
in the shared dict and is re-added at a new slot while a stale slot lingers
before another worker's incremental-sync cursor, that worker lists the same
series twice -- the repeated apisix_http_status{...} lines reported in apache#11934.

The fix is verified by the library's own unit tests
(testListDeduplicatesStaleSlot and testExpiredReAddNoDuplicate in
prometheus_test.lua). The duplicate is a cross-worker race that only arises
from a specific shared-dict sync interleaving and is not reproducible by a
single-worker (or even a naive multi-worker) APISIX integration test -- the
shared-dict sync reconciles the stale slot away in any straightforward
sequence -- so no t/ case is added here.

Fixes apache#11934
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants