Clock-usage audit: wall vs monotonic time issues across the SDK #5530

@runningcode

Audit of clock usage across sentry-java (as of b988b37): wall clock used to measure intervals, monotonic time converted to dates, and the dual-semantics ICurrentDateProvider interface. Each finding is classified actual bug vs theoretical with an urgency rating.

Found: 1 design flaw (root cause), 4 actual bugs, ~10 theoretical issues. Notably, every current ICurrentDateProvider pairing turned out to be correct — but only via careful manual wiring that nothing enforces.


A. Root design hazard

A1. ICurrentDateProvider has two implementations with different clock semantics — Design flaw · HIGH

CurrentDateProvider returns System.currentTimeMillis() (wall/epoch); AndroidCurrentDateProvider returns SystemClock.uptimeMillis() (monotonic, pauses in deep sleep) — same method name getCurrentTimeMillis().

Every consumer must hand-pick the impl matching whatever it compares against:

  • AnrV2Integration / TombstoneIntegration deliberately use the wall one (warning comment: "AppExitInfo uses System.currentTimeMillis") to compare against ApplicationExitInfo.getTimestamp().
  • LifecycleWatcher needs wall because it seeds lastUpdatedSession from Session.getStarted().getTime() (epoch).
  • AndroidEnvelopeCache needs uptime because it subtracts TimeSpan.getStartUptimeMs() (uptime base) for startup-crash detection.
  • ANRWatchDog bypasses both with an inline () -> SystemClock.uptimeMillis() lambda.
  • RateLimiter and ReplayIntegration get the wall one even on Android (correct — both need epoch).

A wrong wiring is silent (tests inject fakes) and catastrophic — e.g. uptime fed into Timer.schedule(task, absoluteDate) produces a 1970 date that fires instantly.

Fix: split into two explicitly-named interfaces (e.g. wall epochMillis() vs monotonic elapsedMillis()), or rename methods per impl so a mismatch can't compile.
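A minimal sketch of what such a split could look like. The names `WallClock`, `MonotonicClock`, `SystemWallClock`, `NanoTimeMonotonicClock` and `ClockConsumers` are hypothetical and do not exist in the SDK; the point is only that a consumer declares which semantics it needs in its parameter type, so handing it the other clock fails at compile time.

```java
import java.util.Date;

/** Wall clock: epoch milliseconds; may jump when the system clock is adjusted. */
interface WallClock {
  long epochMillis();
}

/** Monotonic clock: only the difference between two readings is meaningful. */
interface MonotonicClock {
  long elapsedMillis();
}

final class SystemWallClock implements WallClock {
  @Override
  public long epochMillis() {
    return System.currentTimeMillis();
  }
}

/** JVM stand-in; an Android variant would presumably wrap SystemClock.uptimeMillis(). */
final class NanoTimeMonotonicClock implements MonotonicClock {
  @Override
  public long elapsedMillis() {
    return System.nanoTime() / 1_000_000L;
  }
}

/** Hypothetical consumers: each one states the clock semantics it requires. */
final class ClockConsumers {
  private ClockConsumers() {}

  /** Absolute dates can only be derived from a wall clock. */
  static Date absoluteDeadline(WallClock clock, long delayMillis) {
    return new Date(clock.epochMillis() + delayMillis);
  }

  /** Intervals can only be measured against a monotonic clock. */
  static long elapsedSince(MonotonicClock clock, long startElapsedMillis) {
    return clock.elapsedMillis() - startElapsedMillis;
  }
}
```

With this shape, `ClockConsumers.absoluteDeadline(new NanoTimeMonotonicClock(), 30_000L)` does not compile, which is the property the current single-method interface cannot give.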


B. Actual bugs

B1. java.util.Timer for transaction idle/deadline timeouts — MEDIUM-HIGH

SentryTracer.java#L102 (also scheduleFinish/scheduleDeadlineTimeout). java.util.Timer deadlines are wall-clock based, and the Timer's internal Object.wait() does not progress during Android deep sleep:

  • App backgrounded mid-ui.load → device sleeps before the 30s deadline fires → timer fires at wake (potentially hours later); forceFinish stamps unfinished spans with dateProvider.now() → multi-hour transactions/spans (the classic "absurdly long ui.load transaction" artifact).
  • Wall-clock steps (NTP/user) shift firing on any platform.

Fix: schedule on SentryExecutorService (nanoTime-based delays); clamp finish timestamps when the deadline fires late.
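The clamping half of that fix can be sketched as a pure function. `DeadlineClamp` and its parameter names are hypothetical, and the sketch assumes all three values are in the same time base (for example epoch milliseconds of the transaction start, the configured deadline, and the time at which the callback actually ran).

```java
/** Hypothetical helper: bounds the finish timestamp when a deadline callback fires late. */
final class DeadlineClamp {
  private DeadlineClamp() {}

  /**
   * Returns the timestamp to stamp on force-finished spans.
   *
   * @param startMillis when the transaction started
   * @param deadlineMillis configured deadline, relative to the start
   * @param firedAtMillis when the deadline callback actually ran
   */
  static long clampFinishMillis(long startMillis, long deadlineMillis, long firedAtMillis) {
    long latestAllowed = startMillis + deadlineMillis;
    // Never finish before the start (backward clock step)...
    long notBeforeStart = Math.max(firedAtMillis, startMillis);
    // ...and never later than start + deadline (late fire after sleep).
    return Math.min(notBeforeStart, latestAllowed);
  }
}
```

With a 30s deadline, a callback that runs three hours after the start would still produce a finish timestamp of start + 30s rather than a multi-hour span.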

B2. Session-end timer has the same Timer mechanics — MEDIUM

LifecycleWatcher.java#L121. Device sleeps within the 30s background window → session ends only at wake; Session.end() stamps wake time → inflated session durations in release health; replay stop() and ContinuousProfiler.close(false) also run hours late. The foreground check lastUpdatedSession + sessionIntervalMillis <= now is also a wall-clock interval (clock step → spurious or missed session rotation).

B3. Session Replay: wall clock used for all interval math — MEDIUM (actual bug when the clock steps mid-recording)

Epoch is required for RRWeb payload timestamps (that part is correct), but the same wall values also drive windows/durations:

  • SessionCaptureStrategy.kt#L81 (+L106, L162): segment durations and the 1h max-session deadline are now - startEpoch diffs.
  • BufferCaptureStrategy trim-to-last-30s and ReplayCache.createVideoOf iterate epoch-millis windows; frame files are named by epoch millis.
  • ReplayGestureConverter.kt#L56: timeOffset = now - touchMoveBaseline.

A backward step mid-recording → frames "newer than now": trim can wipe valid frames, segment windows miss/duplicate frames, gesture offsets go negative. Forward step → premature 1h cutoff. NTP/carrier/user steps on phones are realistic.

Fix: keep epoch in RRWeb payloads, drive windows/trim from a monotonic clock with one epoch anchor per segment.
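One way to read "one epoch anchor per segment" is sketched below. `AnchoredSegmentClock` is a hypothetical class, not part of the replay module; it assumes the wall clock is sampled exactly once when the segment starts and that every later value is derived from a monotonic nanosecond source (injected here so it can be faked).

```java
import java.util.function.LongSupplier;

/** Hypothetical per-segment clock: one wall-clock anchor, monotonic deltas afterwards. */
final class AnchoredSegmentClock {
  private final long anchorEpochMillis;
  private final long anchorNanos;
  private final LongSupplier nanoSource;

  /**
   * @param anchorEpochMillis wall-clock reading taken once at segment start
   * @param nanoSource monotonic source, e.g. System::nanoTime
   */
  AnchoredSegmentClock(long anchorEpochMillis, LongSupplier nanoSource) {
    this.anchorEpochMillis = anchorEpochMillis;
    this.nanoSource = nanoSource;
    this.anchorNanos = nanoSource.getAsLong();
  }

  /** Milliseconds since the segment started; unaffected by wall-clock steps. */
  long elapsedMillis() {
    return (nanoSource.getAsLong() - anchorNanos) / 1_000_000L;
  }

  /** Epoch timestamp for payloads: the anchor plus monotonic elapsed time. */
  long payloadTimestampMillis() {
    return anchorEpochMillis + elapsedMillis();
  }

  /** Window and deadline checks use elapsed time only, never a wall-clock diff. */
  boolean isPastDeadline(long maxDurationMillis) {
    return elapsedMillis() >= maxDurationMillis;
  }
}
```

Timestamps produced this way stay ordered within a segment even if the device clock steps backward, because the wall clock is never read again after the anchor.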

B4. Cron check-in durations measured with wall clock — LOW-MEDIUM

CheckInUtils.java#L64 and the same pattern in SentryCheckInAdvice (sentry-spring, -jakarta, -7): duration = currentTimeMillis() - start. Cron jobs run long → wide exposure to clock steps → wrong/negative durations. Pure interval → should be System.nanoTime().
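A minimal sketch of the nanoTime variant, assuming the check-in duration is reported in seconds as a double. `CheckInDuration` is a hypothetical helper; the start and end values are expected to come from `System.nanoTime()` captured around the job.

```java
/** Hypothetical helper: converts two System.nanoTime() readings into a duration. */
final class CheckInDuration {
  private CheckInDuration() {}

  /** Duration in seconds between two System.nanoTime() readings from the same JVM. */
  static double durationSeconds(long startNanos, long endNanos) {
    // Subtract first: nanoTime values are only meaningful as differences.
    return (endNanos - startNanos) / 1_000_000_000.0;
  }
}
```

Usage would look like `long start = System.nanoTime();` before the job and `CheckInDuration.durationSeconds(start, System.nanoTime())` after it, replacing the `currentTimeMillis() - start` pattern.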


C. Theoretical issues (need a clock step / edge condition)

  1. AndroidProfiler measurement re-anchoring · LOW-MEDIUM — AndroidProfiler.java#L309: timestampDiff (elapsedRealtime↔wall offset) computed once at profile end and applied to every wall-stamped PerformanceCollectionData sample → a wall step during the profile shifts all earlier CPU/memory samples relative to the trace. Fix: stamp samples with elapsedRealtimeNanos on Android.
  2. Session seq is raw epoch millis · LOW-MEDIUM — Session.java#L309: backward step between updates → newer update has smaller seq → server can discard the latest session state (lost end/error counts). Also calculateDurationTime uses Math.abs, masking negative durations. Fix: seq = max(prevSeq + 1, now).
  3. RateLimiter wall-clock deadlines · LOW — retry-after stored as epoch Dates (self-consistent), but a backward step silently extends the drop window; the "limit lifted" observer callback is Timer.schedule(task, absoluteDate) → shifted by steps (continuous profilers resume late).
  4. App-start anchor projection · LOW — TimeSpan (uptime + wall anchor) is the right pattern, but the wall anchor is captured once and setStartedAt() back-projects assuming no step since process start; NTP sync shortly after boot shifts app-start span timestamps relative to later-anchored spans.
  5. SpanFrameMetricsCollector.toNanoTime() · LOW — re-anchors wall-based SentryLongDates into the nanoTime base using the current offset; wrong by any step since the date was created, and across deep sleep → frames attributed to wrong span windows.
  6. Wall-clock TTLs/cleanup · LOW — HostnameCache 5h TTL; Sentry.classCreationTimestamp vs File.lastModified() for profiling-traces cleanup; CacheStrategy envelope rotation ordered by lastModified(); DefaultCompositePerformanceCollector 30s auto-stop via wall diff + sampling on java.util.Timer.
  7. InformationalBreadcrumb.compareTo orders by captured System.nanoTime() (restored-from-disk breadcrumbs get fresh nanos at parse → cross-restart ordering is parse-order); cross-type SentryDate arithmetic (SentryNanotimeDate vs SentryLongDate) silently degrades to ms-precision wall math.
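The fix proposed in item 2, seq = max(prevSeq + 1, now), can be sketched as follows. `SessionSeq` is a hypothetical helper and assumes the previous sequence number is available at update time.

```java
/** Hypothetical helper: keeps session sequence numbers strictly increasing. */
final class SessionSeq {
  private SessionSeq() {}

  /**
   * Follows the wall clock when it moves forward, but never returns a value
   * less than or equal to the previous one after a backward step.
   */
  static long next(long prevSeq, long nowEpochMillis) {
    return Math.max(prevSeq + 1, nowEpochMillis);
  }
}
```

With a previous seq of 2000 and a clock that has stepped back to 1500, the next seq is 2001, so the newer update still sorts after the older one.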

D. Checked and confirmed correct (coverage)

ANRWatchDog (all-uptime); AnrV2/Tombstone 90-day threshold (wall vs wall); AndroidEnvelopeCache startup-crash window (uptime vs uptime); AndroidConnectionStatusProvider cache TTL (uptime); Debouncer on uptime (no events during sleep anyway); DeviceInfoUtil boot time (wall − elapsedRealtime); AndroidCpuCollector (elapsedRealtimeNanos deltas); AndroidProfiler per-frame clock conversion; span/transaction SentryNanotimeDate anchor pattern; LoggerBatchProcessor/BackpressureMonitor (ScheduledExecutorService); OkHttp HTTP_START/END_TIMESTAMP (deliberately epoch for RRWeb). Swept clean: apollo*, graphql*, kafka, quartz, spotlight, reactor, ndk, fragment, navigation, distribution, jul/logback/log4j2, async-profiler.


Suggested fix order

  1. A1 — split/rename ICurrentDateProvider semantics (prevents all future regressions; internal API).
  2. B1 + B2 — replace java.util.Timer with SentryExecutorService for tracer idle/deadline + session-end (and C3's observer timer); clamp late-fire timestamps.
  3. B3 — monotonic windows in replay capture strategies.
  4. B4 — nanoTime for check-in durations (CheckInUtils + 3 spring files).
  5. C1/C2 — profiler measurement stamping; session seq monotonicity.
  6. Rest are doc comments / opportunistic.

Verification idea for fixes: unit tests already inject ICurrentDateProvider fakes everywhere — add cases simulating backward/forward clock steps and assert intervals are unaffected.
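A sketch of the kind of fake such tests could use. `FakeClock` is hypothetical and deliberately independent of ICurrentDateProvider; it models a wall reading and a monotonic reading that normally advance together, plus a method that steps only the wall clock the way an NTP or user adjustment would.

```java
/** Hypothetical test fake: wall and monotonic readings that can diverge on a clock step. */
final class FakeClock {
  private long wallMillis;
  private long monotonicMillis;

  FakeClock(long initialWallMillis) {
    this.wallMillis = initialWallMillis;
    this.monotonicMillis = 0L;
  }

  /** Real time passing: both clocks advance. */
  void advance(long millis) {
    wallMillis += millis;
    monotonicMillis += millis;
  }

  /** Clock adjustment: only the wall clock moves (negative delta = backward step). */
  void stepWall(long deltaMillis) {
    wallMillis += deltaMillis;
  }

  long wallMillis() {
    return wallMillis;
  }

  long monotonicMillis() {
    return monotonicMillis;
  }
}
```

A test would record both readings, call `advance(10_000)` followed by `stepWall(-3_600_000)`, and assert that the code under test still reports a 10s interval; a wall-clock diff would report roughly minus one hour instead.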
