Skip to content

Recover Codex token counts for backfill and normalize token usage: net input + one carrier row per response#136

Open
bgmcmullen wants to merge 1 commit into
masterfrom
token-usage-normalization
Open

Recover Codex token counts for backfill and normalize token usage: net input + one carrier row per response#136
bgmcmullen wants to merge 1 commit into
masterfrom
token-usage-normalization

Conversation

@bgmcmullen

Copy link
Copy Markdown
Contributor

Makes attributes.usage on ai_gateway_messages analyzable without
provider/capture-mode special-casing, and recovers Codex token counts for
backfill. Motivated by an analyst hitting a silent ~2.5× overcount and the
wrong storage path. New design doc: LLP 0035; revises LLP 0026.

What & why

Token usage was a confidently-wrong-numbers trap:

  • usage.input_tokens meant uncached input for Claude but cache-inclusive
    for Codex — same column, two meanings.
  • Claude duplicated the response-level usage onto every content-block row,
    so a naive SUM over-counted ~2.5×; Codex put it on one row → inconsistent.
  • The documented recipe read raw_frame.message.usage, which is null for
    Claude live and all Codex (only Claude backfill has the raw frame, under the
    flat raw_frame.message_id).

Two invariants fix it (LLP 0035):

  1. Net input, everywhere. input_tokens is uncached for both providers;
    cache rides cache_read_tokens (+ cache_write_tokens for Claude). So
    input + cache_read [+ cache_write] = total prompt dataset-wide, and
    input + cache_read + output == total_tokens reconciles.
  2. One carrier per response, on the last assistant row — the terminal
    output item (tool_use, else final text), the same row as stop_reason.
    A plain SUM(attributes.usage.*) is now correct, and Claude and Codex render
    identically in the table.

Changes

Codex backfill — recover per-turn usage from token_count rollout events
(last_token_usage, never the cumulative total_token_usage); net-input
normalization; stamp on the turn's last assistant row.

Codex live (exchange-projector.js) — net-input normalization; usage on the
last assistant message (stampUsageOnLastAssistant).

Claude live + backfill — stamp response-level usage once, on the last
block of each API message (alongside stop_reason), instead of every block.

Gateway (message_projector.js) — expandMessageParts stamps usage on
only the last part of a multi-block message (stripUsage), making the
one-carrier rule structural. (Claude backfill does emit multi-block carrier
messages — reasoning + text, reasoning + tool_use, parallel tool calls —
which were the only rows a plain SUM over-counted.)

Docs / memory — new LLP 0035 (incl. the canonical attributes.usage query);
LLP 0026 consequence revised; the reference-token-accounting-method memory
corrected (attributes.usage, not raw_frame).

Compatibility

No in-app consumer reads attributes.usage — verified across context graph,
enrichment, sinks, datasets, and vector search; it's only ad-hoc/skill SQL.
The max()-per-COALESCE(raw_frame.message_id, message_id) rollup still works
on both old (duplicated) and new (single-carrier) data.

Testing

New tests across all four projector paths + the gateway multi-block carrier
case; validated against a real Codex rollout (32 responses, sums reconcile).
Full suite green except 11 pre-existing iceberg/vector failures (missing native
deps), unrelated to this change.

🤖 Generated with Claude Code

@bgmcmullen bgmcmullen requested a review from philcunliffe June 24, 2026 05:58
@philcunliffe

Copy link
Copy Markdown
Contributor

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Claude/Codex last vs total token_usage not distinguished in tests (minor; codex-backfill.test.js:344) Targets: codex codexUsageFromTokenCount; Risk: silent drop / per-turn correctness
Claude no multi-turn coverage for turnStartIndex (minor; codex/backfill.js:495) Targets: codex projectedExchangeFromSession; Risk: silent usage drop edge
Contract (sub-threshold, conf 70) stale analyst-agent.md MAX-dedupe guidance (minor) Risk: stale in-repo consumer guidance
Contract (sub-threshold, conf 55) live/backfill carrier-selection asymmetry, benign today (nit) Risk: silent drop / convergence invariant
Codex review

Fix Validations

Claude usage overcount from per-block duplication

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/claude/src/projector.js:381, hypaware-core/plugins-workspace/claude/src/backfill.js:307, hypaware-core/plugins-workspace/ai-gateway/src/message_projector.js:727
  • Assessment: Live, backfill, and gateway multi-block expansion now converge on one usage carrier: the last block/part only.

Codex backfill token-count recovery

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:412, hypaware-core/plugins-workspace/codex/src/backfill.js:725, hypaware-core/plugins-workspace/codex/src/backfill.js:748, hypaware-core/plugins-workspace/codex/src/backfill.js:773
  • Assessment: The parser keeps token_count as a non-message boundary marker, uses last_token_usage rather than cumulative totals, normalizes cached input to net input, and stamps the last eligible assistant message.

Codex live usage normalization

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:223, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:284, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:526, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:564
  • Assessment: Chat and Responses both normalize OpenAI usage into attributes.usage and attach it once to the final assistant item.

Findings

9) Test Evidence Quality

  • Severity: minor
  • Confidence: high
  • Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:725, test/plugins/codex-backfill.test.js:321, test/plugins/codex-backfill.test.js:344
  • Why it matters: The PR’s key Codex backfill fix is “use last_token_usage, not cumulative total_token_usage,” but the new test sets both fields to the same object, so a regression back to cumulative accounting would still pass.
  • Suggested fix: Add a two-turn rollout test where each last_token_usage is a delta and total_token_usage is cumulative, then assert the projected/summed usage equals the deltas only.

No Finding

    1. Behavioral Correctness
    1. Contract & Interface Fidelity
    1. Change Impact / Blast Radius
    1. Concurrency, Ordering & State Safety
    1. Error Handling & Resilience
    1. Security Surface
    1. Resource Lifecycle & Cleanup
    1. Release Safety
    1. Architectural Consistency
    1. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: expandMessageParts usage stripping, Claude live/backfill assistant projection, Codex rollout token_count parsing/projection, Codex OpenAI Chat/Responses usage projection.
  • Impacted callers: hypaware-core/plugins-workspace/ai-gateway/src/message_projector.js:451, hypaware-core/plugins-workspace/claude/src/projector.js:198, hypaware-core/plugins-workspace/claude/src/backfill.js:320, hypaware-core/plugins-workspace/codex/src/backfill.js:183, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:196, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:198
  • Impacted tests: test/plugins/ai-gateway-message-projector.test.js:497, test/plugins/claude-backfill.test.js:368, test/plugins/claude-projector-identity.test.js:87, test/plugins/codex-backfill.test.js:321, test/plugins/codex-exchange-projector.test.js:59, test/plugins/codex-exchange-projector.test.js:149, test/plugins/codex-exchange-projector.test.js:248
  • Unresolved uncertainty: I did not run the suite; review is static against the supplied diff plus targeted trace searches.
Claude review

Claude review

Five parallel review lenses (guidance, shallow-bug, history, contract/callers,
comments/tests) plus a blast-radius pass. Only findings scoring ≥80 confidence
are kept below. The implementation itself was judged correct across all four
projection paths by every lens and independently by Codex; the surviving
findings are test-evidence gaps, not code defects.

Codex backfill test cannot catch a regression to cumulative token totals

  • Severity: minor
  • Confidence: 88
  • Evidence: test/plugins/codex-backfill.test.js:344
  • Why it matters: The PR's core per-turn rule (LLP 0035#per-turn) is "read last_token_usage, NOT cumulative total_token_usage" — the explicit multiply-count trap — but the only token_count test sets last_token_usage and total_token_usage to the same object, so a regression that read the cumulative total instead would still pass. Codex independently flagged this exact gap.
  • Suggested fix: Make the two fields differ (a two-turn rollout, or total_token_usage larger than last_token_usage) and assert the stamped usage equals last_token_usage only.

No multi-turn coverage for the codex backfill turn boundary (turnStartIndex)

  • Severity: minor
  • Confidence: 80
  • Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:495, test/plugins/codex-backfill.test.js:321
  • Why it matters: The new turnStartIndex advance is the logic that stops turn 2's usage from being stamped onto turn 1's last assistant row; with only a single-turn token_count test, a bug that mis-attributes a later turn's usage to an earlier turn (or double-stamps) would not be caught.
  • Suggested fix: Add a rollout with two turns each ending in a token_count with distinct usage, and assert each turn's last assistant message carries its own turn's usage with no earlier row overwritten.

Sub-threshold (noted, not counted)

  • Stale consumer doc (conf 70): claude/agents/hypaware-analyst.md:39 still tells querying agents to MAX-dedupe usage "because it repeats across parts" — the pre-PR shape. MAX still returns correct numbers, but the shipped query-guidance doc was not updated alongside LLP 0035/0026. Worth a one-line fix.
  • Live/backfill carrier asymmetry (conf 55): codex live stampUsageOnLastAssistant stamps the last assistant of any block type; backfill stampUsageOnTurn skips reasoning-only. Verified benign today (live Responses never emits reasoning-only assistant messages) but rests on an implicit invariant — a one-line note or shared predicate would harden it.
  • Window-filter edge (conf 55): a --since/--until boundary between a turn's messages and its later token_count marker can drop or (rarely) mis-attribute that turn's usage in a bounded backfill; full-history backfill is unaffected.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-136/dual-review/pr-136

@philcunliffe

Copy link
Copy Markdown
Contributor

🧭 Decision map — where to spend your attention

Companion to the dual-review verdict. This casts no verdict — it points at the 5 forks where the author made a real choice, so you can skim the rest.

Scanned: 31 hunks across 14 files (6 source, 5 test, 3 LLP). Most is mechanical: the test fixtures, the LLP corpus (new 0035 + 0026/0000 edits), JSDoc/.d.ts types, and value-helper boilerplate (numberValue, copyNumberAlias, firstPlainObject, mergeJsonObjects). The decisions worth your eyes, in order:

1. input_tokens stored NET of cache (gross − cached, floored at 0) · contract / shape

codex/src/exchange-projector.js:528 · codex/src/backfill.js:748

usage.input_tokens = cachedInput !== undefined ? Math.max(0, grossInput - cachedInput) : grossInput
  • Decision: Subtract the cached reads from the provider's gross input and store the remainder, with cached on cache_read_tokens — making Codex/OpenAI match Claude's already-net convention.
  • Alternative not taken: Store provider-native gross (zero transform), or "gross everywhere" (fold Claude's cache into input). LLP 0035 weighed and rejected both.
  • Check: Looks like one arithmetic line; actually redefines an analytic column. (a) Codex input_tokens now ≠ the OpenAI dashboard (add cache_read_tokens back) — intended, but confirm no consumer assumes raw. (b) Math.max(0, …) silently floors to 0 if cached > gross, breaking the net + cache_read + output == total identity — confirm that shape never occurs.

2. The one carrier row is the LAST assistant row · contract / shape

claude/src/projector.js:381 · ai-gateway/src/message_projector.js:727 · codex backfill:773

attributes: mergeJsonObjects(baseClientAttributes, isLast ? messageAttributes : nonCarrierAttributes),
  • Decision: Response-level usage rides exactly one row — the last assistant block of the response. Codex backfill was switched first→last to match Claude.
  • Alternative not taken: Keep duplicating usage on every block (pre-PR Claude behavior, forcing consumer dedupe), or stamp the first row.
  • Check: Correctness needs live and backfill to pick the same last row, else two non-deduping carriers. The predicates differ — live takes the last assistant of any block type; backfill (stampUsageOnTurn) skips reasoning-only. Benign today only because live Responses never emits a reasoning-only assistant message — an implicit invariant worth a guard.

3. Read last_token_usage, not total_token_usage · magic value / contract

codex/src/backfill.js:725

return codexUsageAttributes(info.last_token_usage)
  • Decision: Pull the token_count event's per-turn delta.
  • Alternative not taken: The sibling info.total_token_usage (cumulative running total) — the multiply-count trap when summed across a conversation.
  • Check: Confirm last_token_usage is genuinely the per-turn delta in real rollouts. Both reviewers noted the new test sets last==total, so it doesn't actually guard this regression.

4. token_count kept as a non-projecting turn-boundary marker · contract / shape · unhappy-path

codex/src/backfill.js:412

} else if (type === 'event_msg' && payload) {
  • Decision: Of all event_msg records (previously all skipped), retain exactly one — token_count — as a synthetic item that preserves its stream slot, carries usage, and never projects a row.
  • Alternative not taken: Project it as its own row; attach usage at parse time to the preceding item; or keep skipping it (status quo = no Codex-backfill usage).
  • Check: The synthetic marker must never leak as a row, and all other event_msg types must still be skipped — there's no negative-path test for a non-token_count event_msg.

5. Drop usage rather than mis-attribute it · unhappy-path policy

codex/src/backfill.js:773

if (message.role !== 'assistant' || !hasTextOrToolUse(message)) continue
  • Decision: A turn with no text/tool_use assistant message silently discards its usage.
  • Alternative not taken: Fall back to the last assistant of any kind (what the live path does), or stamp the turn's first message.
  • Check: Accepts a silent under-count for reasoning-only / windowed-out turns ("data loss over corruption"). Confirm such turns are rare; it also widens the live/backfill predicate gap from [codex] Remove OpenTelemetry npm dependencies #2.

Honorable mentions (real but lower-stakes): codex/src/exchange-projector.js:285,486 — Responses usage read body-first with stream fallback, accepting both response.completed.response.usage and payload-direct shapes (last terminal event wins); codex/src/backfill.js:754total_tokens passed through raw to preserve the reconciliation identity; llp/0026 — the consequence rewrite that supersedes the old "dedupe by API message id" rule consumers may have coded against.

Generated by /decision-map. Advisory — directs attention, casts no verdict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants