Recover Codex token counts for backfill and normalize token usage: net input + one carrier row per response by bgmcmullen · Pull Request #136 · hyparam/hypaware

bgmcmullen · 2026-06-24T05:58:13Z

Makes attributes.usage on ai_gateway_messages analyzable without
provider/capture-mode special-casing, and recovers Codex token counts for
backfill. Motivated by an analyst hitting a silent ~2.5× overcount and the
wrong storage path. New design doc: LLP 0035; revises LLP 0026.

What & why

Token usage was a confidently-wrong-numbers trap:

usage.input_tokens meant uncached input for Claude but cache-inclusive
for Codex — same column, two meanings.
Claude duplicated the response-level usage onto every content-block row,
so a naive SUM over-counted ~2.5×; Codex put it on one row → inconsistent.
The documented recipe read raw_frame.message.usage, which is null for
Claude live and all Codex (only Claude backfill has the raw frame, under the
flat raw_frame.message_id).

Two invariants fix it (LLP 0035):

Net input, everywhere. input_tokens is uncached for both providers;
cache rides cache_read_tokens (+ cache_write_tokens for Claude). So
input + cache_read [+ cache_write] = total prompt dataset-wide, and
input + cache_read + output == total_tokens reconciles.
One carrier per response, on the last assistant row — the terminal
output item (tool_use, else final text), the same row as stop_reason.
A plain SUM(attributes.usage.*) is now correct, and Claude and Codex render
identically in the table.

Changes

Codex backfill — recover per-turn usage from token_count rollout events
(last_token_usage, never the cumulative total_token_usage); net-input
normalization; stamp on the turn's last assistant row.

Codex live (exchange-projector.js) — net-input normalization; usage on the
last assistant message (stampUsageOnLastAssistant).

Claude live + backfill — stamp response-level usage once, on the last
block of each API message (alongside stop_reason), instead of every block.

Gateway (message_projector.js) — expandMessageParts stamps usage on
only the last part of a multi-block message (stripUsage), making the
one-carrier rule structural. (Claude backfill does emit multi-block carrier
messages — reasoning + text, reasoning + tool_use, parallel tool calls —
which were the only rows a plain SUM over-counted.)

Docs / memory — new LLP 0035 (incl. the canonical attributes.usage query);
LLP 0026 consequence revised; the reference-token-accounting-method memory
corrected (attributes.usage, not raw_frame).

Compatibility

No in-app consumer reads attributes.usage — verified across context graph,
enrichment, sinks, datasets, and vector search; it's only ad-hoc/skill SQL.
The max()-per-COALESCE(raw_frame.message_id, message_id) rollup still works
on both old (duplicated) and new (single-carrier) data.

Testing

New tests across all four projector paths + the gateway multi-block carrier
case; validated against a real Codex rollout (32 responses, sums reconcile).
Full suite green except 11 pre-existing iceberg/vector failures (missing native
deps), unrelated to this change.

🤖 Generated with Claude Code

philcunliffe · 2026-06-24T06:48:08Z

Dual-agent review — `request_changes`

Verdict: request_changes
Risk class: medium
Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source	Finding (severity, evidence)	Intersects
Claude/Codex	last vs total token_usage not distinguished in tests (minor; codex-backfill.test.js:344)	Targets: codex `codexUsageFromTokenCount`; Risk: silent drop / per-turn correctness
Claude	no multi-turn coverage for `turnStartIndex` (minor; codex/backfill.js:495)	Targets: codex `projectedExchangeFromSession`; Risk: silent usage drop edge
Contract (sub-threshold, conf 70)	stale analyst-agent.md MAX-dedupe guidance (minor)	Risk: stale in-repo consumer guidance
Contract (sub-threshold, conf 55)	live/backfill carrier-selection asymmetry, benign today (nit)	Risk: silent drop / convergence invariant

Codex review

Fix Validations

Claude usage overcount from per-block duplication

Status: correct
Evidence: hypaware-core/plugins-workspace/claude/src/projector.js:381, hypaware-core/plugins-workspace/claude/src/backfill.js:307, hypaware-core/plugins-workspace/ai-gateway/src/message_projector.js:727
Assessment: Live, backfill, and gateway multi-block expansion now converge on one usage carrier: the last block/part only.

Codex backfill token-count recovery

Status: correct
Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:412, hypaware-core/plugins-workspace/codex/src/backfill.js:725, hypaware-core/plugins-workspace/codex/src/backfill.js:748, hypaware-core/plugins-workspace/codex/src/backfill.js:773
Assessment: The parser keeps token_count as a non-message boundary marker, uses last_token_usage rather than cumulative totals, normalizes cached input to net input, and stamps the last eligible assistant message.

Codex live usage normalization

Status: correct
Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:223, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:284, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:526, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:564
Assessment: Chat and Responses both normalize OpenAI usage into attributes.usage and attach it once to the final assistant item.

Findings

9) Test Evidence Quality

Severity: minor
Confidence: high
Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:725, test/plugins/codex-backfill.test.js:321, test/plugins/codex-backfill.test.js:344
Why it matters: The PR’s key Codex backfill fix is “use last_token_usage, not cumulative total_token_usage,” but the new test sets both fields to the same object, so a regression back to cumulative accounting would still pass.
Suggested fix: Add a two-turn rollout test where each last_token_usage is a delta and total_token_usage is cumulative, then assert the projected/summed usage equals the deltas only.

No Finding

1. Behavioral Correctness
1. Contract & Interface Fidelity
1. Change Impact / Blast Radius
1. Concurrency, Ordering & State Safety
1. Error Handling & Resilience
1. Security Surface
1. Resource Lifecycle & Cleanup
1. Release Safety
1. Architectural Consistency
1. Debuggability & Operability

Evidence Bundle

Changed hot paths: expandMessageParts usage stripping, Claude live/backfill assistant projection, Codex rollout token_count parsing/projection, Codex OpenAI Chat/Responses usage projection.
Impacted callers: hypaware-core/plugins-workspace/ai-gateway/src/message_projector.js:451, hypaware-core/plugins-workspace/claude/src/projector.js:198, hypaware-core/plugins-workspace/claude/src/backfill.js:320, hypaware-core/plugins-workspace/codex/src/backfill.js:183, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:196, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:198
Impacted tests: test/plugins/ai-gateway-message-projector.test.js:497, test/plugins/claude-backfill.test.js:368, test/plugins/claude-projector-identity.test.js:87, test/plugins/codex-backfill.test.js:321, test/plugins/codex-exchange-projector.test.js:59, test/plugins/codex-exchange-projector.test.js:149, test/plugins/codex-exchange-projector.test.js:248
Unresolved uncertainty: I did not run the suite; review is static against the supplied diff plus targeted trace searches.

Claude review

Claude review

Five parallel review lenses (guidance, shallow-bug, history, contract/callers,
comments/tests) plus a blast-radius pass. Only findings scoring ≥80 confidence
are kept below. The implementation itself was judged correct across all four
projection paths by every lens and independently by Codex; the surviving
findings are test-evidence gaps, not code defects.

Codex backfill test cannot catch a regression to cumulative token totals

Severity: minor
Confidence: 88
Evidence: test/plugins/codex-backfill.test.js:344
Why it matters: The PR's core per-turn rule (LLP 0035#per-turn) is "read last_token_usage, NOT cumulative total_token_usage" — the explicit multiply-count trap — but the only token_count test sets last_token_usage and total_token_usage to the same object, so a regression that read the cumulative total instead would still pass. Codex independently flagged this exact gap.
Suggested fix: Make the two fields differ (a two-turn rollout, or total_token_usage larger than last_token_usage) and assert the stamped usage equals last_token_usage only.

No multi-turn coverage for the codex backfill turn boundary (`turnStartIndex`)

Severity: minor
Confidence: 80
Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:495, test/plugins/codex-backfill.test.js:321
Why it matters: The new turnStartIndex advance is the logic that stops turn 2's usage from being stamped onto turn 1's last assistant row; with only a single-turn token_count test, a bug that mis-attributes a later turn's usage to an earlier turn (or double-stamps) would not be caught.
Suggested fix: Add a rollout with two turns each ending in a token_count with distinct usage, and assert each turn's last assistant message carries its own turn's usage with no earlier row overwritten.

Sub-threshold (noted, not counted)

Stale consumer doc (conf 70): claude/agents/hypaware-analyst.md:39 still tells querying agents to MAX-dedupe usage "because it repeats across parts" — the pre-PR shape. MAX still returns correct numbers, but the shipped query-guidance doc was not updated alongside LLP 0035/0026. Worth a one-line fix.
Live/backfill carrier asymmetry (conf 55): codex live stampUsageOnLastAssistant stamps the last assistant of any block type; backfill stampUsageOnTurn skips reasoning-only. Verified benign today (live Responses never emits reasoning-only assistant messages) but rests on an implicit invariant — a one-line note or shared predicate would harden it.
Window-filter edge (conf 55): a --since/--until boundary between a turn's messages and its later token_count marker can drop or (rarely) mis-attribute that turn's usage in a bounded backfill; full-history backfill is unaffected.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-136/dual-review/pr-136

philcunliffe · 2026-06-24T06:51:48Z

🧭 Decision map — where to spend your attention

Companion to the dual-review verdict. This casts no verdict — it points at the 5 forks where the author made a real choice, so you can skim the rest.

Scanned: 31 hunks across 14 files (6 source, 5 test, 3 LLP). Most is mechanical: the test fixtures, the LLP corpus (new 0035 + 0026/0000 edits), JSDoc/.d.ts types, and value-helper boilerplate (numberValue, copyNumberAlias, firstPlainObject, mergeJsonObjects). The decisions worth your eyes, in order:

1. `input_tokens` stored NET of cache (gross − cached, floored at 0) · contract / shape

codex/src/exchange-projector.js:528 · codex/src/backfill.js:748

usage.input_tokens = cachedInput !== undefined ? Math.max(0, grossInput - cachedInput) : grossInput

Decision: Subtract the cached reads from the provider's gross input and store the remainder, with cached on cache_read_tokens — making Codex/OpenAI match Claude's already-net convention.
Alternative not taken: Store provider-native gross (zero transform), or "gross everywhere" (fold Claude's cache into input). LLP 0035 weighed and rejected both.
Check: Looks like one arithmetic line; actually redefines an analytic column. (a) Codex input_tokens now ≠ the OpenAI dashboard (add cache_read_tokens back) — intended, but confirm no consumer assumes raw. (b) Math.max(0, …) silently floors to 0 if cached > gross, breaking the net + cache_read + output == total identity — confirm that shape never occurs.

2. The one carrier row is the LAST assistant row · contract / shape

claude/src/projector.js:381 · ai-gateway/src/message_projector.js:727 · codex backfill:773

attributes: mergeJsonObjects(baseClientAttributes, isLast ? messageAttributes : nonCarrierAttributes),

Decision: Response-level usage rides exactly one row — the last assistant block of the response. Codex backfill was switched first→last to match Claude.
Alternative not taken: Keep duplicating usage on every block (pre-PR Claude behavior, forcing consumer dedupe), or stamp the first row.
Check: Correctness needs live and backfill to pick the same last row, else two non-deduping carriers. The predicates differ — live takes the last assistant of any block type; backfill (stampUsageOnTurn) skips reasoning-only. Benign today only because live Responses never emits a reasoning-only assistant message — an implicit invariant worth a guard.

3. Read `last_token_usage`, not `total_token_usage` · magic value / contract

codex/src/backfill.js:725

return codexUsageAttributes(info.last_token_usage)

Decision: Pull the token_count event's per-turn delta.
Alternative not taken: The sibling info.total_token_usage (cumulative running total) — the multiply-count trap when summed across a conversation.
Check: Confirm last_token_usage is genuinely the per-turn delta in real rollouts. Both reviewers noted the new test sets last==total, so it doesn't actually guard this regression.

4. `token_count` kept as a non-projecting turn-boundary marker · contract / shape · unhappy-path

codex/src/backfill.js:412

} else if (type === 'event_msg' && payload) {

Decision: Of all event_msg records (previously all skipped), retain exactly one — token_count — as a synthetic item that preserves its stream slot, carries usage, and never projects a row.
Alternative not taken: Project it as its own row; attach usage at parse time to the preceding item; or keep skipping it (status quo = no Codex-backfill usage).
Check: The synthetic marker must never leak as a row, and all other event_msg types must still be skipped — there's no negative-path test for a non-token_count event_msg.

5. Drop usage rather than mis-attribute it · unhappy-path policy

codex/src/backfill.js:773

if (message.role !== 'assistant' || !hasTextOrToolUse(message)) continue

Decision: A turn with no text/tool_use assistant message silently discards its usage.
Alternative not taken: Fall back to the last assistant of any kind (what the live path does), or stamp the turn's first message.
Check: Accepts a silent under-count for reasoning-only / windowed-out turns ("data loss over corruption"). Confirm such turns are rare; it also widens the live/backfill predicate gap from [codex] Remove OpenTelemetry npm dependencies #2.

Honorable mentions (real but lower-stakes): codex/src/exchange-projector.js:285,486 — Responses usage read body-first with stream fallback, accepting both response.completed.response.usage and payload-direct shapes (last terminal event wins); codex/src/backfill.js:754 — total_tokens passed through raw to preserve the reconciliation identity; llp/0026 — the consequence rewrite that supersedes the old "dedupe by API message id" rule consumers may have coded against.

_{Generated by /decision-map. Advisory — directs attention, casts no verdict.}

token usage normalization

f8c7cab

bgmcmullen requested a review from philcunliffe June 24, 2026 05:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recover Codex token counts for backfill and normalize token usage: net input + one carrier row per response#136

Recover Codex token counts for backfill and normalize token usage: net input + one carrier row per response#136
bgmcmullen wants to merge 1 commit into
masterfrom
token-usage-normalization

bgmcmullen commented Jun 24, 2026

Uh oh!

philcunliffe commented Jun 24, 2026

Fix Validations

Claude usage overcount from per-block duplication

Codex backfill token-count recovery

Codex live usage normalization

Findings

9) Test Evidence Quality

No Finding

Evidence Bundle

Claude review

Codex backfill test cannot catch a regression to cumulative token totals

No multi-turn coverage for the codex backfill turn boundary (`turnStartIndex`)

Sub-threshold (noted, not counted)

Uh oh!

philcunliffe commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bgmcmullen commented Jun 24, 2026

What & why

Changes

Compatibility

Testing

Uh oh!

philcunliffe commented Jun 24, 2026

Dual-agent review — request_changes

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Fix Validations

Claude usage overcount from per-block duplication

Codex backfill token-count recovery

Codex live usage normalization

Findings

9) Test Evidence Quality

No Finding

Evidence Bundle

Claude review

Codex backfill test cannot catch a regression to cumulative token totals

No multi-turn coverage for the codex backfill turn boundary (turnStartIndex)

Sub-threshold (noted, not counted)

Uh oh!

philcunliffe commented Jun 24, 2026

🧭 Decision map — where to spend your attention

1. input_tokens stored NET of cache (gross − cached, floored at 0) · contract / shape

2. The one carrier row is the LAST assistant row · contract / shape

3. Read last_token_usage, not total_token_usage · magic value / contract

4. token_count kept as a non-projecting turn-boundary marker · contract / shape · unhappy-path

5. Drop usage rather than mis-attribute it · unhappy-path policy

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Dual-agent review — `request_changes`

No multi-turn coverage for the codex backfill turn boundary (`turnStartIndex`)

1. `input_tokens` stored NET of cache (gross − cached, floored at 0) · contract / shape

3. Read `last_token_usage`, not `total_token_usage` · magic value / contract

4. `token_count` kept as a non-projecting turn-boundary marker · contract / shape · unhappy-path