Recover Codex token counts for backfill and normalize token usage: net input + one carrier row per response#136
Recover Codex token counts for backfill and normalize token usage: net input + one carrier row per response#136bgmcmullen wants to merge 1 commit into
Conversation
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Claude/Codex | last vs total token_usage not distinguished in tests (minor; codex-backfill.test.js:344) | Targets: codex codexUsageFromTokenCount; Risk: silent drop / per-turn correctness |
| Claude | no multi-turn coverage for turnStartIndex (minor; codex/backfill.js:495) |
Targets: codex projectedExchangeFromSession; Risk: silent usage drop edge |
| Contract (sub-threshold, conf 70) | stale analyst-agent.md MAX-dedupe guidance (minor) | Risk: stale in-repo consumer guidance |
| Contract (sub-threshold, conf 55) | live/backfill carrier-selection asymmetry, benign today (nit) | Risk: silent drop / convergence invariant |
Codex review
Fix Validations
Claude usage overcount from per-block duplication
- Status: correct
- Evidence: hypaware-core/plugins-workspace/claude/src/projector.js:381, hypaware-core/plugins-workspace/claude/src/backfill.js:307, hypaware-core/plugins-workspace/ai-gateway/src/message_projector.js:727
- Assessment: Live, backfill, and gateway multi-block expansion now converge on one usage carrier: the last block/part only.
Codex backfill token-count recovery
- Status: correct
- Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:412, hypaware-core/plugins-workspace/codex/src/backfill.js:725, hypaware-core/plugins-workspace/codex/src/backfill.js:748, hypaware-core/plugins-workspace/codex/src/backfill.js:773
- Assessment: The parser keeps
token_countas a non-message boundary marker, useslast_token_usagerather than cumulative totals, normalizes cached input to net input, and stamps the last eligible assistant message.
Codex live usage normalization
- Status: correct
- Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:223, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:284, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:526, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:564
- Assessment: Chat and Responses both normalize OpenAI usage into
attributes.usageand attach it once to the final assistant item.
Findings
9) Test Evidence Quality
- Severity: minor
- Confidence: high
- Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:725, test/plugins/codex-backfill.test.js:321, test/plugins/codex-backfill.test.js:344
- Why it matters: The PR’s key Codex backfill fix is “use
last_token_usage, not cumulativetotal_token_usage,” but the new test sets both fields to the same object, so a regression back to cumulative accounting would still pass. - Suggested fix: Add a two-turn rollout test where each
last_token_usageis a delta andtotal_token_usageis cumulative, then assert the projected/summed usage equals the deltas only.
No Finding
-
- Behavioral Correctness
-
- Contract & Interface Fidelity
-
- Change Impact / Blast Radius
-
- Concurrency, Ordering & State Safety
-
- Error Handling & Resilience
-
- Security Surface
-
- Resource Lifecycle & Cleanup
-
- Release Safety
-
- Architectural Consistency
-
- Debuggability & Operability
Evidence Bundle
- Changed hot paths:
expandMessagePartsusage stripping, Claude live/backfill assistant projection, Codex rollouttoken_countparsing/projection, Codex OpenAI Chat/Responses usage projection. - Impacted callers: hypaware-core/plugins-workspace/ai-gateway/src/message_projector.js:451, hypaware-core/plugins-workspace/claude/src/projector.js:198, hypaware-core/plugins-workspace/claude/src/backfill.js:320, hypaware-core/plugins-workspace/codex/src/backfill.js:183, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:196, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:198
- Impacted tests: test/plugins/ai-gateway-message-projector.test.js:497, test/plugins/claude-backfill.test.js:368, test/plugins/claude-projector-identity.test.js:87, test/plugins/codex-backfill.test.js:321, test/plugins/codex-exchange-projector.test.js:59, test/plugins/codex-exchange-projector.test.js:149, test/plugins/codex-exchange-projector.test.js:248
- Unresolved uncertainty: I did not run the suite; review is static against the supplied diff plus targeted trace searches.
Claude review
Claude review
Five parallel review lenses (guidance, shallow-bug, history, contract/callers,
comments/tests) plus a blast-radius pass. Only findings scoring ≥80 confidence
are kept below. The implementation itself was judged correct across all four
projection paths by every lens and independently by Codex; the surviving
findings are test-evidence gaps, not code defects.
Codex backfill test cannot catch a regression to cumulative token totals
- Severity: minor
- Confidence: 88
- Evidence: test/plugins/codex-backfill.test.js:344
- Why it matters: The PR's core per-turn rule (LLP 0035#per-turn) is "read
last_token_usage, NOT cumulativetotal_token_usage" — the explicit multiply-count trap — but the onlytoken_counttest setslast_token_usageandtotal_token_usageto the same object, so a regression that read the cumulative total instead would still pass. Codex independently flagged this exact gap. - Suggested fix: Make the two fields differ (a two-turn rollout, or
total_token_usagelarger thanlast_token_usage) and assert the stamped usage equalslast_token_usageonly.
No multi-turn coverage for the codex backfill turn boundary (turnStartIndex)
- Severity: minor
- Confidence: 80
- Evidence: hypaware-core/plugins-workspace/codex/src/backfill.js:495, test/plugins/codex-backfill.test.js:321
- Why it matters: The new
turnStartIndexadvance is the logic that stops turn 2's usage from being stamped onto turn 1's last assistant row; with only a single-turntoken_counttest, a bug that mis-attributes a later turn's usage to an earlier turn (or double-stamps) would not be caught. - Suggested fix: Add a rollout with two turns each ending in a
token_countwith distinct usage, and assert each turn's last assistant message carries its own turn's usage with no earlier row overwritten.
Sub-threshold (noted, not counted)
- Stale consumer doc (conf 70):
claude/agents/hypaware-analyst.md:39still tells querying agents to MAX-dedupeusage"because it repeats across parts" — the pre-PR shape. MAX still returns correct numbers, but the shipped query-guidance doc was not updated alongside LLP 0035/0026. Worth a one-line fix. - Live/backfill carrier asymmetry (conf 55): codex live
stampUsageOnLastAssistantstamps the last assistant of any block type; backfillstampUsageOnTurnskips reasoning-only. Verified benign today (live Responses never emits reasoning-only assistant messages) but rests on an implicit invariant — a one-line note or shared predicate would harden it. - Window-filter edge (conf 55): a
--since/--untilboundary between a turn's messages and its latertoken_countmarker can drop or (rarely) mis-attribute that turn's usage in a bounded backfill; full-history backfill is unaffected.
Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-136/dual-review/pr-136
🧭 Decision map — where to spend your attentionCompanion to the dual-review verdict. This casts no verdict — it points at the 5 forks where the author made a real choice, so you can skim the rest. Scanned: 31 hunks across 14 files (6 source, 5 test, 3 LLP). Most is mechanical: the test fixtures, the LLP corpus (new 0035 + 0026/0000 edits), JSDoc/ 1.
|
Makes
attributes.usageonai_gateway_messagesanalyzable withoutprovider/capture-mode special-casing, and recovers Codex token counts for
backfill. Motivated by an analyst hitting a silent ~2.5× overcount and the
wrong storage path. New design doc: LLP 0035; revises LLP 0026.
What & why
Token usage was a confidently-wrong-numbers trap:
usage.input_tokensmeant uncached input for Claude but cache-inclusivefor Codex — same column, two meanings.
usageonto every content-block row,so a naive
SUMover-counted ~2.5×; Codex put it on one row → inconsistent.raw_frame.message.usage, which is null forClaude live and all Codex (only Claude backfill has the raw frame, under the
flat
raw_frame.message_id).Two invariants fix it (LLP 0035):
input_tokensis uncached for both providers;cache rides
cache_read_tokens(+ cache_write_tokensfor Claude). Soinput + cache_read [+ cache_write] = total promptdataset-wide, andinput + cache_read + output == total_tokensreconciles.output item (
tool_use, else finaltext), the same row asstop_reason.A plain
SUM(attributes.usage.*)is now correct, and Claude and Codex renderidentically in the table.
Changes
Codex backfill — recover per-turn usage from
token_countrollout events(
last_token_usage, never the cumulativetotal_token_usage); net-inputnormalization; stamp on the turn's last assistant row.
Codex live (
exchange-projector.js) — net-input normalization; usage on thelast assistant message (
stampUsageOnLastAssistant).Claude live + backfill — stamp response-level
usageonce, on the lastblock of each API message (alongside
stop_reason), instead of every block.Gateway (
message_projector.js) —expandMessagePartsstampsusageononly the last part of a multi-block message (
stripUsage), making theone-carrier rule structural. (Claude backfill does emit multi-block carrier
messages —
reasoning + text,reasoning + tool_use, parallel tool calls —which were the only rows a plain
SUMover-counted.)Docs / memory — new LLP 0035 (incl. the canonical
attributes.usagequery);LLP 0026 consequence revised; the
reference-token-accounting-methodmemorycorrected (
attributes.usage, notraw_frame).Compatibility
No in-app consumer reads
attributes.usage— verified across context graph,enrichment, sinks, datasets, and vector search; it's only ad-hoc/skill SQL.
The
max()-per-COALESCE(raw_frame.message_id, message_id)rollup still workson both old (duplicated) and new (single-carrier) data.
Testing
New tests across all four projector paths + the gateway multi-block carrier
case; validated against a real Codex rollout (32 responses, sums reconcile).
Full suite green except 11 pre-existing iceberg/vector failures (missing native
deps), unrelated to this change.
🤖 Generated with Claude Code