Support stdin for tool args #286

Merged
ScriptedAlchemy merged 28 commits into master from codex/cli-args-stdin on Jul 5, 2026
@ScriptedAlchemy
Owner

Summary

  • add --args - and --args @- stdin support for tracedecay tool JSON payloads
  • keep normal --key value flags as the primary CLI path and document JSON/file/stdin as the escape hatch
  • add hermetic Sonnet/Codex eval coverage for MCP-first vs CLI-fallback tool-args scenarios
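
As a rough illustration of the payload forms named in the summary (`--args -` and `--args @-` for stdin, with JSON and file input as the other escape hatches), the resolution step might look like the sketch below. It is written in Python for brevity and is not taken from the tracedecay source: the function name `resolve_args_payload`, the rule that a value starting with `{` is inline JSON, and the demo keys are all assumptions.

```python
import io
import json
import sys
from pathlib import Path


def resolve_args_payload(value, stdin=None):
    """Turn the value given to --args into a JSON object (a dict).

    Assumed precedence (hypothetical, for illustration only):
      "-" or "@-"      -> read JSON from stdin
      "{...}"          -> inline JSON
      "@path" / "path" -> read JSON from that file
    """
    stdin = sys.stdin if stdin is None else stdin
    if value in ("-", "@-"):
        text = stdin.read()
    elif value.lstrip().startswith("{"):
        text = value
    else:
        path = Path(value[1:] if value.startswith("@") else value)
        if not path.is_file():
            raise ValueError(f"--args file not found: {path}")
        text = path.read_text(encoding="utf-8")
    payload = json.loads(text)
    if not isinstance(payload, dict):
        raise ValueError("--args payload must be a JSON object")
    return payload


if __name__ == "__main__":
    # Made-up payload keys; real tools define their own schemas.
    demo = io.StringIO('{"query": "stdin", "limit": 5}')
    print(resolve_args_payload("@-", stdin=demo))
```

Reading from stdin sidesteps shell quoting and argv length limits, which is presumably why the PR treats it as the escape hatch for awkward payloads.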

Tests

  • bash -n eval/hermetic/run.sh
  • JSONL validation for eval/hermetic/corpora/tool-args-ergonomics.jsonl
  • python3 -m unittest eval.test_run_real_model
  • cargo fmt --check
  • cargo test -q --bin tracedecay tool_command::tests::args_escape_hatch
  • cargo test -q --test agent_suite shared_skill_contract_test
  • cargo test -q --test agent_suite plugin_skill_contract_test
  • /tmp staged binary smoke: tracedecay tool diff_context --args -
  • timeout 180s bash -lc 'ENV=$(eval/hermetic/run.sh setup --agent codex --debug); eval/hermetic/run.sh teardown --env-dir "$ENV"'
  • timeout 180s bash -lc 'ENV=$(eval/hermetic/run.sh setup --agent claude --debug); eval/hermetic/run.sh teardown --env-dir "$ENV"'

@changeset-bot

changeset-bot Bot commented Jul 4, 2026


⚠️ No Changeset found

Latest commit: d0e7900

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


ScriptedAlchemy marked this pull request as ready for review July 5, 2026 00:48

@chatgpt-codex-connector


You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@ScriptedAlchemy

Owner Author

Completed the plan implementation on the rebased branch.

Local verification:

  • cargo fmt --check
  • cargo test -q --bin tracedecay tool_command::tests
  • cargo test -q --test agent_suite cli_args_contract
  • cargo test -q --lib -- --test-threads=1
  • python3 -m py_compile eval/hermetic/score.py eval/test_run_real_model.py
  • bash -n eval/hermetic/run.sh

Note: the default parallel cargo test --lib run exposes pre-existing shared-env test interference in the hooks/response-handle tests; the same cases pass in isolation, and the lib suite passes with --test-threads=1.

@ScriptedAlchemy

Owner Author

Eval pass on current head 9161ef01:

model   corpus                        score
sonnet  tool-args-ergonomics.jsonl    2/2
sonnet  tool-args-agent-path.jsonl    5/8
opus    tool-args-ergonomics.jsonl    2/2
opus    tool-args-agent-path.jsonl    5/8

Agent-path misses:

  • ap-multiline-string: verify passed for both models; Sonnet had no missing expected CLI, while Opus missed the exact tracedecay tool insert_at scoring string.
  • ap-typo-recovery: both models missed the expected tool search scoring string.
  • ap-dry-run-preflight: verify passed for both models, but the scenario scored as a fail even though the expected CLI was present.

Artifacts kept locally:

  • Sonnet env: /scratch/tmp/eval-env-20260705-013356-725321
  • Opus env: /scratch/tmp/eval-env-20260705-013755-757997

Current PR checks are green via gh pr checks 286.

ScriptedAlchemy and others added 23 commits July 5, 2026 04:19
The concurrent stdin-args work added the bare-missing-path error case but
not the success case. Assert that a bare path (no @ sigil) reads the payload,
matching the memory curate --llm-ops convention this work aligns with.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Plan document only; no source changes. Evaluates whether the per-key
--key value surface should be the agent-facing contract at all, and
recommends a JSON-first agent path (--args with stdin/heredoc as the MCP
arguments object, per-key kept as human sugar), a shared schema
validation gate with a corrective-error contract, --dry-run, help/skill/
steering alignment, and a detailed hermetic eval plan (Sonnet + Codex)
with a baseline-vs-after protocol.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Rewrite every agent-facing surface (Codex/Cursor steering, server
instructions, degraded-serve notice, prompt rules, the using-the-cli
skill, the arg catalog, and the dozen skills that repeated the grammar)
to the Section 5.1 contract: the arguments of `tracedecay tool <name>`
are the tool's MCP arguments object, passed whole via --args (heredoc
stdin as the canonical form). Fixes the actively wrong catalog rows
(multi_str_replace, body, callers/callees/impact/rename_preview,
signature, signature_search, similar) that taught flags the schemas
don't have.

One validation pass (validate_tool_args) over the final arguments object,
shared by the --args and per-key paths: unknown keys error with a
did-you-mean and the valid flag list, enum values are enforced with the
allowed set in the message, JSON types are checked (including array item
shapes for array-of-array params), and required keys now also gate --args
payloads. Known dispatch-layer routing keys (project_root, storage_scope,
hermes_home, cwd) pass through so schema-exact integrations keep working.
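
The validation pass described above can be sketched roughly as follows. This is a minimal Python illustration, not the Rust `validate_tool_args` from the PR: the error wording, the use of `difflib` for the did-you-mean, and the schema shape (a JSON-Schema-like dict with `properties` and `required`) are assumptions, and array item shape checks are left out to keep it short.

```python
import difflib

# Routing keys the commit message says pass through validation untouched.
ROUTING_KEYS = {"project_root", "storage_scope", "hermes_home", "cwd"}

# JSON Schema type name -> accepted Python types.
JSON_TYPES = {
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
    "array": list,
    "object": dict,
}


def validate_tool_args(args, schema):
    """Return a list of corrective error messages; an empty list means valid."""
    props = schema.get("properties", {})
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required key '{key}'")
    for key, value in args.items():
        if key in ROUTING_KEYS:
            continue
        if key not in props:
            hint = difflib.get_close_matches(key, list(props), n=1)
            suggestion = f" (did you mean '{hint[0]}'?)" if hint else ""
            valid = ", ".join(sorted(props))
            errors.append(f"unknown key '{key}'{suggestion}; valid keys: {valid}")
            continue
        spec = props[key]
        expected = JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_bool = isinstance(value, bool)
        if expected is not None and (
            not isinstance(value, expected) or (is_bool and expected is not bool)
        ):
            errors.append(f"'{key}' must be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"'{key}' must be one of {spec['enum']}")
    return errors
```

Because the same function runs over the final arguments object, it does not matter whether that object came from `--args` or from per-key flags; both paths would receive identical corrective errors.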

Also: --dry-run prints the validated arguments object without dispatching;
--key=value is accepted; array/object per-key values parse as JSON when
they are valid JSON; bare/mistyped boolean flags and missing values state the
exact fix; single-dash typos of known flags get a did-you-mean; unknown
tool names suggest the nearest name. Test payloads using the phantom
fact_type key move to the real category key; the gate caught the drift.

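
To make the per-key behaviors above concrete, here is a small Python sketch of folding `--key value` and `--key=value` tokens into an arguments object, parsing array/object-looking values as JSON when they are valid JSON. The function names and error texts are hypothetical; boolean-flag handling, single-dash typo hints, and `--dry-run` are omitted.

```python
import json


def coerce_value(raw):
    """Parse array/object-looking values as JSON when valid; else keep the string."""
    if raw[:1] in ("[", "{"):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            return raw
    return raw


def parse_per_key_flags(argv):
    """Fold ['--key', 'value', '--other=value'] into an arguments dict."""
    args = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            raise ValueError(f"unexpected token '{token}'; flags look like --key value")
        key, sep, raw = token[2:].partition("=")
        if not sep:
            # Space-separated form: the value is the next token.
            if i + 1 >= len(argv):
                raise ValueError(f"--{key} needs a value, e.g. --{key} <value>")
            i += 1
            raw = argv[i]
        args[key] = coerce_value(raw)
        i += 1
    return args
```

For example, `parse_per_key_flags(["--path", "a.rs", '--edits=[["x","y"]]'])` yields a dict whose `edits` value is a nested list rather than a string, so the result can be handed to the same validation gate as an `--args` payload.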
render_tool_cli_help now prints enum values, array item shapes, and a
generated --args heredoc example for any tool with non-scalar params, and
the three drifting reserved-flag footers collapse into one shared
RESERVED_FLAGS_FOOTER. The new cli_args_contract_test pins the taught model
to the parser: steering/skill/catalog must teach the --args JSON contract,
and every flag the catalog documents must exist in the tool's schema.

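
The second half of that contract (every documented flag must exist in the tool's schema) reduces to a set difference per tool. The Python sketch below shows the idea only; the real test is a Rust test in the agent suite, and the `catalog` and `schemas` data shapes here are assumptions.

```python
def undocumented_flags(catalog, schemas):
    """Return sorted (tool, flag) pairs the catalog teaches but the schema lacks.

    catalog: {tool_name: [flag, ...]}                    (hypothetical shape)
    schemas: {tool_name: {"properties": {flag: spec}}}   (hypothetical shape)
    """
    drift = []
    for tool, flags in catalog.items():
        known = set(schemas.get(tool, {}).get("properties", {}))
        drift.extend((tool, flag) for flag in flags if flag not in known)
    return sorted(drift)
```

A contract test would then simply assert that the returned list is empty, so any catalog row that teaches a nonexistent flag fails the build instead of misleading an agent.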
Eight tool-args-agent-path scenarios covering the hard shapes (array of
pairs, multiline strings, nested objects, enums, argv-cap stdin, typo
recovery, help-only construction, --dry-run pre-flight) against a
committed quoting-gauntlet fixture pair. run.sh gains a fixtures
subcommand, fixture:<name> project_dir resolution, --reps with
between-rep fixture resets, and per-scenario setup_cmd/verify_cmd;
score.py folds the verify_cmd exit status into pass (the silent-failure
detector) and reports tool_cmd_attempts/self_corrected.

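
One way to picture how a verify_cmd exit status could fold into the pass decision is the sketch below. It is not the logic in score.py: the function signature, the definition of "self-corrected" (at least one failed attempt followed by a final success), and the idea that pass is just the conjunction of two checks are assumptions made for illustration.

```python
def score_scenario(expected_cli_seen, verify_exit_code, tool_cmd_exit_codes):
    """Combine transcript matching with the verify_cmd result.

    verify_exit_code is None when the scenario defines no verify_cmd.
    tool_cmd_exit_codes lists the exit code of each tool command attempt, in order.
    """
    verify_ok = verify_exit_code is None or verify_exit_code == 0
    attempts = len(tool_cmd_exit_codes)
    # Assumed definition: a failed attempt happened, and the last attempt succeeded.
    self_corrected = (
        attempts > 1
        and tool_cmd_exit_codes[-1] == 0
        and any(code != 0 for code in tool_cmd_exit_codes[:-1])
    )
    return {
        "pass": expected_cli_seen and verify_ok,
        "tool_cmd_attempts": attempts,
        "self_corrected": self_corrected,
    }
```

The point of including the verify step is that a transcript can contain the expected command and still leave the fixture in the wrong state; checking the end state catches that silent failure.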
`tracedecay init` refuses a project already registered in the isolated
data dir, so re-running the fixtures subcommand (or the between-reps
reset) died on the second staging. Fall back to `sync --force` to
rebuild the index for the fresh copy.

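
The fallback described above amounts to "try init, and rebuild if it refuses". The harness does this in run.sh; the Python sketch below only illustrates the control flow. The exact argument lists for `tracedecay init` and `tracedecay sync --force` are assumed from the commit message, `stage_fixture` is a hypothetical name, and treating any non-zero init exit as "already registered" is a simplification.

```python
import subprocess


def stage_fixture(project_dir, run=subprocess.run):
    """Index a freshly copied fixture; rebuild the index if init refuses."""
    init = run(["tracedecay", "init"], cwd=project_dir)
    if init.returncode == 0:
        return "init"
    # init refuses an already-registered project, so force a rebuild instead.
    run(["tracedecay", "sync", "--force"], cwd=project_dir, check=True)
    return "sync --force"
```

Passing `run` as a parameter keeps the sketch testable without the binary installed: a fake runner can simulate the refused second staging.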
OAuth session credentials in ~/.claude/.credentials.json expire, which
briefly stranded the harness with "Not logged in" sessions. seed_auth
now also honors a long-lived `claude setup-token` grant, either via a
CLAUDE_CODE_OAUTH_TOKEN already in the environment or a
~/.claude/.claude_code_oauth_token file, exported through env.sh so
run/smoke inherit it. Verified with a token-seeded smoke run (1/1).

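
The token lookup order can be sketched as below. seed_auth itself lives in the shell harness, so this Python version is only an illustration: `resolve_oauth_token` is a hypothetical name, and checking the environment variable before the file is an assumed precedence.

```python
import os
from pathlib import Path


def resolve_oauth_token(env=None, home=None):
    """Return a long-lived token from the environment or the token file, else None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    token = env.get("CLAUDE_CODE_OAUTH_TOKEN", "").strip()
    if token:
        return token
    token_file = home / ".claude" / ".claude_code_oauth_token"
    if token_file.is_file():
        # Treat an empty file the same as a missing one.
        return token_file.read_text(encoding="utf-8").strip() or None
    return None
```

A caller would export the returned value (when present) into the eval environment so that later run and smoke steps inherit it, and fall back to the session credentials otherwise.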
Deduplicate schema/required parsing, typo-distance helpers, and short
tool names; hoist take_flag_value and drop redundant branches/comments.

ScriptedAlchemy merged commit 12f6c4f into master on Jul 5, 2026
16 checks passed