Support stdin for tool args#286
Conversation
|
5ff5bb8 to
b4c1236
Compare
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Completed the plan implementation on the rebased branch. Local verification:
Note: default parallel cargo test --lib exposes existing shared-env test interference in hooks/response-handle tests; the same cases pass isolated and the lib suite passes with --test-threads=1. |
b4c1236 to
3a3d137
Compare
|
Eval pass on current head
Agent-path misses:
Artifacts kept locally:
Current PR checks are green via |
The concurrent stdin-args work added the bare-missing-path error case but not the success case; assert a bare path (no @ sigil) reads the payload, matching the memory curate --llm-ops convention this aligned to. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Plan document only — no source changes. Evaluates whether the per-key --key value surface should be the agent-facing contract at all, and recommends a JSON-first agent path (--args with stdin/heredoc as the MCP arguments object, per-key kept as human sugar), a shared schema validation gate with a corrective-error contract, --dry-run, help/skill/ steering alignment, and a detailed hermetic eval plan (Sonnet + Codex) with baseline-vs-after protocol. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rewrite every agent-facing surface — Codex/Cursor steering, server instructions, degraded-serve notice, prompt rules, the using-the-cli skill, the arg catalog, and the dozen skills that repeated the grammar — to the Section 5.1 contract: the arguments of `tracedecay tool <name>` are the tool's MCP arguments object, passed whole via --args (heredoc stdin as the canonical form). Fixes the actively-wrong catalog rows (multi_str_replace, body, callers/callees/impact/rename_preview, signature, signature_search, similar) that taught flags the schemas don't have.
One validation pass (validate_tool_args) over the final arguments object, shared by the --args and per-key paths: unknown keys error with a did-you-mean and the valid flag list, enum values are enforced with the allowed set in the message, JSON types are checked (including array item shapes for array-of-array params), and required keys now also gate --args payloads. Known dispatch-layer routing keys (project_root, storage_scope, hermes_home, cwd) pass through so schema-exact integrations keep working. Also: --dry-run prints the validated arguments object without dispatching; --key=value is accepted; array/object per-key values parse as JSON when they are JSON; bare/mistyped boolean flags and missing values state the exact fix; single-dash typos of known flags get a did-you-mean; unknown tool names suggest the nearest name. Test payloads using the phantom fact_type key move to the real category key — the gate caught the drift.
render_tool_cli_help now prints enum values, array item shapes, and a generated --args heredoc example for any tool with non-scalar params, and the three drifting reserved-flag footers collapse into one shared RESERVED_FLAGS_FOOTER. New cli_args_contract_test pins the taught model to the parser: steering/skill/catalog must teach the --args JSON contract, and every flag the catalog documents must exist in the tool's schema.
Eight tool-args-agent-path scenarios covering the hard shapes (array of pairs, multiline strings, nested objects, enums, argv-cap stdin, typo recovery, help-only construction, --dry-run pre-flight) against a committed quoting-gauntlet fixture pair. run.sh gains a fixtures subcommand, fixture:<name> project_dir resolution, --reps with between-rep fixture resets, and per-scenario setup_cmd/verify_cmd; score.py folds verify_cmd exit status into pass (the silent-failure detector) and reports tool_cmd_attempts/self_corrected.
`tracedecay init` refuses a project already registered in the isolated data dir, so re-running the fixtures subcommand (or the between-reps reset) died on the second staging. Fall back to `sync --force` to rebuild the index for the fresh copy.
OAuth session credentials in ~/.claude/.credentials.json expire and briefly stranded the harness with "Not logged in" sessions. seed_auth now also honors a long-lived `claude setup-token` grant — either via a CLAUDE_CODE_OAUTH_TOKEN already in the environment or a ~/.claude/.claude_code_oauth_token file, exported through env.sh so run/smoke inherit it. Verified with a token-seeded smoke run (1/1).
Deduplicate schema/required parsing, typo-distance helpers, and short tool names; hoist take_flag_value and drop redundant branches/comments.
ddcd560 to
07fc2d6
Compare
Summary
--args -and--args @-stdin support fortracedecay toolJSON payloads--key valueflags as the primary CLI path and document JSON/file/stdin as the escape hatchTests
bash -n eval/hermetic/run.sheval/hermetic/corpora/tool-args-ergonomics.jsonlpython3 -m unittest eval.test_run_real_modelcargo fmt --checkcargo test -q --bin tracedecay tool_command::tests::args_escape_hatchcargo test -q --test agent_suite shared_skill_contract_testcargo test -q --test agent_suite plugin_skill_contract_test/tmpstaged binary smoke:tracedecay tool diff_context --args -timeout 180s bash -lc 'ENV=$(eval/hermetic/run.sh setup --agent codex --debug); eval/hermetic/run.sh teardown --env-dir "$ENV"'\n-timeout 180s bash -lc 'ENV=$(eval/hermetic/run.sh setup --agent claude --debug); eval/hermetic/run.sh teardown --env-dir "$ENV"'