TypeAgent Studio: replay fidelity rungs + Impact Report controls#2569
Draft
TalZaccai wants to merge 30 commits into
…atcher, replay Add computeActionSchemaFileHash to agent-cache (schemaInfoProvider) as the single source of truth for the schema cache-namespace hash. The dispatcher's actionSchemaFileCache and typeagent-core's constructionCacheResolver now both delegate to it, removing two copies of the sha256/base64 hashing logic (and a copy-pasted helper in the dispatcher test). Aligns the config-truthiness check so the namespace key is byte-identical across producers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
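A minimal sketch of what a single-source namespace hash could look like, assuming Node's built-in `crypto` module. The commit message only states that the hash is sha256/base64 and that the config truthiness check was aligned; the function name `sketchSchemaFileHash`, the `SchemaHashInput` shape, and the choice of hashed fields are hypothetical, not the PR's actual signature.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape; the real provider's types are not shown in the PR.
interface SchemaHashInput {
    schemaSource: string;
    configSource?: string;
}

// One place that derives the cache-namespace key, so every producer agrees.
function sketchSchemaFileHash(input: SchemaHashInput): string {
    const hash = createHash("sha256");
    hash.update(input.schemaSource);
    // Truthiness check: undefined and "" are treated identically, so two
    // callers cannot produce different keys for "no config".
    if (input.configSource) {
        hash.update(input.configSource);
    }
    return hash.digest("base64");
}
```

Keeping the truthiness check inside the helper is what makes the key byte-identical across callers: no call site can decide on its own whether an empty config counts.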
Promote the crypto-strong webview nonce generator to a new @typeagent/core/webview subpath. typeagent-studio's webviewKit re-exports it, and vscode-shell's chatViewProvider now uses it in place of a weak Math.random getNonce. CSP/HTML stay per-package (the chat UI needs 'unsafe-inline' styles; studio uses a style nonce) - only the nonce is shared. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
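For illustration, a crypto-strong nonce can be produced with Node's `crypto.randomBytes` instead of `Math.random`. This is a sketch under assumptions: `createNonceSketch` and `scriptCspFor` are hypothetical names, and the 16-byte length is an arbitrary choice, not necessarily what `@typeagent/core/webview` uses.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: random bytes from the OS CSPRNG, base64-encoded.
function createNonceSketch(byteLength: number = 16): string {
    return randomBytes(byteLength).toString("base64");
}

// Each package still builds its own CSP; only the nonce is shared.
function scriptCspFor(nonce: string): string {
    return `default-src 'none'; script-src 'nonce-${nonce}';`;
}
```

A fresh nonce should be generated for every HTML render, since a reused or predictable nonce defeats the purpose of the CSP.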
Add a single createWebSocketRpcChannel to @typeagent/websocket-utils/rpcChannel (built on agent-rpc createChannelAdapter). The studio-service server and the studio extension client both re-export it, deleting two identical hand-rolled WebSocket-to-RpcChannel adapters. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract resolveAgentName into a dependency-free sandbox/agentRef module. inMemorySandboxManager's default loader now uses it instead of a private deriveAgentName subset (which mishandled packages/agents/<name> paths); repoAgentLoader re-exports it so its API and tests are unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the leftover 'Hello (skeleton)' placeholder command (and its activation event) and rename the connect command from the misleading 'Connect Event Log to studio service' to 'Connect to Studio service', since it connects every channel-backed view, not just the Event Log. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tate to the Impact Report The report now opens with a differences-only filter (equal rows hidden) so a replay surfaces regressions instead of burying them under unchanged rows. Clickable per-status chips show live counts and toggle each status in place; a note reports how many rows are hidden, and an all-equal run shows a positive 'no differences' message instead of an empty table. Before any replay runs, a first-run empty state explains the A/B compare and how to start one. Filter and empty-state logic live as pure helpers in the view model (unit-tested); the client only does DOM wiring. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
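The filter logic described in this commit can be sketched as pure helpers. The status names follow the report's terms; `ReportRow`, `differencesOnlyFilter`, `countByStatus`, `toggleStatus`, and `applyFilter` are hypothetical names chosen for illustration. A later commit in this PR changes the default to show all statuses, so the differences-only default here reflects this commit only.

```typescript
// Row shape is an assumption; only the status terms come from the report.
type RowStatus = "equal" | "changed" | "new-match" | "lost-match";

interface ReportRow {
    utterance: string;
    status: RowStatus;
}

const ALL_STATUSES: RowStatus[] = ["equal", "changed", "new-match", "lost-match"];

// Differences-only: every status except "equal" starts visible.
function differencesOnlyFilter(): Set<RowStatus> {
    return new Set(ALL_STATUSES.filter((s) => s !== "equal"));
}

// Live counts for the per-status chips.
function countByStatus(rows: ReportRow[]): Record<RowStatus, number> {
    const counts: Record<RowStatus, number> = {
        equal: 0,
        changed: 0,
        "new-match": 0,
        "lost-match": 0,
    };
    for (const row of rows) {
        counts[row.status] += 1;
    }
    return counts;
}

// Toggle one chip, returning a new set so the caller's state stays immutable.
function toggleStatus(filter: Set<RowStatus>, status: RowStatus): Set<RowStatus> {
    const next = new Set(filter);
    if (!next.delete(status)) {
        next.add(status);
    }
    return next;
}

// Visible rows, the hidden-row count for the note, and the all-equal flag
// that drives the positive "no differences" message.
function applyFilter(rows: ReportRow[], filter: Set<RowStatus>) {
    const visible = rows.filter((r) => filter.has(r.status));
    return {
        visible,
        hiddenCount: rows.length - visible.length,
        allEqual: rows.length > 0 && rows.every((r) => r.status === "equal"),
    };
}
```

Because none of these helpers touch the DOM, they can be unit-tested directly, which matches the split the commit describes between the view model and the client wiring.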
…ake empty chips inert The single crammed detail string is broken into discrete columns (Utterance, Status, Base (A), Compare (B), Latency) so each side's resolution and the latency pair are read at a glance instead of parsed out of one cell. A status filter chip with a zero count is now disabled rather than clickable, since there is nothing to filter on. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both the Studio service connection and the vscode-shell agent-server bridge previously hand-rolled their own reconnect delays (an array clamped at its last value, and a linear ramp). Extract a single exponential-with-cap strategy into @typeagent/websocket-utils/backoff and wire both call sites to it, keeping each surface's local UI (Studio's single-flight retry timer and fencing; the shell's countdown ribbon and reconnectStatus broadcast). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
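An exponential-with-cap strategy can be sketched as a pure function of the attempt number. The `BackoffOptions` shape, the function name, and the default values below are assumptions for illustration, not the actual `@typeagent/websocket-utils/backoff` API.

```typescript
// Hypothetical options; the real module's defaults are not stated in the PR.
interface BackoffOptions {
    baseMs: number; // delay for the first retry
    factor: number; // multiplier applied per attempt
    maxMs: number; // upper bound on any single delay
}

const DEFAULT_BACKOFF: BackoffOptions = { baseMs: 1000, factor: 2, maxMs: 30000 };

// attempt is zero-based: 0 is the first retry after a disconnect.
function backoffDelayMs(attempt: number, options: BackoffOptions = DEFAULT_BACKOFF): number {
    const exponent = Math.max(0, Math.floor(attempt));
    const raw = options.baseMs * Math.pow(options.factor, exponent);
    // Math.min also handles overflow to Infinity for very large attempts.
    return Math.min(options.maxMs, raw);
}
```

With these assumed defaults the delays run 1000, 2000, 4000, 8000, 16000 ms and then stay at 30000 ms. Keeping the function pure leaves each surface free to own its timer and UI, as the commit describes.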
The four VS Code tree views (Event Log, Collisions, Sandboxes, Corpora) each re-declared the same change emitter, connected flag + setConnected gate, refresh, and descriptor-to-TreeItem field mapping. Hoist those into BaseStudioTreeProvider<TNode>; each provider now supplies only what differs (getChildren, collapsible state, icon/command decoration) and overrides dispose to tear down its own subscription before calling super. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The event-log, replay, and corpus presenters each hand-rolled the same collapse-whitespace + cap-with-ellipsis routine. Extract collapseAndTruncate into textFormatting and delegate to it, so the truncation behaviour is defined once. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
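A collapse-whitespace plus cap-with-ellipsis routine might look like the sketch below. The name `collapseAndTruncate` comes from the commit; the signature and the exact ellipsis handling are assumptions.

```typescript
// Assumed signature: maxLength counts the ellipsis character as well.
function collapseAndTruncate(text: string, maxLength: number): string {
    // Collapse every whitespace run (including newlines) into one space.
    const collapsed = text.replace(/\s+/g, " ").trim();
    if (collapsed.length <= maxLength) {
        return collapsed;
    }
    if (maxLength <= 0) {
        return "";
    }
    // Reserve one character for the ellipsis and avoid a dangling space.
    return collapsed.slice(0, maxLength - 1).trimEnd() + "…";
}
```

For example, `collapseAndTruncate("hello   world\nfoo", 8)` yields `"hello w…"`, which is exactly eight characters long.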
The feedback-badge test used rating negative, which is not a member of FeedbackRating (up | down), so the file failed tsc type-checking even though esbuild/tsx run it regardless. Use down. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rework the Impact Report webview layout and styling, and add an All filter pill that shows every result and is selected by default after a run. Also share the collapse-and-truncate text helper in the replay view model. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement the Impact Report's version and agent selection as native VS Code QuickPicks driven entirely by the extension host (the webview never shells out to git — the security boundary). The host enumerates refs with two parallel git calls (one log for HEAD + current branch + recent commits, one for-each-ref for branches and tags), caches them per panel, and lazily appends remote-tracking branches on request. Users can also type a free-form commit SHA, tag, or relative ref (e.g. HEAD~3), validated on the host. Add row drill-in (U4a): clicking a changed/new/lost row opens a detail pane with a unified A/B JSON diff of the action. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Launch the Impact Report from a graph icon on each Corpora agent row instead of a single global panel + in-webview agent picker. Panels are now keyed per agent (WebviewKitPanel instanceKey), so reports for different agents open as separate tabs that can sit side-by-side; the agent is shown read-only in the action bar and in the tab title. The webview protocol drops the agent picker round-trip (init now carries the fixed agent + availability) and the host uses its authoritative agent for the replay. Also lands the native VS Code visual redesign of the report (codicon font wired through esbuild, action bar, native list/detail styling). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… indicator Collapse the four repeated 'not connected' welcome blocks into concise, button-less messages that defer to the status bar, and reframe the status bar's disconnected state as 'reconnecting' to reflect the existing auto-retry. In the Impact Report webview, drop the manual reconnect button for a single connection pill that mirrors the shared service connection and re-inits automatically on each reconnect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ect countdown Replace the per-view 'connecting' welcome blocks with a single global signal: the status bar item now shows a live 'reconnecting in Ns...' countdown, and each sidebar tree holds VS Code's native loading bar until the service connects (shared whenConnected gate in BaseStudioTreeProvider). Once connected, views render their real rows or empty-state placeholder rows. Removes the manual reconnect affordances and the duplicated welcome text. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…he Impact Report Each table cell now carries a hover description: status terms (equal/changed/new match/lost match), per-side resolution tokens (hit, miss, needs-explanation, llm-resolved, etc.), and the A/B latency pair. The latency column is left-aligned. Equal rows are now clickable too: they highlight as selected and close any open A/B diff pane (selection is tracked separately from the open detail so equal rows can read as selected with no pane). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icons for status Sandbox, Corpora, Event Log, and Collisions rows now build a structured, vscode-free TooltipModel that BaseStudioTreeProvider renders into a MarkdownString hover card (bold labels, code-styled hashes/paths/ids, an optional affordance hint). Status is carried by row icons rather than duplicated text: sandbox agents drop the redundant health word (keeping the schema fingerprint), corpus entries show a coloured thumbs-up/down icon instead of a feedback text badge, and collision exemplar rows drop their redundant description. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Free-form and webview-supplied git refs were passed to git as positional arguments ahead of the `--` path separator. git still parses a leading-dash token as an option there, so a ref like `--output=<path>` was honored and could make `git log`/`git show` write a file. Insert `--end-of-options` before the ref in resolveRef, resolveVersionProvenance, and the replay grammar resolver's `git show <ref>:<path>` so a leading-dash ref is always treated as a revision. Add a regression test asserting the guard precedes the ref. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
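The guard amounts to placing `--end-of-options` (available in Git 2.24 and later) ahead of any untrusted ref in the argument vector. The builders below are a hypothetical sketch; their names and the exact `git log` flags are assumptions, not the PR's resolveRef implementation.

```typescript
// Hypothetical builder: argv for `git log` on an untrusted ref.
function buildGitLogArgs(ref: string, path: string): string[] {
    // After --end-of-options git stops option parsing, so a ref such as
    // "--output=/tmp/x" is treated as a (non-existent) revision, not a flag.
    return ["log", "-1", "--format=%H", "--end-of-options", ref, "--", path];
}

// Hypothetical builder: argv for `git show <ref>:<path>`.
function buildGitShowArgs(ref: string, path: string): string[] {
    return ["show", "--end-of-options", `${ref}:${path}`];
}
```

Passing these arrays to a non-shell spawn (for example `child_process.execFile("git", args)`) keeps the ref a single argv entry. A regression test can then assert only on the ordering, without running git at all.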
resolveAgentName derived the agent's package name from a hard-coded `packages/agents/` marker, so a reference under a configured (non-default) agent root whose path ends in e.g. `<root>/<agent>/src` would resolve to the wrong segment. Accept the configured agentRoots and, when the reference sits under one, take the first segment after the longest matching root; the marker remains a fallback for references that don't match a known root. repoAgentLoader now passes its resolved roots through. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
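The longest-matching-root rule can be sketched by comparing path segments. This is an illustrative reimplementation under assumptions: `resolveAgentNameSketch` and its signature are hypothetical, and the real resolveAgentName may normalize paths differently.

```typescript
// Split on either separator and drop empty segments (leading/trailing slashes).
function toSegments(path: string): string[] {
    return path.split(/[\\/]/).filter((s) => s.length > 0);
}

function isPrefix(prefix: string[], full: string[]): boolean {
    return prefix.length <= full.length && prefix.every((seg, i) => seg === full[i]);
}

// Hypothetical signature: returns undefined when no name can be derived.
function resolveAgentNameSketch(reference: string, agentRoots: string[]): string | undefined {
    const ref = toSegments(reference);
    // Pick the longest configured root that contains the reference.
    let bestLength = 0;
    for (const root of agentRoots) {
        const rootSegs = toSegments(root);
        if (rootSegs.length > bestLength && isPrefix(rootSegs, ref)) {
            bestLength = rootSegs.length;
        }
    }
    if (bestLength > 0) {
        return ref[bestLength]; // first segment after the root
    }
    // Fallback: the default packages/agents/<name> marker.
    for (let i = 0; i + 2 < ref.length; i++) {
        if (ref[i] === "packages" && ref[i + 1] === "agents") {
            return ref[i + 2];
        }
    }
    return undefined;
}
```

Choosing the longest root matters when roots nest: with roots `/repo/custom` and `/repo/custom/agents`, a reference `/repo/custom/agents/timer/src` should resolve to `timer`, not `agents`.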
CodeQL flagged `/\/+$/` in resolveAgentName as a polynomial regex that can run slowly on a root string with many trailing slashes. Replace it with a linear character-walk helper (stripTrailingSlashes), mirroring the loop already used for the reference itself. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
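A linear character walk of the kind described could look like this. The name stripTrailingSlashes is from the commit; the body is a sketch and may differ from the PR's helper.

```typescript
// Walk back from the end once; no regex backtracking is involved.
function stripTrailingSlashes(value: string): string {
    let end = value.length;
    while (end > 0 && value.charCodeAt(end - 1) === 47 /* "/" */) {
        end--;
    }
    return value.slice(0, end);
}
```

Each character is inspected at most once and a single `slice` is made at the end, so the cost stays proportional to the number of trailing slashes.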
- createWebSocketRpcChannel: decode each `ws` RawData shape (string, Buffer, ArrayBuffer, Buffer[]) explicitly instead of a bare toString(), which would garble ArrayBuffer/Buffer[] frames and silently drop valid RPC messages.
- Impact Report: fix two stale comments: the default filter shows all statuses (not differences-only), and the action bar has a connection indicator with auto-reconnect (not a manual reconnect button).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
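To illustrate why a bare `toString()` is not enough: on an ArrayBuffer it returns `"[object ArrayBuffer]"`, and on an array it joins the elements with commas. The sketch below decodes each shape explicitly. It uses `Uint8Array` in place of Node's `Buffer` (which is a `Uint8Array` subclass) so the example has no dependencies; `RawFrame` and `decodeFrame` are hypothetical names, not the PR's code.

```typescript
// Stand-in for the ws RawData union, plus string for text frames.
type RawFrame = string | Uint8Array | ArrayBuffer | Uint8Array[];

const utf8 = new TextDecoder("utf-8");

function decodeFrame(data: RawFrame): string {
    if (typeof data === "string") {
        return data;
    }
    if (Array.isArray(data)) {
        // Fragmented message: join the chunks first so a multi-byte
        // character split across two chunks still decodes correctly.
        const total = data.reduce((sum, chunk) => sum + chunk.byteLength, 0);
        const joined = new Uint8Array(total);
        let offset = 0;
        for (const chunk of data) {
            joined.set(chunk, offset);
            offset += chunk.byteLength;
        }
        return utf8.decode(joined);
    }
    if (data instanceof ArrayBuffer) {
        return utf8.decode(new Uint8Array(data));
    }
    return utf8.decode(data);
}
```

Joining before decoding is the important design choice for the array case: decoding chunk by chunk and concatenating the strings would corrupt any character whose bytes straddle a chunk boundary.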
…ache)
Make the replay's deterministic dispatch model explicit instead of implicitly mixing two paths that never coexist in a real dispatcher config.
Level A (core gating): add StudioReplayMode = "nfa-grammar" | "completionBased-cache" and mode? on StudioReplayRequest (default nfa-grammar); gate the live construction-cache consult behind completionBased-cache. Default runs are now grammar-only and A/B-symmetric.
Level B (plumbing + UI + test):
- Thread mode through the webview run message; parseWebviewMessage validates it (unknown/missing -> nfa-grammar). The host forwards it into replayCorpus; the channel/RPC layers ride on StudioReplayRequest unchanged.
- Add a two-state Grammar/Cache toggle to the Impact Report action bar with explanatory tooltips, persisted/restored with the version selection.
- Add an injectable resolveConstructionCache seam to CreateStudioRuntimeOptions so the gating is testable without a live cache.
- Tests: 3 runtime gating tests over a scaffolded agent (cache skipped in nfa-grammar / when mode omitted, consulted in completionBased-cache); update webview protocol run-message expectations to include mode.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
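The validation and gating rules in this commit reduce to two small functions. `StudioReplayMode` and its two values come from the commit message; `parseReplayMode` and `shouldConsultConstructionCache` are hypothetical names for illustration.

```typescript
type StudioReplayMode = "nfa-grammar" | "completionBased-cache";

// Webview input is untrusted: anything unknown or missing becomes the default.
function parseReplayMode(value: unknown): StudioReplayMode {
    return value === "completionBased-cache" ? "completionBased-cache" : "nfa-grammar";
}

// The live construction cache is consulted only in completionBased-cache mode.
function shouldConsultConstructionCache(mode?: StudioReplayMode): boolean {
    return (mode ?? "nfa-grammar") === "completionBased-cache";
}
```

Defaulting to `nfa-grammar` on any unrecognized value means a malformed or outdated webview message can only produce the grammar-only, A/B-symmetric run.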
Replay now optionally runs each agent's real validateWildcardMatch over the working-tree side's wildcard grammar matches -- the dispatcher's only beyond-grammar determinism (getValidatedMatches) -- dropping a match the agent rejects, exactly as the dispatcher does. Working-tree side only; git-ref side stays grammar-only. Opt-in (default off) and fail-open: only an explicit false rejects; a missing/throwing validator or unloadable agent accepts and records a diagnostic, so replay never fabricates a lost match from infrastructure noise.
Core (@typeagent/core, dependency-light):
- replay/wildcardValidator.ts: createWildcardMatchValidator with an INJECTED ReplayAppAgentLoader (no dispatcher dep), empty-object stub SessionContext, allowlist (timer/list/player), fail-open diagnostics, dispose->unloadAppAgent.
- grammarResolver.ts: selectValidatedMatchAction walks the ranked MatchResult[] (wildcardCharCount===0 auto-accepts, first accepted wins, all-rejected => needs-explanation); exposes wildcardValidationApplied. Working-tree only.
- studioRuntimeCore.ts: StudioReplayRequest.validateWildcards opt-in, resolveWildcardValidator injectable seam, StudioReplayResult.wildcardValidation summary, validator build/dispose lifecycle. Re-exported the API from @typeagent/core/runtime so the host can wire a real loader.
Host (studio-service):
- wildcardValidation.ts: a default loader that lazily imports default-agent-provider (marked external in the service bundle), so it resolves on the in-repo dev path and cleanly no-ops in the packaged .vsix. Gated on the allowlist so the import only fires for an allowlisted wildcard match.
Webview (typeagent-studio):
- A lit Validate toggle (mirrors the mode toggle), threaded through the run protocol message, with an honest sub-bar indicator: wildcard-validated, or a warning-tinted unavailable/skipped/no-validator/degraded from the diagnostics.
Tests: core +40 (wildcardValidator + grammarReplayResolver L4a), studio protocol expectations. core 189, studio 184, studio-service 21 green.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
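The selection rule (zero-wildcard matches auto-accept, first accepted wins, all rejected means needs-explanation, and anything other than an explicit `false` fails open) can be sketched as follows. This is a simplified, synchronous illustration: `RankedMatch`, `Selection`, and `selectValidatedMatchSketch` are hypothetical, and the real validators are presumably asynchronous.

```typescript
// Hypothetical, reduced match shape; only wildcardCharCount is from the commit.
interface RankedMatch {
    action: string;
    wildcardCharCount: number;
}

type WildcardValidator = (match: RankedMatch) => boolean | undefined;

interface Selection {
    action?: string;
    status: "matched" | "needs-explanation" | "no-match";
    diagnostics: string[];
}

function selectValidatedMatchSketch(matches: RankedMatch[], validate?: WildcardValidator): Selection {
    const diagnostics: string[] = [];
    if (matches.length === 0) {
        return { status: "no-match", diagnostics };
    }
    for (const match of matches) {
        // No wildcard text to validate: accept without calling the agent.
        if (match.wildcardCharCount === 0) {
            return { action: match.action, status: "matched", diagnostics };
        }
        let verdict: boolean | undefined;
        if (validate === undefined) {
            diagnostics.push("no-validator");
        } else {
            try {
                verdict = validate(match);
            } catch {
                diagnostics.push("errored");
            }
        }
        // Fail open: only an explicit false rejects the match.
        if (verdict !== false) {
            return { action: match.action, status: "matched", diagnostics };
        }
    }
    return { status: "needs-explanation", diagnostics };
}
```

Recording a diagnostic instead of rejecting keeps infrastructure failures visible in the report without turning them into spurious lost matches.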
A live integration smoke against the real built agent modules (via
default-agent-provider) surfaced that player's validateWildcardMatch throws
during replay: player runs execMode "separate", so the loader returns an RPC
proxy whose child process reconstructs its own SessionContext where agentContext
is undefined (our agentContext:{} stub is never serialized across the wire).
player then throws "Cannot read properties of undefined (reading 'spotify')"
before reaching its no-client self-degrade guard. Even without the throw it can
only ever return true without a live Spotify client, so it adds no fidelity --
it would just fail open with an `errored` diagnostic and a misleading "degraded"
indicator.
timer and list, by contrast, ignore the context and validate correctly over RPC
(timer even produces real rejects), so the allowlist is now timer/list only.
Updated the stub-context rationale, the webview toggle tooltip, and the tests.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the capability matrix, long-pole narrative, and next-slice list to reflect shipped two-mode replay (grammar/cache) and L4a live wildcard validation, with L4b (build-from-ref) deferred to P-7 (post-Gate-C). The live priority is now player corpus capture -> Gate C measurement. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Surface a per-side fidelity matrix in the Impact Report so a replay is honest about which deterministic layers actually ran (grammar, schema enrichment, construction cache, wildcard validation) and what building from a git ref would add, instead of over-claiming fidelity. Core: add FidelityLayer/SideFidelity types + sideFidelity on StudioReplayResult, populated by a pure, unit-tested deriveSideFidelity() in both the success and aborted paths. View model: pure toFidelityMatrix() with a build-from-ref preflight hint. Client+CSS: a collapsible fidelity panel with per-layer status icons and hover reasons. Also break a build cycle introduced by L4a: studio-service no longer declares default-agent-provider (it is aggregated by studio-agent, which depends on studio-service). The optional, externally-bundled dynamic import now resolves from the bundling extension (typeagent-studio), which owns the dependency; the specifier is indirected so tsc does not statically resolve it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
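A pure derivation of the per-side matrix might look like the sketch below. It covers only the three layers whose rules are stated elsewhere in this PR (grammar on both sides; construction cache and wildcard validation on the working-tree side only, each behind its option). Schema enrichment is omitted because its rule is not described. All type shapes and the name `deriveSideFidelitySketch` are assumptions, not the actual deriveSideFidelity.

```typescript
type StudioReplayMode = "nfa-grammar" | "completionBased-cache";
type SketchLayer = "grammar" | "construction-cache" | "wildcard-validation";
type LayerStatus = "ran" | "not-run";

// Hypothetical inputs describing one side (A or B) of a replay.
interface SideInputs {
    isWorkingTree: boolean; // false for a git-ref side
    mode: StudioReplayMode;
    validateWildcards: boolean;
}

type SideFidelitySketch = Record<SketchLayer, { status: LayerStatus; reason: string }>;

function deriveSideFidelitySketch(side: SideInputs): SideFidelitySketch {
    const cacheRan = side.isWorkingTree && side.mode === "completionBased-cache";
    const validationRan = side.isWorkingTree && side.validateWildcards;
    return {
        grammar: { status: "ran", reason: "compiled grammar is matched on every side" },
        "construction-cache": {
            status: cacheRan ? "ran" : "not-run",
            reason: cacheRan
                ? "cache mode on the working-tree side"
                : "requires cache mode and the working-tree side",
        },
        "wildcard-validation": {
            status: validationRan ? "ran" : "not-run",
            reason: validationRan
                ? "validation enabled on the working-tree side"
                : "requires the opt-in and the working-tree side",
        },
    };
}
```

Because the function depends only on its inputs, the same derivation can serve both the success and aborted paths and be unit-tested without running a replay.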
…cleanup
Impact Report:
- Move the Cache and Wildcard validation toggles behind a gear 'Replay options' popover with VS Code-themed on/off pill switches (closes on outside-click/Escape, disabled mid-run). Add the settings-gear codicon glyph to the curated set so the icon-only button renders.
- Relabel the old Grammar/Cache mode toggle to a Cache on/off switch and Validate to Wildcard validation, with concise one-line tooltips.
- Remove the verbose source-side preflight hint from the fidelity matrix (and its CSS + tests) per UI 'avoid heavy text' guidance.
Corpora view:
- Add a FileSystemWatcher on **/*.utterances.jsonl so the tree refreshes on corpus create/change/delete instead of only on manual refresh (fixes the seed-then-save stale tree).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Advances the "find a regression" journey by making replay's fidelity explicit and
the Impact Report controls easier to reason about — without adding heavy UI text.
Critical changes
Two-mode replay. The Impact Report can replay in `nfa-grammar` mode (both sides
match the compiled grammar only, kept A/B-symmetric) or `completionBased-cache`
mode (the working-tree side consults the live construction cache first, the way
the dispatcher would).
Wildcard validation. An opt-in pass that runs the agent's real
`validateWildcardMatch` over wildcard matches and drops the ones it rejects, for
allowlisted agents only (player removed from the allowlist).
Fidelity transparency matrix. Each run reports a per-side matrix of which
deterministic layers (grammar, schema enrichment, construction cache, wildcard
validation, dispatch) actually ran on A vs B, with a status icon and a hover
reason — so the report is honest about exactly what it exercised. Backed by a new
`deriveSideFidelity` in the core runtime.
Replay-options popover. The Cache and Wildcard-validation toggles move behind
a gear "Replay options" popover with VS Code-themed on/off pill switches (closes
on outside-click/Escape, disabled mid-run). Tooltips trimmed to one concise line
each.
Corpora auto-refresh. A `FileSystemWatcher` on `**/*.utterances.jsonl` refreshes
the Corpora tree on corpus create/change/delete, fixing the stale tree
after "Seed in-repo corpus…" + save.
Supporting changes
Breaks a build cycle (studio-service → default-agent-provider → studio-agent →
studio-service) by moving the `default-agent-provider` dependency to the bundling
extension and resolving it via a tolerant, variable-indirected dynamic `import()`
(kept external in the bundle).
Updates `STATUS.md` to reflect two-mode replay and wildcard validation.
Testing
Studio suite green (187 tests); core fidelity spec added (16 tests); full
`pnpm build` clean; typecheck and Prettier clean.
22 files changed (+2,492 / −40), 6 commits.