
TypeAgent Studio: replay fidelity rungs + Impact Report controls #2569

Draft
TalZaccai wants to merge 30 commits into main from dev/talzacc/typeagent_studio_part5

Conversation

@TalZaccai

Contributor

Stacked PR. This builds on the previous Studio PR (#2553)
and targets that branch. Merge the previous PR first, then this one rebases
onto main cleanly. The diff below is scoped to commits unique to this branch.

Advances the "find a regression" journey by making replay's fidelity explicit and
the Impact Report controls easier to reason about — without adding heavy UI text.

Critical changes

Two-mode replay. The Impact Report can replay in nfa-grammar mode (both
sides match the compiled grammar only, kept A/B-symmetric) or
completionBased-cache mode (the working-tree side consults the live
construction cache first, the way the dispatcher would).

Wildcard validation. An opt-in pass that runs the
agent's real validateWildcardMatch over wildcard matches and drops the ones it
rejects, for allowlisted agents only (player removed from the allowlist).

Fidelity transparency matrix. Each run reports a per-side matrix of which
deterministic layers (grammar, schema enrichment, construction cache, wildcard
validation, dispatch) actually ran on A vs B, with a status icon and a hover
reason — so the report is honest about exactly what it exercised. Backed by a new
deriveSideFidelity in the core runtime.

Replay-options popover. The Cache and Wildcard-validation toggles move behind
a gear "Replay options" popover with VS Code-themed on/off pill switches (closes
on outside-click/Escape, disabled mid-run). Tooltips trimmed to one concise line
each.

Corpora auto-refresh. A FileSystemWatcher on **/*.utterances.jsonl
refreshes the Corpora tree on corpus create/change/delete, fixing the stale tree
after "Seed in-repo corpus…" + save.

Supporting changes

  • Broke a build-graph cycle (studio-service → default-agent-provider → studio-agent → studio-service) by moving the default-agent-provider
    dependency to the bundling extension and resolving it via a tolerant,
    variable-indirected dynamic import() (kept external in the bundle).
  • Refreshed Studio STATUS.md to reflect two-mode replay and wildcard
    validation.

Testing

Studio suite green (187 tests); core fidelity spec added (16 tests); full
pnpm build clean; typecheck and Prettier clean.

22 files changed (+2,492 / −40), 6 commits.

TalZaccai and others added 30 commits June 22, 2026 16:20
…atcher, replay

Add computeActionSchemaFileHash to agent-cache (schemaInfoProvider) as the
single source of truth for the schema cache-namespace hash. The dispatcher's
actionSchemaFileCache and typeagent-core's constructionCacheResolver now both
delegate to it, removing two copies of the sha256/base64 hashing logic (and a
copy-pasted helper in the dispatcher test). Aligns the config-truthiness check
so the namespace key is byte-identical across producers.
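
As a rough illustration of why a shared helper and an aligned truthiness check matter, the sketch below hashes a schema source with sha256 and encodes it as base64. The function name, its parameters, and the exact bytes fed to the hash are assumptions made for this example; the real computeActionSchemaFileHash inputs are not shown in this PR text.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for the shared hash helper; inputs are assumed.
export function schemaNamespaceHash(
    schemaSource: string,
    config?: string,
): string {
    const hash = createHash("sha256");
    hash.update(schemaSource);
    // Truthiness check: undefined and "" both mean "no config", so every
    // producer feeds the same bytes and gets the same namespace key.
    if (config) {
        hash.update(config);
    }
    return hash.digest("base64");
}
```

If one producer treated an empty config string as present and another skipped it, the two would have to agree on the bytes by accident; routing both through one function removes that risk.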

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Promote the crypto-strong webview nonce generator to a new
@typeagent/core/webview subpath. typeagent-studio's webviewKit re-exports it,
and vscode-shell's chatViewProvider now uses it in place of a weak Math.random
getNonce. CSP/HTML stay per-package (the chat UI needs 'unsafe-inline' styles;
studio uses a style nonce) - only the nonce is shared.
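
A minimal sketch of what a crypto-strong nonce generator could look like, assuming Node's built-in crypto module. The names createNonce and buildCsp are hypothetical, and the CSP string is only an example of per-package policy; neither is taken from @typeagent/core/webview.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: draws from the OS CSPRNG instead of Math.random.
export function createNonce(byteLength = 16): string {
    return randomBytes(byteLength).toString("base64");
}

// Example policy only: each package would still build its own CSP and
// reuse just the nonce, as the commit message describes.
export function buildCsp(nonce: string): string {
    return [
        "default-src 'none'",
        `script-src 'nonce-${nonce}'`,
        `style-src 'nonce-${nonce}'`,
    ].join("; ");
}
```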

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a single createWebSocketRpcChannel to @typeagent/websocket-utils/rpcChannel
(built on agent-rpc createChannelAdapter). The studio-service server and the
studio extension client both re-export it, deleting two identical hand-rolled
WebSocket-to-RpcChannel adapters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract resolveAgentName into a dependency-free sandbox/agentRef module.
inMemorySandboxManager's default loader now uses it instead of a private
deriveAgentName subset (which mishandled packages/agents/<name> paths);
repoAgentLoader re-exports it so its API and tests are unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the leftover 'Hello (skeleton)' placeholder command (and its
activation event) and rename the connect command from the misleading
'Connect Event Log to studio service' to 'Connect to Studio service',
since it connects every channel-backed view, not just the Event Log.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tate to the Impact Report

The report now opens with a differences-only filter (equal rows hidden)
so a replay surfaces regressions instead of burying them under unchanged
rows. Clickable per-status chips show live counts and toggle each status
in place; a note reports how many rows are hidden, and an all-equal run
shows a positive 'no differences' message instead of an empty table.

Before any replay runs, a first-run empty state explains the A/B compare
and how to start one. Filter and empty-state logic live as pure helpers
in the view model (unit-tested); the client only does DOM wiring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ake empty chips inert

The single crammed detail string is broken into discrete columns
(Utterance, Status, Base (A), Compare (B), Latency) so each side's
resolution and the latency pair are read at a glance instead of parsed
out of one cell. A status filter chip with a zero count is now disabled
rather than clickable, since there is nothing to filter on.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both the Studio service connection and the vscode-shell agent-server
bridge previously hand-rolled their own reconnect delays (an array clamped
at its last value, and a linear ramp). Extract a single exponential-with-cap
strategy into @typeagent/websocket-utils/backoff and wire both call sites to
it, keeping each surface's local UI (Studio's single-flight retry timer and
fencing; the shell's countdown ribbon and reconnectStatus broadcast).
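
The exponential-with-cap strategy can be sketched as a pure function of the attempt number. The option names and default values below are assumptions for illustration, not the actual @typeagent/websocket-utils/backoff API.

```typescript
export interface BackoffOptions {
    baseMs: number; // delay for the first retry (attempt 0)
    factor: number; // multiplier applied per attempt
    maxMs: number; // upper bound on any single delay
}

// Assumed defaults; the real values are not stated in the PR.
const DEFAULTS: BackoffOptions = { baseMs: 500, factor: 2, maxMs: 30_000 };

export function backoffDelay(
    attempt: number,
    opts: BackoffOptions = DEFAULTS,
): number {
    const n = Math.max(0, Math.floor(attempt));
    // Grows geometrically, then clamps at maxMs (also handles Infinity).
    return Math.min(opts.maxMs, opts.baseMs * Math.pow(opts.factor, n));
}
```

Keeping the function pure leaves timers, countdown UI, and single-flight fencing to each call site, which matches the split described above.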

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The four VS Code tree views (Event Log, Collisions, Sandboxes, Corpora)
each re-declared the same change emitter, connected flag + setConnected
gate, refresh, and descriptor-to-TreeItem field mapping. Hoist those into
BaseStudioTreeProvider<TNode>; each provider now supplies only what differs
(getChildren, collapsible state, icon/command decoration) and overrides
dispose to tear down its own subscription before calling super.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The event-log, replay, and corpus presenters each hand-rolled the same
collapse-whitespace + cap-with-ellipsis routine. Extract collapseAndTruncate
into textFormatting and delegate to it, so the truncation behaviour is
defined once.
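
For reference, a collapse-whitespace plus cap-with-ellipsis routine might look like the sketch below. The signature and the choice to count the ellipsis toward the cap are assumptions; the real collapseAndTruncate in textFormatting may differ.

```typescript
// Assumed signature: maxLength is the total output budget, ellipsis included.
export function collapseAndTruncate(text: string, maxLength: number): string {
    if (maxLength <= 0) {
        return "";
    }
    // Collapse every whitespace run (spaces, tabs, newlines) to one space.
    const collapsed = text.replace(/\s+/g, " ").trim();
    if (collapsed.length <= maxLength) {
        return collapsed;
    }
    // Reserve one character for the ellipsis.
    return collapsed.slice(0, maxLength - 1).trimEnd() + "…";
}
```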

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The feedback-badge test used rating negative, which is not a member of
FeedbackRating (up | down), so the file failed tsc type-checking even though
esbuild/tsx run it regardless. Use down.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rework the Impact Report webview layout and styling, and add an All
filter pill that shows every result and is selected by default after a
run. Also share the collapse-and-truncate text helper in the replay
view model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement the Impact Report's version and agent selection as native VS Code
QuickPicks driven entirely by the extension host (the webview never shells
out to git — the security boundary). The host enumerates refs with two
parallel git calls (one log for HEAD + current branch + recent commits, one
for-each-ref for branches and tags), caches them per panel, and lazily
appends remote-tracking branches on request. Users can also type a free-form
commit SHA, tag, or relative ref (e.g. HEAD~3), validated on the host.

Add row drill-in (U4a): clicking a changed/new/lost row opens a detail pane
with a unified A/B JSON diff of the action.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Launch the Impact Report from a graph icon on each Corpora agent row instead of a single global panel + in-webview agent picker. Panels are now keyed per agent (WebviewKitPanel instanceKey), so reports for different agents open as separate tabs that can sit side-by-side; the agent is shown read-only in the action bar and in the tab title. The webview protocol drops the agent picker round-trip (init now carries the fixed agent + availability) and the host uses its authoritative agent for the replay.

Also lands the native VS Code visual redesign of the report (codicon font wired through esbuild, action bar, native list/detail styling).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… indicator

Collapse the four repeated 'not connected' welcome blocks into concise,
button-less messages that defer to the status bar, and reframe the status
bar's disconnected state as 'reconnecting' to reflect the existing
auto-retry. In the Impact Report webview, drop the manual reconnect
button for a single connection pill that mirrors the shared service
connection and re-inits automatically on each reconnect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ect countdown

Replace the per-view 'connecting' welcome blocks with a single global signal:
the status bar item now shows a live 'reconnecting in Ns...' countdown, and
each sidebar tree holds VS Code's native loading bar until the service connects
(shared whenConnected gate in BaseStudioTreeProvider). Once connected, views
render their real rows or empty-state placeholder rows. Removes the manual
reconnect affordances and the duplicated welcome text.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…he Impact Report

Each table cell now carries a hover description: status terms (equal/changed/
new match/lost match), per-side resolution tokens (hit, miss, needs-explanation,
llm-resolved, etc.), and the A/B latency pair. The latency column is left-aligned.
Equal rows are now clickable too — they highlight as selected and close any open
A/B diff pane (selection is tracked separately from the open detail so equal
rows can read as selected with no pane).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icons for status

Sandbox, Corpora, Event Log, and Collisions rows now build a structured,
vscode-free TooltipModel that BaseStudioTreeProvider renders into a
MarkdownString hover card (bold labels, code-styled hashes/paths/ids, an
optional affordance hint). Status is carried by row icons rather than
duplicated text: sandbox agents drop the redundant health word (keeping the
schema fingerprint), corpus entries show a coloured thumbs-up/down icon
instead of a feedback text badge, and collision exemplar rows drop their
redundant description.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Free-form and webview-supplied git refs were passed to git as positional
arguments ahead of the `--` path separator. git still parses a leading-dash
token as an option there, so a ref like `--output=<path>` was honored and
could make `git log`/`git show` write a file. Insert `--end-of-options`
before the ref in resolveRef, resolveVersionProvenance, and the replay
grammar resolver's `git show <ref>:<path>` so a leading-dash ref is always
treated as a revision. Add a regression test asserting the guard precedes the
ref.
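
The guard can be pictured as argument-vector construction in which `--end-of-options` always sits immediately before the untrusted ref. The helper names and the specific `git log` flags below are illustrative assumptions, not the PR's resolveRef or grammar-resolver code.

```typescript
// Hypothetical builders; only the position of --end-of-options matters here.
export function gitLogArgs(ref: string, path: string): string[] {
    // Everything after --end-of-options is parsed as a revision, never an option.
    return ["log", "-1", "--format=%H", "--end-of-options", ref, "--", path];
}

export function gitShowArgs(ref: string, path: string): string[] {
    return ["show", "--end-of-options", `${ref}:${path}`];
}
```

A regression check in the same spirit as the one the commit mentions would assert that the guard's index is exactly one less than the index of the token carrying the ref.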

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
resolveAgentName derived the agent's package name from a hard-coded
`packages/agents/` marker, so a reference under a configured (non-default)
agent root whose path ends in e.g. `<root>/<agent>/src` would resolve to the
wrong segment. Accept the configured agentRoots and, when the reference sits
under one, take the first segment after the longest matching root; the marker
remains a fallback for references that don't match a known root. repoAgentLoader
now passes its resolved roots through.
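
The longest-matching-root rule with a marker fallback can be sketched as follows. This is a simplified illustration under assumed types (string paths, forward or back slashes); the real resolveAgentName signature and edge-case handling are not shown in the PR text.

```typescript
const FALLBACK_MARKER = "packages/agents/";

// Normalize separators and drop trailing slashes with a linear walk.
function normalizePath(p: string): string {
    const s = p.split("\\").join("/");
    let end = s.length;
    while (end > 1 && s[end - 1] === "/") {
        end--;
    }
    return s.slice(0, end);
}

export function resolveAgentName(
    reference: string,
    agentRoots: readonly string[] = [],
): string | undefined {
    const ref = normalizePath(reference);
    let best: string | undefined;
    for (const root of agentRoots.map(normalizePath)) {
        const under = ref.startsWith(root + "/");
        if (under && (best === undefined || root.length > best.length)) {
            best = root; // prefer the most specific configured root
        }
    }
    if (best !== undefined) {
        return ref.slice(best.length + 1).split("/")[0] || undefined;
    }
    // Fallback: first segment after the default marker, if present.
    const i = ref.lastIndexOf(FALLBACK_MARKER);
    if (i < 0) {
        return undefined;
    }
    return ref.slice(i + FALLBACK_MARKER.length).split("/")[0] || undefined;
}
```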

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CodeQL flagged `/\/+$/` in resolveAgentName as a polynomial regex that can
run slowly on a root string with many trailing slashes. Replace it with a
linear character-walk helper (stripTrailingSlashes), mirroring the loop already
used for the reference itself.
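
A linear character walk of the kind described could look like this sketch. The helper name comes from the commit message, but the body is an assumption about its behaviour (for example, whether a string made only of slashes collapses to empty).

```typescript
// Scans backward once from the end, so cost is linear in the input length
// and there is no regex backtracking to worry about.
export function stripTrailingSlashes(value: string): string {
    let end = value.length;
    while (end > 0 && value.charCodeAt(end - 1) === 47 /* "/" */) {
        end--;
    }
    return end === value.length ? value : value.slice(0, end);
}
```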

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- createWebSocketRpcChannel: decode each `ws` RawData shape (string, Buffer,
  ArrayBuffer, Buffer[]) explicitly instead of a bare toString(), which would
  garble ArrayBuffer/Buffer[] frames and silently drop valid RPC messages.
- Impact Report: fix two stale comments — the default filter shows all statuses
  (not differences-only), and the action bar has a connection indicator with
  auto-reconnect (not a manual reconnect button).
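
To show why a bare toString() is not enough, here is a sketch that decodes each frame shape named above. The RawDataLike alias and decodeFrame name are assumptions for this example; only the four shapes themselves come from the commit message.

```typescript
import { Buffer } from "node:buffer";

// Local alias covering the shapes listed in the commit message.
type RawDataLike = string | Buffer | ArrayBuffer | Buffer[];

export function decodeFrame(data: RawDataLike): string {
    if (typeof data === "string") {
        return data;
    }
    if (Array.isArray(data)) {
        // Fragmented frame: join the chunks before decoding. A bare
        // toString() on the array would join them with commas instead.
        return Buffer.concat(data).toString("utf8");
    }
    if (Buffer.isBuffer(data)) {
        return data.toString("utf8");
    }
    // ArrayBuffer: a bare toString() would yield "[object ArrayBuffer]".
    return Buffer.from(data).toString("utf8");
}
```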

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ache)

Make the replay's deterministic dispatch model explicit instead of implicitly
mixing two paths that never coexist in a real dispatcher config.

Level A (core gating): add StudioReplayMode = "nfa-grammar" |
"completionBased-cache" and mode? on StudioReplayRequest (default
nfa-grammar); gate the live construction-cache consult behind
completionBased-cache. Default runs are now grammar-only and A/B-symmetric.

Level B (plumbing + UI + test):
- Thread mode through the webview run message; parseWebviewMessage validates
  it (unknown/missing -> nfa-grammar). The host forwards it into replayCorpus;
  the channel/RPC layers ride on StudioReplayRequest unchanged.
- Add a two-state Grammar/Cache toggle to the Impact Report action bar with
  explanatory tooltips, persisted/restored with the version selection.
- Add an injectable resolveConstructionCache seam to CreateStudioRuntimeOptions
  so the gating is testable without a live cache.
- Tests: 3 runtime gating tests over a scaffolded agent (cache skipped in
  nfa-grammar / when mode omitted, consulted in completionBased-cache); update
  webview protocol run-message expectations to include mode.
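
The gating described in Level A and Level B can be sketched in a few lines. The StudioReplayMode union is quoted from the commit message; parseReplayMode and shouldConsultCache are hypothetical names used only to illustrate the "unknown/missing -> nfa-grammar" rule and the cache gate.

```typescript
export type StudioReplayMode = "nfa-grammar" | "completionBased-cache";

// Untrusted webview input: anything unrecognized falls back to the default.
export function parseReplayMode(value: unknown): StudioReplayMode {
    return value === "completionBased-cache"
        ? "completionBased-cache"
        : "nfa-grammar";
}

// The live construction cache is consulted only in the opt-in mode, so an
// omitted mode behaves exactly like nfa-grammar.
export function shouldConsultCache(mode?: StudioReplayMode): boolean {
    return mode === "completionBased-cache";
}
```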

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replay now optionally runs each agent's real validateWildcardMatch over the
working-tree side's wildcard grammar matches -- the dispatcher's only
beyond-grammar determinism (getValidatedMatches) -- dropping a match the agent
rejects, exactly as the dispatcher does. Working-tree side only; git-ref side
stays grammar-only. Opt-in (default off) and fail-open: only an explicit false
rejects; a missing/throwing validator or unloadable agent accepts and records a
diagnostic, so replay never fabricates a lost match from infrastructure noise.

Core (@typeagent/core, dependency-light):
- replay/wildcardValidator.ts: createWildcardMatchValidator with an INJECTED
  ReplayAppAgentLoader (no dispatcher dep), empty-object stub SessionContext,
  allowlist (timer/list/player), fail-open diagnostics, dispose->unloadAppAgent.
- grammarResolver.ts: selectValidatedMatchAction walks the ranked MatchResult[]
  (wildcardCharCount===0 auto-accepts, first accepted wins, all-rejected =>
  needs-explanation); exposes wildcardValidationApplied. Working-tree only.
- studioRuntimeCore.ts: StudioReplayRequest.validateWildcards opt-in,
  resolveWildcardValidator injectable seam, StudioReplayResult.wildcardValidation
  summary, validator build/dispose lifecycle. Re-exported the API from
  @typeagent/core/runtime so the host can wire a real loader.
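
The selection walk and the fail-open rule can be illustrated together. The sketch below is synchronous and uses simplified, assumed types (MatchLike, Validator, Selection); the real selectValidatedMatchAction works on MatchResult[] and an agent validator whose signatures are not shown here. Treating an empty match list as needs-explanation is also an assumption.

```typescript
interface MatchLike {
    action: string;
    wildcardCharCount: number;
}

// May return false (reject), true/undefined (accept), or throw.
type Validator = (match: MatchLike) => boolean | undefined;

interface Selection {
    action?: string;
    needsExplanation: boolean;
    diagnostics: string[];
}

export function selectValidatedMatch(
    ranked: readonly MatchLike[],
    validate?: Validator,
): Selection {
    const diagnostics: string[] = [];
    for (const match of ranked) {
        if (match.wildcardCharCount === 0) {
            // No wildcard text to validate: auto-accept.
            return { action: match.action, needsExplanation: false, diagnostics };
        }
        let verdict: boolean | undefined;
        if (validate === undefined) {
            diagnostics.push("no-validator");
        } else {
            try {
                verdict = validate(match);
            } catch {
                diagnostics.push("errored"); // fail open, but record it
            }
        }
        // Only an explicit false rejects; everything else accepts.
        if (verdict !== false) {
            return { action: match.action, needsExplanation: false, diagnostics };
        }
    }
    return { needsExplanation: true, diagnostics };
}
```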

Host (studio-service):
- wildcardValidation.ts: a default loader that lazily imports
  default-agent-provider (marked external in the service bundle), so it resolves
  on the in-repo dev path and cleanly no-ops in the packaged .vsix. Gated on the
  allowlist so the import only fires for an allowlisted wildcard match.

Webview (typeagent-studio):
- A lit Validate toggle (mirrors the mode toggle), threaded through the run
  protocol message, with an honest sub-bar indicator: wildcard-validated, or a
  warning-tinted unavailable/skipped/no-validator/degraded from the diagnostics.

Tests: core +40 (wildcardValidator + grammarReplayResolver L4a), studio protocol
expectations. core 189, studio 184, studio-service 21 green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A live integration smoke against the real built agent modules (via
default-agent-provider) surfaced that player's validateWildcardMatch throws
during replay: player runs execMode "separate", so the loader returns an RPC
proxy whose child process reconstructs its own SessionContext where agentContext
is undefined (our agentContext:{} stub is never serialized across the wire).
player then throws "Cannot read properties of undefined (reading 'spotify')"
before reaching its no-client self-degrade guard. Even without the throw it can
only ever return true without a live Spotify client, so it adds no fidelity --
it would just fail open with an `errored` diagnostic and a misleading "degraded"
indicator.

timer and list, by contrast, ignore the context and validate correctly over RPC
(timer even produces real rejects), so the allowlist is now timer/list only.
Updated the stub-context rationale, the webview toggle tooltip, and the tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the capability matrix, long-pole narrative, and next-slice list to reflect shipped two-mode replay (grammar/cache) and L4a live wildcard validation, with L4b (build-from-ref) deferred to P-7 (post-Gate-C). The live priority is now player corpus capture -> Gate C measurement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Surface a per-side fidelity matrix in the Impact Report so a replay is honest about which deterministic layers actually ran (grammar, schema enrichment, construction cache, wildcard validation) and what building from a git ref would add, instead of over-claiming fidelity.

Core: add FidelityLayer/SideFidelity types + sideFidelity on StudioReplayResult, populated by a pure, unit-tested deriveSideFidelity() in both the success and aborted paths. View model: pure toFidelityMatrix() with a build-from-ref preflight hint. Client+CSS: a collapsible fidelity panel with per-layer status icons and hover reasons.
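
A pure derivation of this kind might look like the sketch below. It models only the three layers whose conditions are stated in this PR (grammar on both sides, construction cache and wildcard validation on the working-tree side only); schema enrichment and dispatch are left out because their rules are not described here. All type and function names are hypothetical and do not reflect the actual FidelityLayer or SideFidelity shapes.

```typescript
type ReplayMode = "nfa-grammar" | "completionBased-cache";
type SideSource = "working-tree" | "git-ref";
type Layer = "grammar" | "construction-cache" | "wildcard-validation";

interface LayerFidelity {
    layer: Layer;
    ran: boolean;
    reason: string; // shown as the hover text in this sketch
}

interface SideInputs {
    source: SideSource;
    mode: ReplayMode;
    validateWildcards: boolean;
}

export function deriveSideFidelitySketch(side: SideInputs): LayerFidelity[] {
    const workingTree = side.source === "working-tree";
    const cacheRan = workingTree && side.mode === "completionBased-cache";
    const validationRan = workingTree && side.validateWildcards;
    const refOnly = "git-ref side stays grammar-only";
    return [
        { layer: "grammar", ran: true, reason: "compiled grammar matched" },
        {
            layer: "construction-cache",
            ran: cacheRan,
            reason: cacheRan ? "cache mode on" : workingTree ? "cache mode off" : refOnly,
        },
        {
            layer: "wildcard-validation",
            ran: validationRan,
            reason: validationRan ? "opt-in enabled" : workingTree ? "opt-in disabled" : refOnly,
        },
    ];
}
```

Because the function takes plain inputs and returns plain data, the same derivation can run in both the success and aborted paths and be unit-tested without a live replay.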

Also break a build cycle introduced by L4a: studio-service no longer declares default-agent-provider (it is aggregated by studio-agent, which depends on studio-service). The optional, externally-bundled dynamic import now resolves from the bundling extension (typeagent-studio), which owns the dependency; the specifier is indirected so tsc does not statically resolve it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cleanup

Impact Report:

- Move the Cache and Wildcard validation toggles behind a gear 'Replay options' popover with VS Code-themed on/off pill switches (closes on outside-click/Escape, disabled mid-run). Add the settings-gear codicon glyph to the curated set so the icon-only button renders.

- Relabel the old Grammar/Cache mode toggle to a Cache on/off switch and Validate to Wildcard validation, with concise one-line tooltips.

- Remove the verbose source-side preflight hint from the fidelity matrix (and its CSS + tests) per UI 'avoid heavy text' guidance.

Corpora view:

- Add a FileSystemWatcher on **/*.utterances.jsonl so the tree refreshes on corpus create/change/delete instead of only on manual refresh (fixes the seed-then-save stale tree).
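
The watcher wiring can be sketched against a minimal structural interface so it stays testable outside an extension host. In a real extension the `workspace` argument would presumably be `vscode.workspace`, whose createFileSystemWatcher returns a watcher with onDidCreate, onDidChange, and onDidDelete events; the local interface names and the watchCorpora function are assumptions for this example.

```typescript
interface DisposableLike {
    dispose(): void;
}
type EventLike = (listener: (e: unknown) => void) => DisposableLike;

interface WatcherLike extends DisposableLike {
    onDidCreate: EventLike;
    onDidChange: EventLike;
    onDidDelete: EventLike;
}

interface WorkspaceLike {
    createFileSystemWatcher(globPattern: string): WatcherLike;
}

// Refresh the Corpora tree on any create/change/delete of a corpus file.
export function watchCorpora(
    workspace: WorkspaceLike,
    refresh: () => void,
): DisposableLike {
    const watcher = workspace.createFileSystemWatcher("**/*.utterances.jsonl");
    const subscriptions = [
        watcher.onDidCreate(() => refresh()),
        watcher.onDidChange(() => refresh()),
        watcher.onDidDelete(() => refresh()),
    ];
    return {
        dispose() {
            for (const s of subscriptions) {
                s.dispose();
            }
            watcher.dispose();
        },
    };
}
```

Returning a single disposable lets the tree provider tear the watcher down alongside its other subscriptions.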

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>