examples/llm_server: log per-turn serving metrics by mergennachin · Pull Request #20455 · pytorch/executorch

mergennachin · 2026-06-23T20:12:12Z

Add lightweight turn-level observability to the example LLM server harness. The C++ worker now reports warm-resume accounting, prefill/decode duration, total worker time, and token rates in the terminal JSONL done message. The Python WorkerClient and SessionRuntime propagate those fields, and ServingChat emits one structured INFO log line per completed turn.

This is intended for local-agent demos and evals where we need to understand whether warm resume is firing and where time is spent, without changing the OpenAI-compatible response payload.

Add lightweight turn-level observability to the example LLM server harness. The C++ worker now reports warm-resume accounting, prefill/decode duration, total worker time, and token rates in the terminal JSONL done message. The Python WorkerClient and SessionRuntime propagate those fields, and ServingChat emits one structured INFO log line per completed turn. This is intended for local-agent demos and evals where we need to understand whether warm resume is firing and where time is spent, without changing the OpenAI-compatible response payload.

pytorch-bot · 2026-06-23T20:12:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20455

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 5 New Failures, 3 Unrelated Failures

As of commit 6454ad0 with merge base 58447b2 ():

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 270f2c0d938a862ae33fec290fbcc229f0cb58ca8a2163ce9a156d66f9d235ec /exec failed with exit code 1
pull / test-llama-runner-qnn-linux (fp32, qnn_8a8w, qnn) / linux-job (gh)
RuntimeError: Command docker exec -t 2b98477c9ade97f25c0a0cd2b2c1ef9516b98a2860ba4085c013928c01697362 /exec failed with exit code 92
pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
RuntimeError: Command docker exec -t 0afc9527303bb8c324d610f61fd7c9ac0d58ceab49ded10e07f0b550cc514ce1 /exec failed with exit code 139
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 4b0102b52e16d3bbd42b33149cf82a9cb368f76a91bca52d44ad939826d60c79 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 65218ee49c2adf0aaf45df78f9f0ab43fd54b5926502890d0ebb3690ba9dfc6f /exec failed with exit code 1

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
pull / unittest / macos / macos-job (gh) (similar failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mergennachin · 2026-06-23T20:12:49Z

@claude Review

github-actions · 2026-06-23T20:16:43Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Adds per-turn serving metrics to the example LLM server, propagating worker-reported timing/token-rate stats through the Python harness and emitting one log line per completed turn for observability during local demos/evals.

Changes:

Extend worker “done” JSONL payload with prefill/decode/total timings and token/sec rates (C++).
Parse/propagate the new metrics through WorkerClient and SessionRuntime (Python) with updated tests.
Emit per-turn generation metrics via a structured-ish INFO log line in ServingChat.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
examples/llm_server/python/worker_client.py	Adds new timing/token-rate fields to `WorkerStats` and parses them from done messages.
examples/llm_server/python/tests/test_worker_client.py	Extends warm-resume parsing test to cover new metrics fields.
examples/llm_server/python/tests/test_session_runtime.py	Verifies `SessionRuntime` propagates the new stats fields end-to-end.
examples/llm_server/python/session_runtime.py	Adds new fields to `GenStats` and copies them in `stats_cb`.
examples/llm_server/python/serving_chat.py	Adds per-turn INFO logging for the newly propagated metrics.
examples/llm_server/cpp/worker_loop.h	Measures and reports prefill/decode/total timings and token rates in worker done JSONL.

Comments suppressed due to low confidence (1)

examples/llm_server/cpp/worker_loop.h:1

The prefill_ms measurement currently includes time between prefill_tokens(...) returning and decode_start being captured (e.g., resident-token bookkeeping). If the intent is “prefill duration”, decode_start (or a separate prefill_end) should be captured immediately after prefill_tokens(...) completes, and any post-prefill bookkeeping time should either be accounted separately or included explicitly in total_ms only.

/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  const double prefill_ms =
+      std::chrono::duration<double, std::milli>(decode_start - prefill_start)
+          .count();


+        session_id: Optional[str], stats: GenStats, finish: str
+    ) -> None:
+        logger.info(
+            "llm_turn_stats session_id=%s reason=%s prompt_tokens=%d "


+    def _log_generation_stats(
+        session_id: Optional[str], stats: GenStats, finish: str
+    ) -> None:
+        logger.info(


claude · 2026-06-23T20:34:19Z

Claude finished @mergennachin's task in 1m 1s —— View job

Code Review: per-turn serving metrics

I reviewed the diff (git show HEAD) and the surrounding code in worker_loop.h and serving_chat.py.

Overall this is a clean, low-risk change: new fields default to 0.0 everywhere and are parsed defensively with .get(..., 0.0) / getattr(..., 0.0), so it's backward-compatible with older workers and adds no risk to the OpenAI-compatible payload. Tests cover both the WorkerClient parse and the SessionRuntime propagation. Nice.

A few observations:

1. `prefill_tok_s` denominator vs. numerator semantics — worth a comment

worker_loop.h:321 computes prefill_tok_s from prefilled (tokens actually prefilled, excluding warm-resumed/reused). That's the right choice for measuring prefill work throughput, but note it diverges from prompt_tokens. On a warm-resume turn with few prefilled tokens, prefill_tok_s can look surprisingly low/high because the denominator (prefill_ms) still includes fixed per-call overhead. Since the whole point is observing warm-resume behavior, a one-line comment clarifying "rate over prefilled (not reused) tokens" would prevent misreading the metric. Minor.

2. `prefill_ms` boundary (Copilot's point) — negligible here

decode_start (worker_loop.h:220) is captured after st.resident_token_ids = ids; (line 219), so prefill_ms technically includes that vector assignment. In practice this is a single vector copy of bounded size — orders of magnitude below prefill_tokens(), so I wouldn't move it. Not worth changing.

3. Log key `reason=` is ambiguous (agree with Copilot)

serving_chat.py:191 — the key reason=%s is bound to stats.session_reset_reason, but the same line ends with finish=%s. Two "reasons" in one line is easy to misread when grepping logs. Suggest reset_reason= or session_reset_reason=.

Fix this →

4. INFO log volume (agree with Copilot, with nuance)

One INFO line per turn (serving_chat.py:506, 650) is reasonable for a demo/eval harness — that's the stated intent. But for chatty multi-turn local agents it can crowd out other logs. Since both the streaming and non-streaming paths log, consider logger.debug by default, or gating behind an env var (e.g. LLM_SERVER_LOG_TURN_STATS). Your call given the example-harness scope.

5. Confirmed: logged exactly once per turn

_log_generation_stats is invoked on the non-streaming path (serving_chat.py:506) and the streaming path (serving_chat.py:650), which are mutually exclusive — so no double-logging. Good.

Minor

Consider asserting one of the new fields appears in the emitted log line in a test, since the formatting (%.1f of float fields) is the one bit not directly exercised. Optional.

No blocking issues — items 3 and 4 are the only ones I'd act on, and both are nits.
· llm-server-turn-stats-logging

Copilot AI review requested due to automatic review settings June 23, 2026 20:12

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 23, 2026

mergennachin temporarily deployed to cadence June 23, 2026 20:12 — with GitHub Actions Inactive

mergennachin requested review from Gasoonjia, digantdesai and metascroy June 23, 2026 20:13

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Copilot started reviewing on behalf of mergennachin June 23, 2026 20:33 View session

metascroy approved these changes Jun 24, 2026

View reviewed changes

mergennachin merged commit d9f3278 into main Jun 24, 2026
196 of 206 checks passed

mergennachin deleted the llm-server-turn-stats-logging branch June 24, 2026 02:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

examples/llm_server: log per-turn serving metrics#20455

examples/llm_server: log per-turn serving metrics#20455
mergennachin merged 1 commit into
mainfrom
llm-server-turn-stats-logging

mergennachin commented Jun 23, 2026

Uh oh!

pytorch-bot Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

mergennachin commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

claude Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mergennachin commented Jun 23, 2026

Uh oh!

pytorch-bot Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20455

❗ 1 Active SEVs

❌ 5 New Failures, 3 Unrelated Failures

Uh oh!

mergennachin commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

claude Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: per-turn serving metrics

1. prefill_tok_s denominator vs. numerator semantics — worth a comment

2. prefill_ms boundary (Copilot's point) — negligible here

3. Log key reason= is ambiguous (agree with Copilot)

4. INFO log volume (agree with Copilot, with nuance)

5. Confirmed: logged exactly once per turn

Minor

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pytorch-bot Bot commented Jun 23, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 23, 2026 •

edited

Loading

1. `prefill_tok_s` denominator vs. numerator semantics — worth a comment

2. `prefill_ms` boundary (Copilot's point) — negligible here

3. Log key `reason=` is ambiguous (agree with Copilot)