Skip to content

examples/llm_server: log per-turn serving metrics#20455

Merged
mergennachin merged 1 commit into
mainfrom
llm-server-turn-stats-logging
Jun 24, 2026
Merged

examples/llm_server: log per-turn serving metrics#20455
mergennachin merged 1 commit into
mainfrom
llm-server-turn-stats-logging

Conversation

@mergennachin

Copy link
Copy Markdown
Contributor

Add lightweight turn-level observability to the example LLM server harness. The C++ worker now reports warm-resume accounting, prefill/decode duration, total worker time, and token rates in the terminal JSONL done message. The Python WorkerClient and SessionRuntime propagate those fields, and ServingChat emits one structured INFO log line per completed turn.

This is intended for local-agent demos and evals where we need to understand whether warm resume is firing and where time is spent, without changing the OpenAI-compatible response payload.

Add lightweight turn-level observability to the example LLM server harness. The C++ worker now reports warm-resume accounting, prefill/decode duration, total worker time, and token rates in the terminal JSONL done message. The Python WorkerClient and SessionRuntime propagate those fields, and ServingChat emits one structured INFO log line per completed turn.

This is intended for local-agent demos and evals where we need to understand whether warm resume is firing and where time is spent, without changing the OpenAI-compatible response payload.
Copilot AI review requested due to automatic review settings June 23, 2026 20:12
@pytorch-bot

pytorch-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20455

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 5 New Failures, 3 Unrelated Failures

As of commit 6454ad0 with merge base 58447b2 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 23, 2026
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Adds per-turn serving metrics to the example LLM server, propagating worker-reported timing/token-rate stats through the Python harness and emitting one log line per completed turn for observability during local demos/evals.

Changes:

  • Extend worker “done” JSONL payload with prefill/decode/total timings and token/sec rates (C++).
  • Parse/propagate the new metrics through WorkerClient and SessionRuntime (Python) with updated tests.
  • Emit per-turn generation metrics via a structured-ish INFO log line in ServingChat.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
examples/llm_server/python/worker_client.py Adds new timing/token-rate fields to WorkerStats and parses them from done messages.
examples/llm_server/python/tests/test_worker_client.py Extends warm-resume parsing test to cover new metrics fields.
examples/llm_server/python/tests/test_session_runtime.py Verifies SessionRuntime propagates the new stats fields end-to-end.
examples/llm_server/python/session_runtime.py Adds new fields to GenStats and copies them in stats_cb.
examples/llm_server/python/serving_chat.py Adds per-turn INFO logging for the newly propagated metrics.
examples/llm_server/cpp/worker_loop.h Measures and reports prefill/decode/total timings and token rates in worker done JSONL.
Comments suppressed due to low confidence (1)

examples/llm_server/cpp/worker_loop.h:1

  • The prefill_ms measurement currently includes time between prefill_tokens(...) returning and decode_start being captured (e.g., resident-token bookkeeping). If the intent is “prefill duration”, decode_start (or a separate prefill_end) should be captured immediately after prefill_tokens(...) completes, and any post-prefill bookkeeping time should either be accounted separately or included explicitly in total_ms only.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +309 to +311
const double prefill_ms =
std::chrono::duration<double, std::milli>(decode_start - prefill_start)
.count();
session_id: Optional[str], stats: GenStats, finish: str
) -> None:
logger.info(
"llm_turn_stats session_id=%s reason=%s prompt_tokens=%d "
def _log_generation_stats(
session_id: Optional[str], stats: GenStats, finish: str
) -> None:
logger.info(
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 1m 1s —— View job


Code Review: per-turn serving metrics

I reviewed the diff (git show HEAD) and the surrounding code in worker_loop.h and serving_chat.py.

Overall this is a clean, low-risk change: new fields default to 0.0 everywhere and are parsed defensively with .get(..., 0.0) / getattr(..., 0.0), so it's backward-compatible with older workers and adds no risk to the OpenAI-compatible payload. Tests cover both the WorkerClient parse and the SessionRuntime propagation. Nice.

A few observations:

1. prefill_tok_s denominator vs. numerator semantics — worth a comment

worker_loop.h:321 computes prefill_tok_s from prefilled (tokens actually prefilled, excluding warm-resumed/reused). That's the right choice for measuring prefill work throughput, but note it diverges from prompt_tokens. On a warm-resume turn with few prefilled tokens, prefill_tok_s can look surprisingly low/high because the denominator (prefill_ms) still includes fixed per-call overhead. Since the whole point is observing warm-resume behavior, a one-line comment clarifying "rate over prefilled (not reused) tokens" would prevent misreading the metric. Minor.

2. prefill_ms boundary (Copilot's point) — negligible here

decode_start (worker_loop.h:220) is captured after st.resident_token_ids = ids; (line 219), so prefill_ms technically includes that vector assignment. In practice this is a single vector copy of bounded size — orders of magnitude below prefill_tokens(), so I wouldn't move it. Not worth changing.

3. Log key reason= is ambiguous (agree with Copilot)

serving_chat.py:191 — the key reason=%s is bound to stats.session_reset_reason, but the same line ends with finish=%s. Two "reasons" in one line is easy to misread when grepping logs. Suggest reset_reason= or session_reset_reason=.

Fix this →

4. INFO log volume (agree with Copilot, with nuance)

One INFO line per turn (serving_chat.py:506, 650) is reasonable for a demo/eval harness — that's the stated intent. But for chatty multi-turn local agents it can crowd out other logs. Since both the streaming and non-streaming paths log, consider logger.debug by default, or gating behind an env var (e.g. LLM_SERVER_LOG_TURN_STATS). Your call given the example-harness scope.

5. Confirmed: logged exactly once per turn

_log_generation_stats is invoked on the non-streaming path (serving_chat.py:506) and the streaming path (serving_chat.py:650), which are mutually exclusive — so no double-logging. Good.

Minor

  • Consider asserting one of the new fields appears in the emitted log line in a test, since the formatting (%.1f of float fields) is the one bit not directly exercised. Optional.

No blocking issues — items 3 and 4 are the only ones I'd act on, and both are nits.
· llm-server-turn-stats-logging

@mergennachin mergennachin merged commit d9f3278 into main Jun 24, 2026
196 of 206 checks passed
@mergennachin mergennachin deleted the llm-server-turn-stats-logging branch June 24, 2026 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants