examples/llm_server: log per-turn serving metrics#20455
Conversation
Add lightweight turn-level observability to the example LLM server harness. The C++ worker now reports warm-resume accounting, prefill/decode duration, total worker time, and token rates in the terminal JSONL done message. The Python WorkerClient and SessionRuntime propagate those fields, and ServingChat emits one structured INFO log line per completed turn. This is intended for local-agent demos and evals where we need to understand whether warm resume is firing and where time is spent, without changing the OpenAI-compatible response payload.
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20455
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 5 New Failures, 3 Unrelated FailuresAs of commit 6454ad0 with merge base 58447b2 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@claude Review |
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.
Adds per-turn serving metrics to the example LLM server, propagating worker-reported timing/token-rate stats through the Python harness and emitting one log line per completed turn for observability during local demos/evals.
Changes:
- Extend worker “done” JSONL payload with prefill/decode/total timings and token/sec rates (C++).
- Parse/propagate the new metrics through
WorkerClientandSessionRuntime(Python) with updated tests. - Emit per-turn generation metrics via a structured-ish INFO log line in
ServingChat.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/llm_server/python/worker_client.py | Adds new timing/token-rate fields to WorkerStats and parses them from done messages. |
| examples/llm_server/python/tests/test_worker_client.py | Extends warm-resume parsing test to cover new metrics fields. |
| examples/llm_server/python/tests/test_session_runtime.py | Verifies SessionRuntime propagates the new stats fields end-to-end. |
| examples/llm_server/python/session_runtime.py | Adds new fields to GenStats and copies them in stats_cb. |
| examples/llm_server/python/serving_chat.py | Adds per-turn INFO logging for the newly propagated metrics. |
| examples/llm_server/cpp/worker_loop.h | Measures and reports prefill/decode/total timings and token rates in worker done JSONL. |
Comments suppressed due to low confidence (1)
examples/llm_server/cpp/worker_loop.h:1
- The
prefill_msmeasurement currently includes time betweenprefill_tokens(...)returning anddecode_startbeing captured (e.g., resident-token bookkeeping). If the intent is “prefill duration”,decode_start(or a separateprefill_end) should be captured immediately afterprefill_tokens(...)completes, and any post-prefill bookkeeping time should either be accounted separately or included explicitly intotal_msonly.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const double prefill_ms = | ||
| std::chrono::duration<double, std::milli>(decode_start - prefill_start) | ||
| .count(); |
| session_id: Optional[str], stats: GenStats, finish: str | ||
| ) -> None: | ||
| logger.info( | ||
| "llm_turn_stats session_id=%s reason=%s prompt_tokens=%d " |
| def _log_generation_stats( | ||
| session_id: Optional[str], stats: GenStats, finish: str | ||
| ) -> None: | ||
| logger.info( |
|
Claude finished @mergennachin's task in 1m 1s —— View job Code Review: per-turn serving metricsI reviewed the diff ( Overall this is a clean, low-risk change: new fields default to A few observations: 1.
|
Add lightweight turn-level observability to the example LLM server harness. The C++ worker now reports warm-resume accounting, prefill/decode duration, total worker time, and token rates in the terminal JSONL done message. The Python WorkerClient and SessionRuntime propagate those fields, and ServingChat emits one structured INFO log line per completed turn.
This is intended for local-agent demos and evals where we need to understand whether warm resume is firing and where time is spent, without changing the OpenAI-compatible response payload.