Pre-warm cold OptimizedDirectorySourceLocator caches before spawning workers #5846

Open
SanderMuller wants to merge 1 commit into phpstan:2.2.x from SanderMuller:perf/parallel-symbol-scanning


@SanderMuller SanderMuller commented Jun 11, 2026

Contributor

What & why

On a cold cache, every parallel worker independently rebuilds the directory source
locators for the analysed paths: instrumenting OptimizedDirectorySourceLocatorFactory
on current 2.2.x shows all 8 workers scanning the analysed directory on a cold run while
the main process never does.

This PR fills the OptimizedDirectorySourceLocator symbol cache for analysed directories
in the main thread before spawning workers, but only for directories that have no
cache entry yet. Directories with an existing entry are skipped with a single cache
read, workers validate those themselves exactly as today, and nothing else of the source
locator is initialized (no identifier is located, no composer processing, no stubbers).
That keeps the main thread as lazy as #5577 made it: fully warm runs spawn no workers and
never reach this code, and incremental runs pay one cache read.
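
The cold-only behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `InMemoryCache`, `prewarmColdDirectories()`, the `odsl-` key prefix and the `$scan` callable are all hypothetical stand-ins for the real cache and scanner.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the symbol cache; not PHPStan's actual cache API.
final class InMemoryCache
{
    /** @var array<string, array<string, string>> */
    private array $entries = [];

    /** Number of cache reads, kept only so the sketch can be checked. */
    public int $reads = 0;

    public function load(string $key): ?array
    {
        $this->reads++;
        return $this->entries[$key] ?? null;
    }

    public function save(string $key, array $value): void
    {
        $this->entries[$key] = $value;
    }
}

/**
 * Scan only the directories that have no cache entry yet.
 *
 * @param list<string> $directories analysed directories
 * @param callable(string): array<string, string> $scan expensive symbol scan (assumed)
 * @return list<string> directories that were actually scanned
 */
function prewarmColdDirectories(InMemoryCache $cache, array $directories, callable $scan): array
{
    $scanned = [];
    foreach ($directories as $directory) {
        $key = 'odsl-' . $directory; // hypothetical cache key
        if ($cache->load($key) !== null) {
            // An entry exists: one read, no validation here; workers validate it.
            continue;
        }
        $cache->save($key, $scan($directory));
        $scanned[] = $directory;
    }
    return $scanned;
}
```

With this shape, a directory that already has an entry costs exactly one read, and a second call over the same directories scans nothing.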

Benchmarks

hyperfine, self-analysis of src/Type at level 8, M4 Pro / PHP 8.5.7, base = ff2647a
(cold = tmpDir wiped per run; incremental = one file modified per run, caches warm):

| scenario | base | PR | delta |
| --- | --- | --- | --- |
| cold, default spawn | 8.602 ± 0.117 s | 8.610 ± 0.121 s | wall ±0, user CPU −2.4% |
| cold, PHPSTAN_PARALLEL_FORK=1 | 8.754 ± 0.348 s | 8.338 ± 0.100 s | wall −4.8%, user CPU −6.7% |
| incremental (1 file), warm | 1.655 ± 0.056 s | 1.650 ± 0.044 s | unchanged |

The cold CPU savings on this corpus are modest because the analysed directory
(src/Type) is small relative to the unanalysed classmap directories, which this PR
deliberately leaves alone. On projects where the analysed paths cover most of the
scanned tree, the deduplicated share grows accordingly.

Tests

make tests (12,711, green), make phpstan, make cs, make lint and
make composer-dependency-analyser all pass. Analysis output is byte-identical. The
result-cache-restore-without-reflection e2e (ThrowingSourceLocator) passes — the
pre-warm never locates an identifier.

🤖 Generated with Claude Code

@SanderMuller SanderMuller force-pushed the perf/parallel-symbol-scanning branch from 32188a1 to aaf28cf on June 11, 2026 08:34

@ondrejmirtes ondrejmirtes left a comment

Member

Please only contribute the file hash algorithm change.

Source locator caches are already prewarmed. Feel free to verify my claims, but the way it works today is that the main thread creates a file on disk for the most expensive locator (OptimizedDirectorySourceLocator) and the child processes only read that, which is pretty fast.

&& function_exists('pcntl_waitpid')
&& !$this->isOpcacheEnabled()
) {
return $this->hashAndFindSymbolsParallel($filesWithCachedHashes, $supportsEnums);

Member

I don't think this is worth it. I don't want to have yet another parallel mechanism hidden inside here. I think the gains from this would be pretty minimal.

$hashes = [];
foreach ($this->allConfigFiles as $file) {
-	$hash = hash_file('sha256', $file);
+	$hash = hash_file('xxh128', $file);

Member

I'd welcome this, but it'd be nice to have a new FileHasher class (akin to FileReader), and we'd have to switch the hash based on PHP_VERSION_ID (xxh128 is available only since 80100).
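
A rough sketch of what such a helper might look like, assuming only what the comment above states: the class name `FileHasher` comes from the suggestion, while the method names, the `sha256` fallback and the error handling are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper; method names and error handling are assumptions.
final class FileHasher
{
    /** Pick the hash algorithm for the given PHP version. */
    public static function algorithm(int $phpVersionId = PHP_VERSION_ID): string
    {
        // xxh128 is available in hash_file() since PHP 8.1 (PHP_VERSION_ID 80100).
        return $phpVersionId >= 80100 ? 'xxh128' : 'sha256';
    }

    /** Hash a file's contents with the algorithm chosen for the running PHP version. */
    public static function hashFile(string $path): string
    {
        $hash = hash_file(self::algorithm(), $path);
        if ($hash === false) {
            throw new RuntimeException(sprintf('Could not hash file %s', $path));
        }
        return $hash;
    }
}
```

One consequence of switching on the version: hashes written under one PHP version will not match hashes computed under another version that selects the other algorithm, so cached hashes would be treated as stale after such a switch.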

@SanderMuller SanderMuller force-pushed the perf/parallel-symbol-scanning branch from aaf28cf to 61c3e2b on June 11, 2026 09:30
@SanderMuller SanderMuller changed the title from "Pre-warm source locator caches and parallelize symbol scanning" to "Pre-warm source locator caches before spawning parallel workers" Jun 11, 2026
@SanderMuller

Contributor Author

I took the invitation to verify seriously — both directions. Three changes to the PR based on what I found:

  1. Hash change removed from this PR: it now lives in #5842 (Use xxh128 instead of sha256 for file content hashing).
  2. The parallel symbol scanning commit is dropped. Standalone hyperfine showed it adds no wall time in spawn mode and makes fork mode unstable (8.752 ± 2.384 s, range 7.2–13.1 s — some interaction between the scan children and forked workers). Not worth it in this form.
  3. What remains is only the pre-warm: parent builds the source locator once before spawning workers.

On the "source locator caches are already prewarmed" claim — I instrumented OptimizedDirectorySourceLocatorFactory::createByDirectory()/createByFiles() with pid logging on current 2.2.x (ff2647a) and ran a cold analyse of src/Type with default settings:

distinct scanning pids: 8        (all children of the phpstan main pid)
scans by the main process: 0
scans per worker: 44 directories + 1 classmap file set
total: 360 scan operations for one cold run

So on a cold cache the main thread does not create the cache file first — all 8 workers race the same scans (hash + symbol extraction of ~6,800 files each, including the autoload-dev tests/ tree) and each writes the cache the others are also computing. What you describe — children only reading a file written earlier — is the steady state from the second run onwards, and on a fully warm run (valid result cache) nobody scans at all, which this PR doesn't change.
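
The instrumentation itself is not shown in this thread. A minimal sketch of that kind of pid logging, assuming a tab-separated log file and two hypothetical helpers (`logScan()` called from the scan entry points, `countScansPerPid()` to summarise afterwards), could look like this:

```php
<?php
declare(strict_types=1);

// Hypothetical tracing helper: append one line per scan with the current pid.
function logScan(string $logFile, string $kind, string $target): void
{
    $line = sprintf("%d\t%s\t%s\n", getmypid(), $kind, $target);
    // FILE_APPEND | LOCK_EX keeps lines from concurrent workers intact.
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}

/**
 * Summarise the log: how many scans each process performed.
 *
 * @return array<int, int> scan count keyed by pid
 */
function countScansPerPid(string $logFile): array
{
    $counts = [];
    $lines = file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
    foreach ($lines as $line) {
        $pid = (int) explode("\t", $line)[0];
        $counts[$pid] = ($counts[$pid] ?? 0) + 1;
    }
    return $counts;
}
```

Counting the distinct keys of the result gives the number of scanning processes, and comparing them against the main pid shows whether the parent scanned at all.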

End-to-end, measured with hyperfine (cold, tmpDir wiped per run, src/Type at level 8, M4 Pro / PHP 8.5.7):

| scenario | base | prewarm | delta |
| --- | --- | --- | --- |
| default settings | 8.694 ± 0.125 s | 9.022 ± 0.358 s | wall +3.8%, user CPU −15% (36.1 → 30.7 s) |
| PHPSTAN_PARALLEL_FORK=1 | 8.524 ± 0.145 s | 7.812 ± 0.108 s | wall −8.4%, user CPU −21% |

So the honest end-to-end story in default mode: no wall-time win (the duplicated scans overlap across workers), but ~15% less total CPU per cold run — which matters for billed CI minutes and contended runners. With forked workers it is a wall win too. If a CPU-only improvement in the default mode doesn't clear the bar for you, I'm fine closing this; the trace above might still be useful as a data point about cold-run behaviour.
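
For reference, the delta column is the relative change of the mean against base. A small sketch of that arithmetic, using the means from the table above (the helper name `relativeDelta()` is illustrative only):

```php
<?php
declare(strict_types=1);

/** Relative change of $new against $base, in percent, rounded to one decimal. */
function relativeDelta(float $base, float $new): float
{
    return round(($new - $base) / $base * 100, 1);
}

$wallSpawn = relativeDelta(8.694, 9.022); // default settings, wall time
$wallFork  = relativeDelta(8.524, 7.812); // PHPSTAN_PARALLEL_FORK=1, wall time
$cpuSpawn  = relativeDelta(36.1, 30.7);   // default settings, user CPU seconds

printf("%+.1f%% %+.1f%% %+.1f%%\n", $wallSpawn, $wallFork, $cpuSpawn);
// prints "+3.8% -8.4% -15.0%"
```

Note that the +3.8% wall difference in default mode is roughly the size of the reported standard deviation of the prewarm run (0.358 s on 9.022 s), so it sits close to the noise level of that measurement.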

@ondrejmirtes

Member

This goes against the latest fixes we did. This will actually perform much worse in some cases because typically the main thread doesn't need to initialize the source locator at all. See #5577.

It'd make sense to still initialize just the OptimizedDirectorySourceLocator, so that child workers save time.

@SanderMuller

Contributor Author

> This goes against the latest fixes we did. This will actually perform much worse in some cases because typically the main thread doesn't need to initialize the source locator at all. See #5577.
>
> It'd make sense to still initialize just the OptimizedDirectorySourceLocator, so that child workers save time.

Will look into that case, thanks for the reference.

Building the source locator eagerly scans the analysed directories and
composer classmap directories and writes the shared file cache. Without
this, every parallel worker redoes the same symbol scan against a cold
cache: measured 8x duplicated scanning CPU at 8 workers, and removing
this pre-warm costs +51% wall / +86% CPU on a 14-core cold run of
src/Type (with PHPSTAN_PARALLEL_FORK=1, forked workers additionally
inherit the warm in-memory state via copy-on-write).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@SanderMuller SanderMuller force-pushed the perf/parallel-symbol-scanning branch from 61c3e2b to 11d3e17 on June 11, 2026 10:03
@SanderMuller SanderMuller changed the title from "Pre-warm source locator caches before spawning parallel workers" to "Pre-warm cold OptimizedDirectorySourceLocator caches before spawning workers" Jun 11, 2026
@SanderMuller

Contributor Author

Thanks — I read #5577 and phpstan/phpstan#14072 properly now, and I see the conflict: my version forced the full buildSourceLocator() in the main thread on every parallel run, which is exactly the eager initialization #5577 removed to speed up bootstrapping (and the main thread typically never needs any of it — the composer processing, stub locators, none of that belongs there).

Reworked as you suggested, with one refinement: the pre-warm now touches only the OptimizedDirectorySourceLocator caches for analysed directories, and only the ones that are cold (prewarmIfCold(): one cache read; if an entry exists, do nothing and let workers validate it as today). So:

  • fully warm runs: unreachable (no workers spawn);
  • incremental runs: one cache read per analysed directory, measured unchanged (1.655 ± 0.056 s base vs 1.650 ± 0.044 s, one-file change on src/Type);
  • cold runs: the parent scans once instead of every worker doing it — user CPU −2.4% in default spawn mode (wall ±0, the worker scans overlapped anyway) and wall −4.8% / CPU −6.7% with PHPSTAN_PARALLEL_FORK=1;
  • nothing else of the aggregate is built and no identifier is located, so the lazy main thread from #5577 (Lazily initialize AggregateSourceLocator to speedup bootstrapping) stays lazy and the ThrowingSourceLocator e2e passes.

The cold numbers on this corpus are modest because src/Type is small next to the unanalysed classmap directories, which I deliberately did not touch — warming those would mean running the composer maker in the main thread, which felt like re-introducing what #5577 removed. If you'd want the classmap directories covered too (they are the bulk of the duplicated cold-run work in this repo: 43 of the 44 scans per worker), that would need a way to enumerate them without building the rest — happy to look at that as a follow-up if you think it's worth it.

@SanderMuller

Contributor Author

Adding make phpstan numbers for the reworked version (hyperfine, two worktrees, M4 Pro / PHP 8.5.7):

| scenario | base | PR |
| --- | --- | --- |
| warm file caches (6 runs) | 19.10 ± 0.26 s | 19.30 ± 0.27 s |
| cold file caches, tmp/ wiped per run (5 runs) | 21.73 ± 0.09 s | 21.74 ± 0.04 s |

Warm: identical — the cold-probe costs nothing measurable. Cold: wall is also identical (the duplicated worker scans overlap in time), but user CPU drops 128.1 s → 123.8 s (−3.4%). So on this machine the spawn-mode benefit is purely CPU; the wall-time win only shows up with PHPSTAN_PARALLEL_FORK=1 (−4.8% on the smaller corpus, table in the description). Your call whether that's worth it.
