
test(integration): de-flake otel pre-job config test #803

Open
cbartz wants to merge 2 commits into main from fix/deflake-otel-pre-job-test

@cbartz cbartz commented Jun 29, 2026

Collaborator

What this PR does

De-flakes test_otel_collector_endpoint_pre_job_installs_config. Previously the test dispatched a quick workflow, waited for it to complete, then SSHed into the runner. It now dispatches the long-running wait workflow and inspects the runner while its job is in progress. The config read is polled to allow for the delay before the pre-job hook writes the file.
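For readers unfamiliar with the pattern, polling a config read can be sketched roughly as below. This is an illustration only, not the code in this PR: `poll_until`, its parameters, and the idea that the read returns `None` until the file exists are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def poll_until(
    read: Callable[[], Optional[str]],
    timeout: float = 60.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call read() until it returns non-empty content or the timeout elapses.

    Hypothetical helper: `read` stands in for "SSH into the runner and
    read the otel config file", returning None while the file is missing.
    """
    deadline = clock() + timeout
    while True:
        content = read()
        if content:
            # The pre-job hook has written the file; stop polling.
            return content
        if clock() >= deadline:
            raise TimeoutError(f"no config content after {timeout} seconds")
        sleep(interval)
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting; a real integration test would more likely reuse whatever wait helper the repository already provides.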

Why we need it

The runner is ephemeral, so its OpenStack VM is torn down on job completion. Reading the otel config after completion races the runner-manager cleanup loop: when cleanup wins, get_single_runner finds zero VMs and the test fails with "found more than one runners or no runners: []". This was observed failing on the 22.04 base while passing on 24.04 in the same run (e.g. PR #802 CI). Keeping the job in progress guarantees the VM is alive during inspection.
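The failure mode can be reproduced in a toy model. The sketch below is not the project's code: `FakeCloud` and `get_single_vm` are hypothetical stand-ins for the OpenStack project and get_single_runner, and the only detail taken from this PR is the error text.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Toy stand-in for the OpenStack project holding runner VMs."""

    vms: list[str] = field(default_factory=list)

    def on_job_completed(self) -> None:
        # Ephemeral runner: cleanup removes the VM once its job is done.
        self.vms.clear()


def get_single_vm(cloud: FakeCloud) -> str:
    """Return the only VM, or fail with the message seen in the flaky run."""
    if len(cloud.vms) != 1:
        raise AssertionError(
            f"found more than one runners or no runners: {cloud.vms}"
        )
    return cloud.vms[0]


cloud = FakeCloud(vms=["runner-vm-1"])

# Job in progress: the VM exists, so inspection is safe.
in_progress_vm = get_single_vm(cloud)

# Job completed and cleanup won the race: the same lookup now fails.
cloud.on_job_completed()
try:
    get_single_vm(cloud)
    lookup_failed = False
except AssertionError as exc:
    lookup_failed = True
    failure_message = str(exc)
```

In the real test the outcome after completion depends on whether cleanup or the SSH read runs first, which is why the failure was intermittent; the model above only shows the losing case.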

Checklist

  • I followed the contributing guide
  • I added or updated the documentation (if applicable) — N/A, test-only change
  • I updated docs/changelog.md with user-relevant changes — N/A, no user-facing change
  • I used AI to assist with preparing this PR
  • I added or updated tests as needed (unit and integration)
  • If this is a Grafana dashboard: I added a screenshot of the dashboard — N/A
  • If this is Terraform: terraform fmt passes and tflint reports no errors — N/A
  • If the github-runner-manager application has been changed: version updated in github-runner-manager/pyproject.toml — N/A, not changed

The test dispatched a quick workflow, waited for it to complete, then SSHed
into the runner to read the otel config the pre-job script wrote. Ephemeral
runner VMs are torn down on job completion, so the inspection raced the
runner-manager cleanup: when cleanup won, get_single_runner found zero VMs
and the test failed with an empty runner list.

Dispatch the long-running wait workflow and inspect the runner while its job
is in progress, so the VM is guaranteed alive. Poll on the config file to
absorb the pre-job hook write timing.
@cbartz cbartz marked this pull request as ready for review July 2, 2026 12:41

2 participants