
Keep LAVA infrastructure failures incomplete #3120

Merged
nuclearcat merged 2 commits into kernelci:main from nuclearcat:kcidb-infra-error-status
Jun 23, 2026

Conversation

@nuclearcat

Member

Do not convert baseline jobs with Infrastructure error metadata from incomplete to fail when setup/login/kernel-message stages fail early. This preserves the callback status so downstream KCIDB reporting can publish ERROR instead of FAIL.

Add regressions for early LAVA infrastructure failures with no setup results, login failures, and kernel-message failures.

This is a best-effort fix for the old discussion around issue #1087. I am not fully sure this is the correct final fix because that discussion happened a while ago and I do not remember all the details, so this should get maintainer review against the original failure modes.
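To make the intended behaviour concrete, here is a minimal sketch of the status decision described above. It is not the pipeline's actual code: `resolve_job_result`, `kcidb_status`, the stage names, the `"Infrastructure"` error-type string, and the result-to-KCIDB mapping are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical constants; the real pipeline may name these differently.
INFRA_ERROR_TYPE = "Infrastructure"
EARLY_STAGES = {"setup", "login", "kernel-messages"}


def resolve_job_result(callback_status: str,
                       error_type: Optional[str],
                       failed_stage: Optional[str]) -> str:
    """Decide which result to record for a baseline job.

    Assumed prior behaviour: an incomplete job whose early stage failed
    was converted to "fail". The change sketched here skips that
    conversion when the error metadata marks an infrastructure error.
    """
    if callback_status != "incomplete":
        return callback_status
    if error_type == INFRA_ERROR_TYPE:
        # Preserve the callback status so reporting can publish ERROR.
        return "incomplete"
    if failed_stage in EARLY_STAGES:
        return "fail"
    return "incomplete"


def kcidb_status(result: str) -> str:
    """Assumed mapping from a job result to a KCIDB status string."""
    return {"pass": "PASS", "fail": "FAIL", "incomplete": "ERROR"}[result]
```

Under these assumptions, a login failure with infrastructure error metadata stays `incomplete` and is reported as `ERROR`, while the same login failure without that metadata is still converted to `fail`.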

@nuclearcat nuclearcat force-pushed the kcidb-infra-error-status branch from 5acacc9 to 61d3ee0 Compare June 13, 2026 12:16
@nuclearcat nuclearcat marked this pull request as ready for review June 23, 2026 06:48
Do not convert baseline jobs with Infrastructure error metadata from incomplete to fail when setup/login/kernel-message stages fail early. This preserves the callback status so downstream KCIDB reporting can publish ERROR instead of FAIL.

Add regressions for early LAVA infrastructure failures with no setup results, login failures, and kernel-message failures.

This is a best-effort fix for the old discussion around issue kernelci#1087. I am not fully sure this is the correct final fix because that discussion happened a while ago and I do not remember all the details, so this should get maintainer review against the original failure modes.

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
@nuclearcat nuclearcat force-pushed the kcidb-infra-error-status branch from 61d3ee0 to 5aca08d Compare June 23, 2026 06:51
get_configs() matched a scheduler entry whenever the current event
values satisfied it (level-triggered). Because a node event is emitted
on every update, any update that left the node in an already-matching
state (an artifact, timeout or flag change on an "available" node)
re-triggered creation of the whole set of child jobs.

A 6-month audit of production confirmed this happens routinely: across
26 sampled days (236,139 job nodes) there were 4,380 duplicate job
groups -- identical parent/name/runtime/platform with the same
retry_counter, created seconds apart -- with the rate rising sharply
from late April 2026, reaching ~10% of jobs on some days.

Make scheduling edge-triggered: fire only on the transition into the
matched condition, using previous_state/previous_result now carried in
the event. Falls back to the previous level-triggered behaviour when
that information is absent (node creation, retry events, older API), so
retries and freshly created nodes are unaffected.

Fixes kernelci#2912

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
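The difference between level-triggered and edge-triggered matching can be sketched as follows. This is an illustrative stand-in rather than the real `get_configs()` logic: the `should_schedule` helper and the flat event dictionary layout are assumptions, and only `previous_state` and `previous_result` are taken from the commit message.

```python
from typing import Optional


def should_schedule(event: dict,
                    wanted_state: str,
                    wanted_result: Optional[str] = None) -> bool:
    """Return True if a scheduler entry should fire for this event."""

    def matches(state: Optional[str], result: Optional[str]) -> bool:
        # A scheduler entry matches on state and, optionally, on result.
        if state != wanted_state:
            return False
        return wanted_result is None or result == wanted_result

    # The current values must satisfy the entry in either mode.
    if not matches(event.get("state"), event.get("result")):
        return False

    previous_state = event.get("previous_state")
    if previous_state is None:
        # No transition information (node creation, retry events,
        # older API): fall back to level-triggered behaviour.
        return True

    # Edge-triggered: fire only when the previous values did not match,
    # i.e. on the transition into the matched condition.
    return not matches(previous_state, event.get("previous_result"))
```

With this sketch, an update that leaves an already "available" node in the "available" state no longer fires, while the first transition into "available" and events without previous-state information still do.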
@nuclearcat nuclearcat force-pushed the kcidb-infra-error-status branch from 5aca08d to ad77864 Compare June 23, 2026 06:56
@nuclearcat nuclearcat added this pull request to the merge queue Jun 23, 2026
Merged via the queue into kernelci:main with commit 60de9d3 Jun 23, 2026
3 checks passed
@nuclearcat nuclearcat deleted the kcidb-infra-error-status branch June 23, 2026 06:59
