The build_windows check fails intermittantly. Here is an example:
https://github.com/Azure/azihsm-sdk/actions/runs/26306854393/attempts/1?pr=384
Copilots explanation provides this:
The failure is in init_resiliency_stress.cert_chain_under_reset in api/tests/cpp/resiliency/init_stress_tests.cpp.
What failed
The log shows a single transient failure during the fetch step:
cert_chain fetch 246 got -4
then the test fails because it requires all 500 operations to succeed:
Too many operations failed (ok=499, min_expected=500)
The relevant code is here:
size query and fetch loop: init_stress_tests.cpp:46-64
strict success threshold: resiliency_stress_helpers.hpp:179-210
This is running in the Windows coverage job defined in .github/workflows/rust.yml, where build_windows executes cargo xtask precheck --coverage.
Root cause
cert_chain_under_reset uses a two-call pattern:
query required buffer size
fetch the property
Because resets are happening concurrently, the second call can race with a reset and return a transient error. The test currently treats any such transient failure as fatal for the success count, unlike init_part_under_reset, which already allows a small number of transient failures.
So the issue is not the workflow itself; it is that the test is too strict for a reset-stress scenario.
The build_windows check fails intermittantly. Here is an example:
https://github.com/Azure/azihsm-sdk/actions/runs/26306854393/attempts/1?pr=384
Copilots explanation provides this:
The failure is in init_resiliency_stress.cert_chain_under_reset in api/tests/cpp/resiliency/init_stress_tests.cpp.
What failed
The log shows a single transient failure during the fetch step:
cert_chain fetch 246 got -4
then the test fails because it requires all 500 operations to succeed:
Too many operations failed (ok=499, min_expected=500)
The relevant code is here:
size query and fetch loop: init_stress_tests.cpp:46-64
strict success threshold: resiliency_stress_helpers.hpp:179-210
This is running in the Windows coverage job defined in .github/workflows/rust.yml, where build_windows executes cargo xtask precheck --coverage.
Root cause
cert_chain_under_reset uses a two-call pattern:
query required buffer size
fetch the property
Because resets are happening concurrently, the second call can race with a reset and return a transient error. The test currently treats any such transient failure as fatal for the success count, unlike init_part_under_reset, which already allows a small number of transient failures.
So the issue is not the workflow itself; it is that the test is too strict for a reset-stress scenario.