Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race #17878

Open
CRZbulabula wants to merge 3 commits into master from fix-schema-region-partition-race-remove-message

@CRZbulabula CRZbulabula commented Jun 9, 2026

Contributor

Description

This PR hardens ConfigNode RegionGroup / partition handling and clarifies the RemoveDataNode error, in two independent areas.

RemoveDataNode (single replica)

  • Reject RemoveDataNode early when data_replication_factor is 1 or an existing DataRegion has only one replica (hasSingleDataRegionReplica), so the request fails fast with a clear, user-facing reason instead of a generic "failed to remove" message.
  • Reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE (en + zh): the message now states the cause and ends with "Removing DataNodes is not supported with single replica.", dropping the previous misleading data-loss / "increase data_replication_factor" tail.
  • Add IT coverage (IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL) asserting the new single-replica error message.

RegionGroup creation robustness + race diagnostics

  • Propagate RegionGroup persistence failures: ConfigNodeProcedureEnv.persistRegionGroup now returns a TSStatus, and CreateRegionGroupsProcedure fails the procedure (setFailure) instead of continuing to activate RegionGroups after a failed consensus write.
  • Add a single, failure-path-only diagnostic for the intermittent "no available SchemaRegionGroups" race seen in CI (e.g. IoTDBRawQueryWithoutValueFilterWithDeletionIT). Right before getSortedRegionGroupSlotsCounter throws NoAvailableRegionGroupException, it now logs (WARN, once) every RegionGroup visible in PartitionInfo for the Database together with its LoadCache status. An empty set tells us PartitionInfo has not exposed the new RegionGroup yet; a non-empty all-Disabled set tells us the RegionGroups exist but LoadCache still marks them unavailable. This pinpoints the root cause on the next reproduction without flooding the log (it only fires when allocation already failed).
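The first bullet above (a persistence step that reports its status, and a procedure that stops on failure) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Status`, `Outcome`, `runProcedure`, and the success code 200 are hypothetical stand-ins chosen for the sketch, and only the names `persistRegionGroup` and `setFailure` come from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins only; none of these types are IoTDB's real classes. */
final class PersistFailureSketch {

  /** Simplified status object. Treating code 200 as success is an assumption of this sketch. */
  static final class Status {
    final int code;
    final String message;

    Status(int code, String message) {
      this.code = code;
      this.message = message;
    }

    boolean isSuccess() {
      return code == 200;
    }
  }

  enum Outcome { ACTIVATED, FAILED }

  /** Simulates the consensus write and returns its status instead of swallowing it. */
  static Status persistRegionGroup(boolean consensusWriteSucceeds) {
    return consensusWriteSucceeds
        ? new Status(200, "OK")
        : new Status(500, "consensus write failed");
  }

  /** Fails the procedure when persistence failed; activation is only reached on success. */
  static Outcome runProcedure(boolean consensusWriteSucceeds, List<String> steps) {
    Status status = persistRegionGroup(consensusWriteSucceeds);
    if (!status.isSuccess()) {
      steps.add("setFailure: " + status.message);
      return Outcome.FAILED;
    }
    steps.add("activate RegionGroups");
    return Outcome.ACTIVATED;
  }

  public static void main(String[] args) {
    List<String> steps = new ArrayList<>();
    System.out.println(runProcedure(false, steps) + " " + steps);
  }
}
```

The point of the shape is that the activation step is unreachable after a failed write, which is the behavior change the bullet describes.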

Note: an earlier revision of this PR added a fixed-timeout busy-wait (waitForRegionGroupsVisible) in PartitionManager to wait for newly created RegionGroups to become visible. That has been removed: on the observed code path it only masks the race probabilistically rather than fixing it, so this PR ships the targeted diagnostic instead and leaves the actual root-cause fix to a follow-up once the next repro confirms which scenario occurs.

Tests

  • mvn compile -pl iotdb-core/confignode
  • mvn verify -DskipUTs -Drat.skip=true -Dit.test=IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -am -PClusterIT -P with-integration-tests

🤖 Generated with Claude Code

codecov Bot commented Jun 9, 2026

Codecov Report

❌ Patch coverage is 0% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.67%. Comparing base (c3e74a2) to head (9c46d01).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...onfignode/procedure/env/RemoveDataNodeHandler.java | 0.00% | 31 Missing ⚠️ |
| ...confignode/manager/partition/PartitionManager.java | 0.00% | 6 Missing ⚠️ |
| ...nfignode/procedure/env/ConfigNodeProcedureEnv.java | 0.00% | 4 Missing ⚠️ |
| ...edure/impl/region/CreateRegionGroupsProcedure.java | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17878      +/-   ##
============================================
+ Coverage     40.54%   40.67%   +0.12%     
+ Complexity     2622     2620       -2     
============================================
  Files          5244     5244              
  Lines        362367   362397      +30     
  Branches      46651    46651              
============================================
+ Hits         146938   147409     +471     
+ Misses       215429   214988     -441     

CRZbulabula changed the title from "Fix SchemaRegion visibility race and clarify RemoveDataNode error" to "Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race" Jun 9, 2026
…clarify RemoveDataNode message

Drop the 10s busy-wait waitForRegionGroupsVisible loop in PartitionManager:
the schema-region create/activate/allocate path is unchanged from the PR
baseline, so the poll only masks the intermittent "no available RegionGroup"
race probabilistically rather than fixing it. Instead, log a single WARN on
the failure path (right before throwing NoAvailableRegionGroupException) that
dumps every RegionGroup visible in PartitionInfo for the Database and its
LoadCache status. This pinpoints, on the next CI repro, whether PartitionInfo
has no RegionGroup yet or has some that are all Disabled — without flooding
the log, since it only fires when allocation already failed.
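
How the single WARN line separates the two scenarios can be sketched like this. It is an illustration under stated assumptions, not the code in PartitionManager: `GroupStatus`, `Scenario`, `classify`, and `warnLine` are hypothetical names, and the two-value status enum is a simplification of whatever LoadCache actually reports.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the failure-path diagnostic; not IoTDB's real classes. */
final class RegionGroupRaceDiagnosisSketch {

  /** Simplified status, assumed for illustration. */
  enum GroupStatus { RUNNING, DISABLED }

  enum Scenario { NOT_YET_VISIBLE, ALL_DISABLED, SOME_USABLE }

  /** Classifies what is visible for one Database at the moment allocation failed. */
  static Scenario classify(Map<Integer, GroupStatus> visibleGroups) {
    if (visibleGroups.isEmpty()) {
      // Nothing visible: the new RegionGroup has not been exposed yet.
      return Scenario.NOT_YET_VISIBLE;
    }
    boolean allDisabled =
        visibleGroups.values().stream().allMatch(s -> s == GroupStatus.DISABLED);
    // Groups exist; the question is whether every one of them is marked unavailable.
    return allDisabled ? Scenario.ALL_DISABLED : Scenario.SOME_USABLE;
  }

  /** Builds the one-off message that would be logged right before the exception is thrown. */
  static String warnLine(String database, Map<Integer, GroupStatus> visibleGroups) {
    return "No available RegionGroup for Database "
        + database
        + ", visible="
        + visibleGroups
        + ", scenario="
        + classify(visibleGroups);
  }

  public static void main(String[] args) {
    Map<Integer, GroupStatus> groups = new LinkedHashMap<>();
    System.out.println(warnLine("root.db", groups));
    groups.put(1, GroupStatus.DISABLED);
    System.out.println(warnLine("root.db", groups));
  }
}
```

Because the message is built only on the failure path, the cost on the normal allocation path is zero.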

Also reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE
(en + zh): drop the misleading data-loss / increase-factor tail and end with
"Removing DataNodes is not supported with single replica."
@CRZbulabula CRZbulabula force-pushed the fix-schema-region-partition-race-remove-message branch from 977ebb7 to 9c46d01 Compare June 9, 2026 11:30
The previous guard rejected removing any DataNode whenever the cluster
kept a single replica (data_replication_factor == 1), which broke
IoTDBClusterNodeGetterIT.queryAndRemoveDataNodeTest (2C2D, single
replica, removing 1 of 2 DataNodes is legal and must return SUCCESS).

Drop the blanket hasSingleDataRegionReplica() guard and reuse the
existing capacity check: removal is rejected only when it would leave
fewer than NodeInfo.getMinimumDataNode() DataNodes. Under a true single
replica (MINIMUM_DATANODE == max(schema, data) == 1) that means only the
last remaining DataNode cannot be removed, with a dedicated message.
Updated the failing-path IT accordingly.
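
The capacity rule described in this commit message can be written out as a small check. This is a sketch assuming the minimum is max(schema replication factor, data replication factor), as the message states; `minimumDataNode` and `canRemove` are hypothetical helpers, and counting registered DataNodes (rather than, say, running ones) is an assumption.

```java
/** Hypothetical sketch of the capacity check; not the code in RemoveDataNodeHandler. */
final class RemoveDataNodeCapacitySketch {

  /** Minimum DataNodes the cluster must keep: the larger of the two replication factors. */
  static int minimumDataNode(int schemaReplicationFactor, int dataReplicationFactor) {
    return Math.max(schemaReplicationFactor, dataReplicationFactor);
  }

  /** Removal is allowed only if enough DataNodes remain afterwards. */
  static boolean canRemove(
      int registeredDataNodes,
      int dataNodesToRemove,
      int schemaReplicationFactor,
      int dataReplicationFactor) {
    int remaining = registeredDataNodes - dataNodesToRemove;
    return remaining >= minimumDataNode(schemaReplicationFactor, dataReplicationFactor);
  }

  public static void main(String[] args) {
    // 2 DataNodes, single replica: removing one leaves 1 >= 1, so removal is legal.
    System.out.println(canRemove(2, 1, 1, 1));
    // 1 DataNode, single replica: removing it leaves 0 < 1, so removal is rejected.
    System.out.println(canRemove(1, 1, 1, 1));
  }
}
```

With both factors equal to 1, the check only blocks removal of the last DataNode, which matches the 2C2D case from IoTDBClusterNodeGetterIT.queryAndRemoveDataNodeTest described above.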