Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race#17878
Open
CRZbulabula wants to merge 3 commits into
Open
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17878 +/- ##
============================================
+ Coverage 40.54% 40.67% +0.12%
+ Complexity 2622 2620 -2
============================================
Files 5244 5244
Lines 362367 362397 +30
Branches 46651 46651
============================================
+ Hits 146938 147409 +471
+ Misses 215429 214988 -441 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…clarify RemoveDataNode message Drop the 10s busy-wait waitForRegionGroupsVisible loop in PartitionManager: the schema-region create/activate/allocate path is unchanged from the PR baseline, so the poll only masks the intermittent "no available RegionGroup" race probabilistically rather than fixing it. Instead, log a single WARN on the failure path (right before throwing NoAvailableRegionGroupException) that dumps every RegionGroup visible in PartitionInfo for the Database and its LoadCache status. This pinpoints, on the next CI repro, whether PartitionInfo has no RegionGroup yet or has some that are all Disabled — without flooding the log, since it only fires when allocation already failed. Also reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE (en + zh): drop the misleading data-loss / increase-factor tail and end with "Removing DataNodes is not supported with single replica."
977ebb7 to
9c46d01
Compare
The previous guard rejected removing any DataNode whenever the cluster kept a single replica (data_replication_factor == 1), which broke IoTDBClusterNodeGetterIT.queryAndRemoveDataNodeTest (2C2D, single replica, removing 1 of 2 DataNodes is legal and must return SUCCESS). Drop the blanket hasSingleDataRegionReplica() guard and reuse the existing capacity check: removal is rejected only when it would leave fewer than NodeInfo.getMinimumDataNode() DataNodes. Under a true single replica (MINIMUM_DATANODE == max(schema, data) == 1) that means only the last remaining DataNode cannot be removed, with a dedicated message. Updated the failing-path IT accordingly.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Description
This PR hardens ConfigNode RegionGroup / partition handling and clarifies the RemoveDataNode error, in two independent areas.
RemoveDataNode (single replica)
RemoveDataNodeearly whendata_replication_factoris 1 or an existing DataRegion has only one replica (hasSingleDataRegionReplica), so the request fails fast with a clear, user-facing reason instead of a generic "failed to remove" message.FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE(en + zh): the message now states the cause and ends with "Removing DataNodes is not supported with single replica.", dropping the previous misleading data-loss / "increase data_replication_factor" tail.IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL) asserting the new single-replica error message.RegionGroup creation robustness + race diagnostics
ConfigNodeProcedureEnv.persistRegionGroupnow returns aTSStatus, andCreateRegionGroupsProcedurefails the procedure (setFailure) instead of continuing to activate RegionGroups after a failed consensus write.IoTDBRawQueryWithoutValueFilterWithDeletionIT). Right beforegetSortedRegionGroupSlotsCounterthrowsNoAvailableRegionGroupException, it now logs (WARN, once) every RegionGroup visible inPartitionInfofor the Database together with its LoadCache status. An empty set tells us PartitionInfo has not exposed the new RegionGroup yet; a non-empty all-Disabledset tells us the RegionGroups exist but LoadCache still marks them unavailable. This pinpoints the root cause on the next reproduction without flooding the log (it only fires when allocation already failed).Note: an earlier revision of this PR added a fixed-timeout busy-wait (
waitForRegionGroupsVisible) inPartitionManagerto wait for newly created RegionGroups to become visible. That has been removed: on the observed code path it only masks the race probabilistically rather than fixing it, so this PR ships the targeted diagnostic instead and leaves the actual root-cause fix to a follow-up once the next repro confirms which scenario occurs.Tests
mvn compile -pl iotdb-core/confignodemvn verify -DskipUTs -Drat.skip=true -Dit.test=IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -am -PClusterIT -P with-integration-tests🤖 Generated with Claude Code