Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race by CRZbulabula · Pull Request #17878 · apache/iotdb

CRZbulabula · 2026-06-09T07:47:12Z

Description

This PR hardens ConfigNode RegionGroup / partition handling and clarifies the RemoveDataNode error, in two independent areas.

RemoveDataNode (single replica)

Reject RemoveDataNode early when data_replication_factor is 1 or an existing DataRegion has only one replica (hasSingleDataRegionReplica), so the request fails fast with a clear, user-facing reason instead of a generic "failed to remove" message.
Reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE (en + zh): the message now states the cause and ends with "Removing DataNodes is not supported with single replica.", dropping the previous misleading data-loss / "increase data_replication_factor" tail.
Add IT coverage (IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL) asserting the new single-replica error message.

RegionGroup creation robustness + race diagnostics

Propagate RegionGroup persistence failures: ConfigNodeProcedureEnv.persistRegionGroup now returns a TSStatus, and CreateRegionGroupsProcedure fails the procedure (setFailure) instead of continuing to activate RegionGroups after a failed consensus write.
Add a single, failure-path-only diagnostic for the intermittent "no available SchemaRegionGroups" race seen in CI (e.g. IoTDBRawQueryWithoutValueFilterWithDeletionIT). Right before getSortedRegionGroupSlotsCounter throws NoAvailableRegionGroupException, it now logs (WARN, once) every RegionGroup visible in PartitionInfo for the Database together with its LoadCache status. An empty set tells us PartitionInfo has not exposed the new RegionGroup yet; a non-empty all-Disabled set tells us the RegionGroups exist but LoadCache still marks them unavailable. This pinpoints the root cause on the next reproduction without flooding the log (it only fires when allocation already failed).
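
The two outcomes that the diagnostic is meant to tell apart can be sketched as follows. This is a minimal illustration, not the code from the PR: the class `NoAvailableRegionGroupDiagnosisSketch`, the nested `Status` enum, and the `diagnose` helper are all hypothetical names, and the real implementation reads from PartitionInfo and LoadCache rather than from a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NoAvailableRegionGroupDiagnosisSketch {

  /** Hypothetical stand-in for a RegionGroup's LoadCache status; not IoTDB's real enum. */
  enum Status { RUNNING, DISABLED }

  /**
   * Builds the one-off diagnostic message for a failed allocation.
   * Keys are hypothetical RegionGroup ids, values are their cached status.
   */
  static String diagnose(String database, Map<Integer, Status> visibleGroups) {
    if (visibleGroups.isEmpty()) {
      // Scenario 1: the new RegionGroup has not been exposed yet.
      return "Database " + database + ": no RegionGroup visible in PartitionInfo yet";
    }
    boolean allDisabled =
        visibleGroups.values().stream().allMatch(s -> s == Status.DISABLED);
    if (allDisabled) {
      // Scenario 2: RegionGroups exist, but the cache marks every one unavailable.
      return "Database " + database + ": RegionGroups exist but are all Disabled " + visibleGroups;
    }
    return "Database " + database + ": mixed RegionGroup statuses " + visibleGroups;
  }

  public static void main(String[] args) {
    Map<Integer, Status> groups = new LinkedHashMap<>();
    System.out.println(diagnose("root.db", groups));
    groups.put(1, Status.DISABLED);
    System.out.println(diagnose("root.db", groups));
  }
}
```

In this sketch the message is only built by the caller on the failure path, which mirrors the PR's intent of logging once when allocation has already failed.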

Note: an earlier revision of this PR added a fixed-timeout busy-wait (waitForRegionGroupsVisible) in PartitionManager to wait for newly created RegionGroups to become visible. That has been removed: on the observed code path it only masks the race probabilistically rather than fixing it, so this PR ships the targeted diagnostic instead and leaves the actual root-cause fix to a follow-up once the next repro confirms which scenario occurs.
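
For the persistence-failure propagation described under "RegionGroup creation robustness", a rough sketch of the control flow is shown below. It assumes a simplified, hypothetical `Status` type in place of TSStatus and a hypothetical `CreateRegionGroupsFlowSketch` class in place of CreateRegionGroupsProcedure; only the shape of the check (stop and record a failure instead of activating) is taken from the description.

```java
import java.util.function.Supplier;

final class CreateRegionGroupsFlowSketch {

  /** Hypothetical, simplified stand-in for a status object such as TSStatus. */
  static final class Status {
    final boolean success;
    final String message;

    Status(boolean success, String message) {
      this.success = success;
      this.message = message;
    }
  }

  enum State { PERSIST, ACTIVATE, DONE, FAILED }

  State state = State.PERSIST;
  String failureReason;

  /** Persists first; activation only runs when persistence reported success. */
  void execute(Supplier<Status> persistRegionGroup, Runnable activateRegionGroups) {
    Status status = persistRegionGroup.get();
    if (!status.success) {
      // Analogous to failing the procedure instead of continuing to activate.
      failureReason = "Failed to persist RegionGroups: " + status.message;
      state = State.FAILED;
      return;
    }
    state = State.ACTIVATE;
    activateRegionGroups.run();
    state = State.DONE;
  }

  public static void main(String[] args) {
    CreateRegionGroupsFlowSketch flow = new CreateRegionGroupsFlowSketch();
    flow.execute(
        () -> new Status(false, "consensus write rejected"),
        () -> System.out.println("activating RegionGroups"));
    System.out.println(flow.state + ": " + flow.failureReason);
  }
}
```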

Tests

mvn compile -pl iotdb-core/confignode
mvn verify -DskipUTs -Drat.skip=true -Dit.test=IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -am -PClusterIT -P with-integration-tests

🤖 Generated with Claude Code

codecov · 2026-06-09T08:39:27Z

Codecov Report

❌ Patch coverage is 0% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.67%. Comparing base (c3e74a2) to head (9c46d01).
⚠️ Report is 1 commit behind head on master.

Files with missing lines	Patch %	Lines
...onfignode/procedure/env/RemoveDataNodeHandler.java	0.00%	31 Missing ⚠️
...confignode/manager/partition/PartitionManager.java	0.00%	6 Missing ⚠️
...nfignode/procedure/env/ConfigNodeProcedureEnv.java	0.00%	4 Missing ⚠️
...edure/impl/region/CreateRegionGroupsProcedure.java	0.00%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17878      +/-   ##
============================================
+ Coverage     40.54%   40.67%   +0.12%     
+ Complexity     2622     2620       -2     
============================================
  Files          5244     5244              
  Lines        362367   362397      +30     
  Branches      46651    46651              
============================================
+ Hits         146938   147409     +471     
+ Misses       215429   214988     -441

…clarify RemoveDataNode message

Drop the 10s busy-wait waitForRegionGroupsVisible loop in PartitionManager: the schema-region create/activate/allocate path is unchanged from the PR baseline, so the poll only masks the intermittent "no available RegionGroup" race probabilistically rather than fixing it.

Instead, log a single WARN on the failure path (right before throwing NoAvailableRegionGroupException) that dumps every RegionGroup visible in PartitionInfo for the Database and its LoadCache status. This pinpoints, on the next CI repro, whether PartitionInfo has no RegionGroup yet or has some that are all Disabled, without flooding the log, since it only fires when allocation already failed.

Also reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE (en + zh): drop the misleading data-loss / increase-factor tail and end with "Removing DataNodes is not supported with single replica."

The previous guard rejected removing any DataNode whenever the cluster kept a single replica (data_replication_factor == 1), which broke IoTDBClusterNodeGetterIT.queryAndRemoveDataNodeTest (2C2D, single replica, removing 1 of 2 DataNodes is legal and must return SUCCESS). Drop the blanket hasSingleDataRegionReplica() guard and reuse the existing capacity check: removal is rejected only when it would leave fewer than NodeInfo.getMinimumDataNode() DataNodes. Under a true single replica (MINIMUM_DATANODE == max(schema, data) == 1) that means only the last remaining DataNode cannot be removed, with a dedicated message. Updated the failing-path IT accordingly.
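
The capacity rule described in this commit message can be illustrated with a small sketch. The class and method names here (`RemoveDataNodeCapacityCheckSketch`, `minimumDataNode`, `canRemove`) are hypothetical; the only parts taken from the text are that the minimum is the larger of the schema and data replication factors and that a removal is rejected when it would leave fewer DataNodes than that minimum.

```java
final class RemoveDataNodeCapacityCheckSketch {

  /** Minimum DataNode count, assumed here to be max(schema, data) replication factor. */
  static int minimumDataNode(int schemaReplicationFactor, int dataReplicationFactor) {
    return Math.max(schemaReplicationFactor, dataReplicationFactor);
  }

  /** Returns true when the cluster still has enough DataNodes after the removal. */
  static boolean canRemove(
      int currentDataNodes, int nodesToRemove, int schemaReplicationFactor, int dataReplicationFactor) {
    int remaining = currentDataNodes - nodesToRemove;
    return remaining >= minimumDataNode(schemaReplicationFactor, dataReplicationFactor);
  }

  public static void main(String[] args) {
    // 2 DataNodes, single replica: removing one of them is legal.
    System.out.println(canRemove(2, 1, 1, 1));
    // 1 DataNode, single replica: the last remaining DataNode cannot be removed.
    System.out.println(canRemove(1, 1, 1, 1));
  }
}
```

With two DataNodes and a single replica the first call returns true, which matches the 2C2D case from IoTDBClusterNodeGetterIT.queryAndRemoveDataNodeTest; the second call returns false because no DataNode would remain.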

sonarqubecloud · 2026-06-10T06:01:22Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

CRZbulabula changed the title ~~Fix SchemaRegion visibility race and clarify RemoveDataNode error~~ Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race Jun 9, 2026

CRZbulabula added 2 commits June 9, 2026 19:21

Fix schema region visibility race and remove datanode message

96b3c05

CRZbulabula force-pushed the fix-schema-region-partition-race-remove-message branch from 977ebb7 to 9c46d01 Compare June 9, 2026 11:30

CRZbulabula wants to merge 3 commits into master from fix-schema-region-partition-race-remove-message
