Skip to content

fix: fix confignode and ainode restart function#17862

Open
Alchuang22-dev wants to merge 7 commits into
apache:masterfrom
Alchuang22-dev:fix-restart
Open

fix: fix confignode and ainode restart function#17862
Alchuang22-dev wants to merge 7 commits into
apache:masterfrom
Alchuang22-dev:fix-restart

Conversation

@Alchuang22-dev

Copy link
Copy Markdown
Contributor

Description

Improve restart robustness for ConfigNode and AINode by making partial startup states explicit and preventing unsafe recovery paths.

Content

Here are the changes:

  1. Introduce explicit ConfigNode startup states and make restart detection stricter.
  2. Harden ConsensusManager.start() so consensus initialization failures are propagated instead of being silently ignored.
  3. Persist ConfigNode system.properties only after consensus group creation, leader election, and seed ConfigNode application all succeed.
  4. Harden AINode restart handling: preserve local ainode_id, back up old system.properties on failure, and avoid implicit re-registration with a new AINode id.

Added unit tests for:

  • ConfigNode startup state classification
  • Consensus startup failure handling
  • Delayed ConfigNode system property persistence
  • AINode restart failure and local state preservation
  • Rejection of unsafe AINode restart with changed endpoint

This PR has:

  • been self-reviewed.
    • concurrent read
    • concurrent write
    • concurrent read and write
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods.
  • added or updated version, license, or notice information
  • added comments explaining the "why" and the intent of the code wherever would not be obvious
    for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold
    for code coverage.
  • added integration tests.
  • been tested in a test IoTDB cluster.

PTAL.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves restart robustness for ConfigNode and AINode by explicitly modeling partial startup states, tightening restart validation, and ensuring startup failures are surfaced rather than allowing unsafe recovery paths.

Changes:

  • Add explicit ConfigNode startup state classification (first start / restart / partial / corrupted) and enforce stricter startup gating.
  • Propagate consensus initialization failures and delay ConfigNode system.properties persistence until post-consensus + leader election + seed apply succeeds.
  • Harden AINode restart behavior to preserve local ainode_id, avoid implicit re-registration, and back up local system.properties on failure; add unit tests across these behaviors.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/service/ConfigNodeStartupPersistenceTest.java New tests ensuring system.properties is only persisted after successful seed apply.
iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/manager/node/NodeManagerTest.java New tests for applyConfigNode failure propagation and AINode restart update error handling.
iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/manager/node/ClusterNodeStartUtilsTest.java New tests rejecting AINode restart when endpoint changes.
iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/manager/consensus/ConsensusManagerTest.java Expanded tests for consensus start behavior on first start vs restart and failure cases.
iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/conf/SystemPropertiesUtilsTest.java New tests for startup state classification (first/restart/partial/corrupted).
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/service/ConfigNode.java Persist system properties only after successful seed apply via consensus.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/node/NodeManager.java Return/propagate failure statuses for config node apply and harden AINode restart updates.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/node/ClusterNodeStartUtils.java Add AINode endpoint-change detection during restart confirmation.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/consensus/ConsensusManager.java Make consensus start fail fast for invalid startup states and on local peer creation failures.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/conf/SystemPropertiesUtils.java Introduce explicit startup states and stricter restart detection across system+consensus storage.
iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/conf/ConfigNodeStartupCheck.java Enforce startup gating based on explicit startup state classification.
iotdb-core/ainode/tests/test_ai_node_restart.py New unit tests for AINode restart failure handling and local state preservation.
iotdb-core/ainode/iotdb/ainode/core/rpc/client.py Make node_restart return status and avoid implicit re-registration messaging on rejection.
iotdb-core/ainode/iotdb/ainode/core/config.py Load system.properties after applying configured system dir and preserve unregistered id default.
iotdb-core/ainode/iotdb/ainode/core/ai_node.py Add backup + atomic write helpers for system.properties and harden restart path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +126 to +128
if (file.isFile() || !file.isDirectory()) {
return LocalStorageState.PRESENT;
}
Comment on lines +338 to +341
if "ain_inference_model_mem_usage_map" in config_keys:
self._config.set_ain_inference_model_mem_usage_map(
eval(file_configs["ain_inference_model_mem_usage_map"])
)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants