Skip to content

feat(core): introduce snapshot cleanup operations#57

Open
zjw1111 wants to merge 1 commit into
apache:mainfrom
zjw1111:migrate/clean-files
Open

feat(core): introduce snapshot cleanup operations#57
zjw1111 wants to merge 1 commit into
apache:mainfrom
zjw1111:migrate/clean-files

Conversation

@zjw1111

@zjw1111 zjw1111 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #xxx

Introduce snapshot expiration and orphan file cleanup operations from the Alibaba Paimon C++ repository.

Migrated files:

  • include/paimon/orphan_files_cleaner.h
  • src/paimon/core/operation/expire_snapshots.h
  • src/paimon/core/operation/expire_snapshots.cpp
  • src/paimon/core/operation/expire_snapshots_test.cpp
  • src/paimon/core/operation/orphan_files_cleaner.cpp
  • src/paimon/core/operation/orphan_files_cleaner_impl.h
  • src/paimon/core/operation/orphan_files_cleaner_impl.cpp
  • src/paimon/core/operation/orphan_files_cleaner_test.cpp
  • src/paimon/core/operation/metrics/clean_metrics.h

The dependency closure is intentionally not included in this PR.

Tests

  • git diff --cached --check
  • git diff --check

API and Format

This change adds the public orphan_files_cleaner.h API. It does not change storage format or protocol.

Documentation

No documentation changes.

Generative AI tooling

Migrated-by: OpenAI Codex

Copilot AI review requested due to automatic review settings June 8, 2026 09:11
@zjw1111

zjw1111 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @chongchongxiao for your previous contributions to paimon-cpp. This migration PR carries that work forward into Apache Paimon C++.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a C++ orphan files cleaning API/implementation with metrics, along with snapshot expiration logic and accompanying unit tests.

Changes:

  • Added OrphanFilesCleaner public API (CleanContext + builder) and OrphanFilesCleanerImpl implementation to detect and delete orphan files.
  • Added cleaning metrics (CleanMetrics) and unit tests for the orphan cleaner behavior/unsupported-table cases.
  • Added ExpireSnapshots implementation and tests for snapshot expiration and manifest/data-file deletion.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
src/paimon/core/operation/orphan_files_cleaner_test.cpp New unit tests covering orphan cleaner patterns and unsupported table scenarios.
src/paimon/core/operation/orphan_files_cleaner_impl.h Declares the orphan cleaner implementation (internal).
src/paimon/core/operation/orphan_files_cleaner_impl.cpp Implements orphan file discovery/deletion and metrics emission.
src/paimon/core/operation/orphan_files_cleaner.cpp Implements CleanContextBuilder and OrphanFilesCleaner::Create.
src/paimon/core/operation/metrics/clean_metrics.h Adds metric keys for the orphan cleaning operation.
src/paimon/core/operation/expire_snapshots_test.cpp New unit tests for snapshot expiration edge cases and data-file deletion mapping.
src/paimon/core/operation/expire_snapshots.h Declares the snapshot expiration operation.
src/paimon/core/operation/expire_snapshots.cpp Implements snapshot expiration, manifest cleanup, and empty-directory cleanup.
include/paimon/orphan_files_cleaner.h Adds the public orphan files cleaner API and docs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

std::set<std::string> ListFileDirs(const std::string& path, int32_t max_level) const;
Result<std::set<std::string>> GetUsedFiles() const;
Result<std::set<std::string>> GetUsedFilesBySnapshot(const Snapshot& snapshot) const;
static bool SupportToClean(const std::string& file_name);
}
PAIMON_ASSIGN_OR_RAISE(std::set<std::string> used_file_names, GetUsedFiles());

Duration duration;
Comment on lines +112 to +113
uint64_t file_statuses_duration = duration.Reset();
for (const auto& file_statuses : CollectAll(file_statuses_futures)) {
Comment on lines +133 to +137
need_to_deletes.insert(path);
futures.push_back(Via(executor_.get(), [this, path]() {
auto s = fs_->Delete(path, /*recursive=*/false);
(void)s;
}));
Comment on lines +197 to +202
for (const auto& empty_directory : to_delete_empty_directories) {
futures.push_back(Via(executor_.get(), [this, &empty_directory] {
auto ret = TryDeleteEmptyDirectory(empty_directory);
(void)ret;
}));
}
Comment on lines +325 to +329
// TODO(jinli.zjw): skip index manifests
if (snapshot.IndexManifest() && snapshot.IndexManifest().value() != "") {
assert(false);
return Status::NotImplemented("do not support expire snapshot with index manifest");
}
}
}

for (int32_t hierarchy = deduplicate.size() - 1; hierarchy >= 0; hierarchy--) {
Comment on lines +78 to +81
if (max_deletes < 0) {
return Status::Invalid(fmt::format(
"Expire failed: snapshot max delete num '{}' must be greater than 0", max_deletes));
}
Comment on lines +247 to +249
std::vector<std::string> to_delete_manifests;
// TODO(jinli.zjw): optimize for async
for (const auto& manifest_file_meta : manifest_file_metas) {

@leaves12138 leaves12138 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for migrating this feature. I think we need a couple of correctness fixes before merging:

  1. snapshot.time-retained is checked against the current candidate snapshot instead of its next snapshot. Snapshot N should only be expired after snapshot N + 1 has been retained long enough. With the current loop, if snapshot 5 is still within the retention window, ExpireUntil(..., 5) can still delete snapshot 4, even though its successor is not old enough yet. Please align this with the Java implementation by checking snapshot_id + 1 and add a regression test for the boundary.

  2. CleanEmptyDirectories schedules async deletion tasks that capture the range-for variable by reference ([this, &empty_directory]). Those tasks may execute after the loop variable has advanced or gone out of scope, so this is a use-after-scope risk and may delete the wrong directory. Please capture the path by value, e.g. [this, empty_directory], and add coverage if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants