[HUDI-8780][RFC-83][WIP] Incremental Table Service #12601

Open, wants to merge 7 commits into base: master

zhangyue19921010
Contributor

Change Logs

By default, Hudi scans every partition of the table when scheduling Compaction and Clustering. When a table has many historical partitions (640,000 in our production environment), this scanning and planning step becomes very inefficient. With Flink, it often causes checkpoint timeouts, which in turn delay data.
Cleaning, by contrast, already supports processing only incremental partitions.

This RFC draws on the design of Incremental Clean to generalize incremental partition processing to all table services, such as Clustering and Compaction.
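
The idea above can be illustrated with a minimal, language-neutral sketch. It is not Hudi code and does not reflect the RFC's actual implementation: the `Commit` type, the `partitions_to_plan` function, and the string-comparable instant times are all hypothetical names and assumptions introduced only for illustration. The sketch assumes each commit records the partitions it wrote, and that the planner remembers the instant of the last completed table service.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for a timeline entry (not a Hudi class)."""
    instant_time: str            # assumed to sort lexicographically by time
    partitions: FrozenSet[str]   # partitions written by this commit


def partitions_to_plan(
    commits: Iterable[Commit],
    all_partitions: Iterable[str],
    last_service_instant: Optional[str],
) -> Set[str]:
    """Return the partitions a table service should consider when planning.

    Assumption: with no previous table service there is nothing to be
    incremental against, so the planner falls back to a full scan.
    """
    if last_service_instant is None:
        return set(all_partitions)

    touched: Set[str] = set()
    for commit in commits:
        # Only commits made after the last completed table service matter.
        if commit.instant_time > last_service_instant:
            touched.update(commit.partitions)
    return touched
```

Under these assumptions, the planning cost scales with the number of partitions written since the last table service rather than with the total number of partitions in the table.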

Impact

compaction and clustering

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label on Jan 8, 2025
@TheR1sing3un
Member

@zhangyue19921010 Hi, judging from the RFC content, the goal this time is incremental compaction and clustering. Should we consider incorporating the incremental processing logic for clean as well? By the way, can this also solve #11647? It seems we could address it as part of this implementation.

@zhangyue19921010
Contributor Author

Hi @TheR1sing3un, thanks for your attention.

> @zhangyue19921010 Hi, judging from the RFC content, the goal this time is incremental compaction and clustering. Should we consider incorporating the incremental processing logic for clean as well?

Sure, we can build a unified incremental policy. However, the clean action is not implemented in a strategy-based style, even though it has several cleaning policies (clean by commits or clean by versions). We would therefore need to refactor the clean planning phase and abstract it into separate strategy objects. We can do that in the next PR.

> By the way, can this also solve #11647? It seems we could address it as part of this implementation.

Unfortunately, this PR does not solve that problem. It addresses incremental processing, that is, how to process only the incremental partitions on the next run after the last table service has completed. #11647 focuses on how to trigger the first action in an elegant way.
