
[Improve][Kafka] kafka source refactored some reader read logic #6408

Merged
14 commits merged into apache:dev on Sep 5, 2024

Conversation

Carl-Zhou-CN (Member):

An attempt to improve the Kafka source, with reference to the Flink Kafka source.

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@Carl-Zhou-CN Carl-Zhou-CN marked this pull request as ready for review May 13, 2024 07:40
@Carl-Zhou-CN Carl-Zhou-CN changed the title from "[Improve] kafka source improve" to "[Improve][Kafka] kafka source refactored some reader read logic" May 13, 2024

hailin0 (Member) commented Jun 15, 2024:

Please fix the conflict

Carl-Zhou-CN (Member Author):

> Please fix the conflict

done

Carl-Zhou-CN (Member Author):

@hailin0

@Hisoka-X Hisoka-X added this to the 2.3.8 milestone Aug 23, 2024
Comment on lines 40 to 41:

// ------------------------------------------------------------------------

Member:

Suggested change (remove this separator comment line):
- // ------------------------------------------------------------------------
Carl-Zhou-CN (Member Author):

done

*/
package org.apache.seatunnel.common.utils;

public final class TemporaryClassLoaderContext implements AutoCloseable {
Member:

I like this tool!
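For readers who have not seen the utility before: a minimal sketch of how such a guard typically works, assuming the SeaTunnel class mirrors Flink's org.apache.flink.util.TemporaryClassLoaderContext (the of factory and the pluginClassLoader variable below are illustrative, not necessarily the actual SeaTunnel API). It is a try-with-resources scope that swaps the thread context classloader and restores the original on close:

```java
public final class TemporaryClassLoaderContext implements AutoCloseable {
    private final Thread thread;
    private final ClassLoader original;

    private TemporaryClassLoaderContext(Thread thread, ClassLoader original) {
        this.thread = thread;
        this.original = original;
    }

    // Swap in the temporary classloader; remember the previous one for restore.
    public static TemporaryClassLoaderContext of(ClassLoader temporary) {
        Thread t = Thread.currentThread();
        ClassLoader original = t.getContextClassLoader();
        t.setContextClassLoader(temporary);
        return new TemporaryClassLoaderContext(t, original);
    }

    @Override
    public void close() {
        thread.setContextClassLoader(original);
    }
}
```

Typical usage:

```java
try (TemporaryClassLoaderContext ignored = TemporaryClassLoaderContext.of(pluginClassLoader)) {
    // everything here runs with pluginClassLoader as the thread context classloader
}
// the original context classloader is restored automatically here
```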

}

private void enqueueTaskUnsafe(SplitFetcherTask task) {
assert lock.isHeldByCurrentThread();
Carl-Zhou-CN (Member Author):

[image]
I believe this ReentrantLock usage is the same as shown in the image above: the assert check only runs during testing and debugging. It can be disabled in production, but it can also be enabled for added safety.
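For context, this matches the pattern documented in the ReentrantLock javadoc: assert statements are skipped by the JVM unless it is started with -ea (enable assertions), so the held-lock check costs nothing in production and can be switched on when extra safety is wanted. A minimal illustration:

```java
import java.util.concurrent.locks.ReentrantLock;

class Guarded {
    private final ReentrantLock lock = new ReentrantLock();

    // "Unsafe" methods assume the caller already holds the lock. The assert
    // documents that contract and, under `java -ea`, enforces it; without
    // -ea the JVM strips the check entirely, so production pays no cost.
    void mutateUnsafe() {
        assert lock.isHeldByCurrentThread();
        // ... mutate state guarded by the lock ...
    }
}
```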

Member:

OK

Hisoka-X (Member):

Please share more information about why we need to refactor the Kafka source, and what we gain from the refactor.

@Hisoka-X Hisoka-X removed this from the 2.3.8 milestone Aug 24, 2024
hailin0 (Member) commented Aug 24, 2024:

> Please share more information about why we need to refactor the Kafka source, and what we gain from the refactor.

+1

Carl-Zhou-CN (Member Author):

> Please share more information about why we need to refactor the Kafka source, and what we gain from the refactor.

Firstly, I believe there is room for improvement over the previous approach, in which a new consumer thread is created for each split and each consumer thread polls in its own loop. Instead, one consumer thread per parallel reader gives clearer offset management and lets us promptly remove fully consumed partitions to reduce overhead.
Secondly, streaming and batch processing can be integrated more seamlessly, eliminating the need for consumer threads to differentiate between streaming and batch behaviors. A rough sketch of the threading change follows below.
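To make the first point concrete, here is a rough sketch of the target model; the class and method names are illustrative, not the actual SeaTunnel classes. One KafkaConsumer per reader is assigned all of that reader's splits, and finished partitions are unassigned as they complete:

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// One fetch loop per parallel reader, rather than one thread per split.
class ReaderFetchLoop {
    private final KafkaConsumer<byte[], byte[]> consumer;
    private final Set<TopicPartition> assigned = new HashSet<>();

    ReaderFetchLoop(KafkaConsumer<byte[], byte[]> consumer) {
        this.consumer = consumer;
    }

    // Adding a split re-assigns the single consumer; no new thread is spawned.
    void addSplit(TopicPartition tp, long startOffset) {
        assigned.add(tp);
        consumer.assign(assigned);
        consumer.seek(tp, startOffset);
    }

    // Removing a finished split promptly shrinks the assignment, so the
    // consumer stops fetching for partitions that are already done.
    void finishSplit(TopicPartition tp) {
        assigned.remove(tp);
        consumer.assign(assigned);
    }

    void pollOnce() {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
        for (ConsumerRecord<byte[], byte[]> record : records) {
            // ... deserialize and emit; track per-partition offsets here ...
        }
    }
}
```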

Hisoka-X (Member):

> Firstly, I believe there is room for improvement over the previous approach, in which a new consumer thread is created for each split and each consumer thread polls in its own loop. Instead, one consumer thread per parallel reader gives clearer offset management and lets us promptly remove fully consumed partitions to reduce overhead.
> Secondly, streaming and batch processing can be integrated more seamlessly, eliminating the need for consumer threads to differentiate between streaming and batch behaviors.

Makes sense to me. Thanks @Carl-Zhou-CN!

@Hisoka-X Hisoka-X added this to the 2.3.8 milestone Aug 26, 2024
@Carl-Zhou-CN Carl-Zhou-CN requested a review from Hisoka-X August 26, 2024 07:27
Hisoka-X (Member):

Checking...

},
{
  rule_type = MAX
- rule_value = 149
+ rule_value = 99

Member:

Any reason for changing the test case config?

Carl-Zhou-CN (Member Author):

The original test case config did not play a role, because the test did not read the data.

Member:

Maybe there was no suitable comment explaining it. The purpose of this test case is to read no data and to ensure that the offset can still be committed to Kafka normally. Please refer to #7312.
cc @hailin0

@@ -234,4 +234,19 @@ private void addTaskUnsafe(SplitFetcherTask task) {
taskQueue.add(task);
nonEmpty.signal();
}

public void enqueueTask(SplitFetcherTask task) {
Carl-Zhou-CN (Member Author):

OK, the new method is redundant.
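For orientation, the hunk above is the classic lock-and-condition task queue: one public method takes the ReentrantLock and delegates to an "unsafe" method that must run under the lock, which is why a second public enqueue wrapper was redundant. A hedged sketch of that shape (SplitFetcherTask stubbed as Runnable; names illustrative):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class FetcherTaskQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition nonEmpty = lock.newCondition();
    private final Queue<Runnable> taskQueue = new ArrayDeque<>();

    // The one public entry point: acquire the lock, then delegate to the
    // unsafe method. A second public wrapper would just duplicate this.
    public void addTask(Runnable task) {
        lock.lock();
        try {
            addTaskUnsafe(task);
        } finally {
            lock.unlock();
        }
    }

    // Caller must hold the lock; wakes one consumer blocked in take().
    private void addTaskUnsafe(Runnable task) {
        assert lock.isHeldByCurrentThread();
        taskQueue.add(task);
        nonEmpty.signal();
    }

    public Runnable take() throws InterruptedException {
        lock.lock();
        try {
            while (taskQueue.isEmpty()) {
                nonEmpty.await();
            }
            return taskQueue.poll();
        } finally {
            lock.unlock();
        }
    }
}
```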

@github-actions github-actions bot added the kafka label Aug 28, 2024
Hisoka-X previously approved these changes Aug 30, 2024
Hisoka-X (Member) left a comment:

LGTM. Thanks @Carl-Zhou-CN

} else {
deserializationSchema.deserialize(consumerRecord.value(), outputCollector);
}
splitState.setCurrentOffset(consumerRecord.offset() + 1);
Member:

Why do we need +1?

Member:

Please add a comment.
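For reference (this is Kafka's documented convention, not text from the PR): the position stored for a partition should be lastProcessedOffset + 1, because both committed offsets and seek positions denote the next record to read. A comment along these lines would answer the question above:

```java
// Kafka convention: the stored position is the offset of the NEXT record to
// read, i.e. lastProcessedOffset + 1. Storing consumerRecord.offset() itself
// would re-deliver the last processed record after a restart/seek.
splitState.setCurrentOffset(consumerRecord.offset() + 1);
```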

hailin0 (Member) left a comment:

LGTM


Hisoka-X (Member) left a comment:

LGTM if CI passes. Thanks @Carl-Zhou-CN!

hailin0 merged commit 10598b6 into apache:dev on Sep 5, 2024
3 checks passed