
[Feature][Elastic search] Support multi-table source feature #7502

Merged
merged 22 commits into from
Sep 4, 2024

Conversation


@FuYouJ FuYouJ commented Aug 26, 2024

Purpose of this pull request

[screenshot from the original PR description omitted]

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@DisabledOnContainer(
value = {},
type = {EngineType.SPARK, EngineType.FLINK},
disabledReason = "Currently SPARK/FLINK do not support multiple table read")
Member

Spark already supports multiple tables. You need to enable the test case.

Contributor Author

done

@FuYouJ FuYouJ requested a review from Carl-Zhou-CN August 30, 2024 01:59
@Hisoka-X Hisoka-X linked an issue Aug 30, 2024 that may be closed by this pull request
return getDocsWithTransformDate(source, index, Collections.emptyList());
}

//
Member

remove

Contributor Author

done

Map<String, BasicTypeDefine<EsType>> esFieldType =
esRestClient.getFieldTypeMapping(config.get(SourceConfig.INDEX), source);
esRestClient.getFieldTypeMapping(index, source);
esRestClient.close();
Member

Is there a risk that the connection is not closed?

Contributor Author

I implemented Closeable for EsRestClient and extracted the code into a separate method to ensure that the resource is closed even if an exception occurs:

    private Map<String, BasicTypeDefine<EsType>> getFieldTypeMapping(
            String index, List<String> source) {
        // EsRestClient#getFieldTypeMapping may throw a runtime exception,
        // so we use try-with-resources here to ensure the client is closed
        try (EsRestClient esRestClient = EsRestClient.createInstance(connectionConfig)) {
            return esRestClient.getFieldTypeMapping(index, source);
        }
    }
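To illustrate why try-with-resources is sufficient here, a standalone sketch (with a hypothetical `FakeClient` class standing in for `EsRestClient`, not SeaTunnel's actual code) showing that `close()` runs even when the body throws a runtime exception:

```java
// Hypothetical stand-in for EsRestClient: a resource whose work method throws.
public class TryWithResourcesDemo {

    public static class FakeClient implements AutoCloseable {
        public static boolean closed = false;

        public void getFieldTypeMapping() {
            // Simulate the runtime exception mentioned in the comment above.
            throw new RuntimeException("mapping request failed");
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        try (FakeClient client = new FakeClient()) {
            client.getFieldTypeMapping();
        } catch (RuntimeException e) {
            // close() has already run by the time the exception reaches us
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("closed=" + FakeClient.closed);
    }
}
```

The compiler inserts the `close()` call in an implicit finally block, so no explicit cleanup code is needed in the caller.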

esRestClient.getFieldTypeMapping("st_index4", Lists.newArrayList());
Thread.sleep(2000);
esRestClient.getFieldTypeMapping(indexName, Lists.newArrayList());
Thread.sleep(5000);
Member

These sleep delays can be unified into a single parameter.

Contributor Author

@FuYouJ FuYouJ Aug 31, 2024

Now all refresh wait times use a single static constant:

private static final long INDEX_REFRESH_MILL_DELAY = 5000L;

@FuYouJ FuYouJ requested a review from Carl-Zhou-CN August 31, 2024 07:28
Carl-Zhou-CN
Carl-Zhou-CN previously approved these changes Sep 1, 2024
Member

@Carl-Zhou-CN Carl-Zhou-CN left a comment

LGTM

tls_verify_hostname = false

index = "multi_source_write_test_index"
index_type = "st"
Member

You should test this feature using multi-table writes.

Check the rows and columns of the read and written tables.

https://github.com/apache/seatunnel/pull/7052/files#diff-9c8de872b906856b7dccdbab2dbe4a0ca089be7ce5fd89097aeec8ba82540c27

Contributor Author

@FuYouJ FuYouJ Sep 2, 2024

Based on your suggestion, I modified my e2e test.
Now it reads different fields from different indices and writes them into different target indices.

env {
  parallelism = 1
  job.mode = "BATCH"
  #checkpoint.interval = 10000
}

source {
  Elasticsearch {
    hosts = ["https://elasticsearch:9200"]
    username = "elastic"
    password = "elasticsearch"
    tls_verify_certificate = false
    tls_verify_hostname = false
    index_list = [
       {
           index = "read_filter_index1"
           query = {"range": {"c_int": {"gte": 10, "lte": 20}}}
           source = [
           c_map,
           c_array,
           c_string,
           c_boolean,
           c_tinyint,
           c_smallint,
           c_bigint,
           c_float,
           c_double,
           c_decimal,
           c_bytes,
           c_int,
           c_date,
           c_timestamp,
           c_null
           ]
           array_column = {
           c_array = "array<tinyint>"
           }
       }
       {
           index = "read_filter_index2"
           query = {"range": {"c_int2": {"gte": 10, "lte": 20}}}
           source = [
           c_int2,
           c_null2,
           c_date2
           ]

       }

    ]

  }
}

transform {
}

sink {
  Elasticsearch {
    hosts = ["https://elasticsearch:9200"]
    username = "elastic"
    password = "elasticsearch"
    tls_verify_certificate = false
    tls_verify_hostname = false

    index = "${table_name}_copy"
    index_type = "st"
    "schema_save_mode"="CREATE_SCHEMA_WHEN_NOT_EXIST"
    "data_save_mode"="APPEND_DATA"
  }
}

Comment on lines 40 to 76
c_map,
c_array,
c_string,
c_boolean,
c_tinyint,
c_smallint,
c_bigint,
c_float,
c_double,
c_decimal,
c_bytes,
c_int,
c_date,
c_timestamp
]
array_column = {
c_array = "array<tinyint>"
}
}
{
index = "read_index2"
query = {"range": {"c_int": {"gte": 10, "lte": 20}}}
source = [
c_map,
c_array,
c_string,
c_boolean,
c_tinyint,
c_smallint,
c_bigint,
c_float,
c_double,
c_decimal,
c_bytes,
c_int,
c_date,
c_timestamp
Member

You should define the indexes with different fields, e.g.:

index_1: x, y, z...
index_2: a, b, c...

Contributor Author

@FuYouJ FuYouJ Sep 2, 2024

Based on your suggestion, I modified my e2e test.
Now it reads different fields from different indices and writes them into different target indices.


@@ -55,7 +52,7 @@ public String factoryIdentifier() {
@Override
public OptionRule optionRule() {
return OptionRule.builder()
.required(HOSTS, INDEX)
Member

Is it compatible with old versions?

Contributor Author

@FuYouJ FuYouJ Sep 3, 2024

Fully compatible with previous configurations. In the ElasticsearchSource code, the first step is to check for the existence of index_list: if it exists, the configuration is parsed as multiple tables; if it does not exist, it is parsed as a single table.

    public ElasticsearchSource(ReadonlyConfig config) {
        this.connectionConfig = config;
        boolean multiSource = config.getOptional(SourceConfig.INDEX_LIST).isPresent();
        boolean singleSource = config.getOptional(SourceConfig.INDEX).isPresent();
        if (multiSource && singleSource) {
            log.warn(
                    "Elasticsearch Source config warn: when both 'index' and 'index_list' are present in the configuration, only the 'index_list' configuration will take effect");
        }
        if (!multiSource && !singleSource) {
            throw new ElasticsearchConnectorException(
                    ElasticsearchConnectorErrorCode.SOURCE_CONFIG_ERROR_01,
                    ElasticsearchConnectorErrorCode.SOURCE_CONFIG_ERROR_01.getDescription());
        }
        if (multiSource) {
            this.sourceConfigList = createMultiSource(config);
        } else {
            this.sourceConfigList = Collections.singletonList(parseOneIndexQueryConfig(config));
        }
    }
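To make the compatibility concrete, a minimal sketch of a legacy single-index source block (host, credential, and index values are placeholders, not from this PR) that the constructor above still parses as a single table because only `index` is present:

```hocon
source {
  Elasticsearch {
    hosts = ["https://elasticsearch:9200"]
    username = "elastic"
    password = "elasticsearch"
    # Old-style single-table configuration: no index_list key,
    # so the source falls through to parseOneIndexQueryConfig(config).
    index = "read_index1"
  }
}
```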

docs/en/connector-v2/source/Elasticsearch.md (resolved)
@hailin0
Member

hailin0 commented Sep 3, 2024

What will happen if I configure index and index_list at the same time?

@FuYouJ
Contributor Author

FuYouJ commented Sep 3, 2024

What will happen if I configure index and index_list at the same time?

The program will print a warning log telling the user the processing priority of the configuration.

if (multiSource && singleSource) {
    log.warn(
            "Elasticsearch Source config warn: when both 'index' and 'index_list' are present in the configuration, only the 'index_list' configuration will take effect");
}

Member

@hailin0 hailin0 left a comment

LGTM. Thanks @FuYouJ

Waiting for CI to pass.

Member

@Carl-Zhou-CN Carl-Zhou-CN left a comment

+1

@Carl-Zhou-CN
Member

Thank you for your contribution @FuYouJ

@Carl-Zhou-CN Carl-Zhou-CN merged commit 29fbeb2 into apache:dev Sep 4, 2024
8 checks passed
Successfully merging this pull request may close these issues.

[Feature][Elastic search] Support multi-table source feature
3 participants