-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incubating Program: TiDB Doris connector #487
Comments
|
LGTM |
Realized that there is dependency to TiDB binlog, since TiCDC is going to replace binlog eventually. Can we start with TiCDC instead? |
Of course, you can use any way you want, binlog is just an example |
Cool, this is my only comment. Rest LGTM. |
TiDB Doris connector
Describe the project you want to incubate:
Summary
TiDB and Doris are both decent opensource relational data database. TiDB is a fusion distributed database product, while Doris is more focused on the OLAP field. If TiDB can hand over part of the OLAP analysis task to Doris, it can realize efficient large-scale OLAP analysis with the help of Doris's materialized view, aggregation model and other advantages. Doris design documentation and related content can be found on the official website of Doris.
EXPLANATION OF THE DORIS VOCABULARY
MAIN MODULES
In Doris, the streaming import functionality is mainly divided into the following layers
Parsing
This layer includes the HTTP request parsing on BE and the generation portion of the import plan.
Executive
This layer includes the execution process of the import plan on BE, including the process of reading, converting and distributing data.
Transaction management
Any streaming import is treated as an atomic transaction in Doris. This layer is mainly responsible for the start, commit, and rollback of the import transaction to ensure the atomic of the import.
Storage
This layer includes the steps to store the imported data after BE receives it, which will not be specifically described in this article.
DATA MODEL
In Doris, data is logically described in the form of tables. A table contains rows and columns. A Row is a Row of data for the user. Column is used to describe the different fields in a Row of data.
Column can be divided into two broad categories: Key and Value. From a business perspective, Key and Value can correspond to dimension columns and metric columns, respectively.
The data model of Doris is mainly divided into three categories:
Aggregate
Aggregation model. When we import the data, the same rows for the Key column are aggregated into one row, while the Value column is aggregated according to the set AggregationType. There are currently four types in AggregationType:
Uniq
In some multi-dimensional analysis scenarios, users are more concerned about how to ensure the uniqueness of Key, that is, how to obtain the uniqueness constraint of Primary Key. Therefore, we introduce the data model of UNIQ. This model is essentially a special case of the aggregate model and a simplified representation of the table structure.
Duplicate
The data has neither primary keys nor aggregation requirements.
IMPORT DATA
There are many ways to import data in Doris. There are two ways that are directly related to this Hackathon: Stream Load and RoutineLoad.
Stream load
The user submits the request over the HTTP protocol and creates the import with the raw data. Mainly used to quickly import data from a local file or data stream into the Doris. The import command returns the import results synchronously. Reference documentation
Routine load
The user submits a routine import job via the MySQL protocol, spawns a permanent thread that continuously reads data from a data source (such as Kafka) and imports it into Doris. Reference documentation
In a variety of storage formats, Aggregate and Duplicate only support INSERT and DELETE, while UNIQ supports batch addition and update through import, which can realize data modification in another way
The data source can choose TIDB binlog and read the binlog to synchronize the data incrementally into the Doris.
DESIGN
When importing data, you can choose the storage model of the table according to the actual situation. If only the newly added table can be any model, and if the data is modified or deleted, it is recommended to use the UNIQ model, and use the Batch Delete function to add a flag column to identify the operation type of the data. You can refer to the documentation for more detailed information.
There are many kinds of implementation schemes. The following design ideas are given.
Stream Load Scheme
This method is relatively simple to implement and design. It can start or design an independent service in TIDB, read the TIDB binlog file on time and regularly, parse the contents of the binlog, assemble the data rows into the CSV format supported by Doris, and import them into Doris through stream load.
Routine Load scheme
This approach uses TIDB's Drainer to synchronize the binlog to Kafka, and a new TiDB binlog data format is added to Doris to synchronize the data.
TIDB native protocol synchronization scheme
The synchronization protocol of TIDB replica is implemented in Doris, which disguises Doris as a node of TIDB cluster, takes over the data synchronization request from TIDB, parses the data format, and writes the data into Doris
Your RFC/Proposal?
The text was updated successfully, but these errors were encountered: