Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use md5 checksum instead of crc32 #787

Merged
merged 12 commits into from
Aug 9, 2024

Conversation

michaelmdeng
Copy link
Contributor

@michaelmdeng michaelmdeng commented May 7, 2024

What problem does this PR solve?

Issue Number: close #634 #703

What is changed and how it works?

Change checksum algorithm to use md5 instead of crc32 to minimize chance of collision.

Essentially a copy of #707 that addresses merge conflicts and simplifies md5 checksum. Since md5 produces a 128-bit checksum, previous iteration attempted to track the checksum as two uint64s for lhs/rhs. This change simplifies the checksum query to bit_xor the lhs and the rhs into a single uint64 output.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Some previous concerns about pingcap/tidb#39576 and optimizing plan for checksum query rather than performing the same checksum operation twice for each lhs/rhs. Here's the explain output of the checksum query against a test table. I believe this shows that planner is pushing down the checksum functions down to the tikv coprocessor.

> explain SELECT COUNT(*) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `id`, `name`, CONCAT(isnull(`id`), isnull(`name`)))), 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `id`, `name`, CONCAT(isnull(`id`), isnull(`name`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5 FROM mysql_testing_primary.clusters;
+----------------------------+---------------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                         | estRows       | task      | access object                                      | operator info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+----------------------------+---------------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| StreamAgg_20               | 1.00          | root      |                                                    | funcs:count(Column#14)->Column#8, funcs:bit_xor(Column#15)->Column#9, funcs:bit_xor(Column#16)->Column#10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| └─IndexReader_21           | 1.00          | root      |                                                    | index:StreamAgg_8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|   └─StreamAgg_8            | 1.00          | cop[tikv] |                                                    | funcs:count(1)->Column#14, funcs:bit_xor(cast(conv(substring(md5(concat_ws(",", cast(mysql_testing_primary.clusters.id, var_string(20)), mysql_testing_primary.clusters.name, concat("0", cast(isnull(mysql_testing_primary.clusters.name), var_string(20))))), 1, 16), 16, 10), bigint(22) UNSIGNED BINARY))->Column#15, funcs:bit_xor(cast(conv(substring(md5(concat_ws(",", cast(mysql_testing_primary.clusters.id, var_string(20)), mysql_testing_primary.clusters.name, concat("0", cast(isnull(mysql_testing_primary.clusters.name), var_string(20))))), 17, 16), 16, 10), bigint(22) UNSIGNED BINARY))->Column#16 |
|     └─IndexFullScan_19     | 3027392465.00 | cop[tikv] | table:clusters, index:index_clusters_on_name(name) | keep order:false                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+----------------------------+---------------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.08 sec)

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation
  • Need to be included in the release note

@CLAassistant
Copy link

CLAassistant commented May 8, 2024

CLA assistant check
All committers have signed the CLA.

@michaelmdeng michaelmdeng force-pushed the mdeng/md5-checksum branch from 15a8a93 to dff78be Compare May 8, 2024 15:00
@ti-chi-bot ti-chi-bot bot added size/M and removed size/L labels May 8, 2024
@fzzf678 fzzf678 requested review from Leavrth, 3pointer and dveeden and removed request for dveeden May 23, 2024 16:17
@ti-chi-bot ti-chi-bot bot added the lgtm label Aug 9, 2024
Copy link

ti-chi-bot bot commented Aug 9, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-08-09 03:08:29.842323853 +0000 UTC m=+580639.709422926: ☑️ agreed by Leavrth.

@Leavrth
Copy link
Contributor

Leavrth commented Aug 9, 2024

/merge

Copy link

ti-chi-bot bot commented Aug 9, 2024

@Leavrth: We have migrated to builtin LGTM and approve plugins for reviewing.

👉 Please use /approve when you want approve this pull request.

The changes announcement: LGTM plugin changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@Leavrth
Copy link
Contributor

Leavrth commented Aug 9, 2024

/test all

@Leavrth
Copy link
Contributor

Leavrth commented Aug 9, 2024

/test unit-test

Copy link

ti-chi-bot bot commented Aug 9, 2024

@Leavrth: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-verify

Use /test all to run all jobs.

In response to this:

/test unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Leavrth
Copy link
Contributor

Leavrth commented Aug 9, 2024

/test pull-unit-test

Copy link

ti-chi-bot bot commented Aug 9, 2024

@Leavrth: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-verify

Use /test all to run all jobs.

In response to this:

/test pull-unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lance6716
Copy link
Collaborator

/test pull-verify

@Leavrth
Copy link
Contributor

Leavrth commented Aug 9, 2024

/LGTM

Copy link

ti-chi-bot bot commented Aug 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Leavrth

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wuhuizuo
Copy link
Contributor

wuhuizuo commented Aug 9, 2024

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test label Aug 9, 2024
@ti-chi-bot ti-chi-bot bot merged commit 21a5788 into pingcap:master Aug 9, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BIT_ XOR function has a collision example that is easy to occur
6 participants