
Send rows in binary mode for ANALYZE #601

Merged
merged 3 commits on Aug 28, 2024

Conversation

Light-City
Contributor

fix #569


Change logs

Send results of the `select pg_catalog.gp_acquire_sample_rows` query in binary mode.
This avoids an overflow for the maximum double value.
For example, running the following prior to this fix:

    set extra_float_digits to 0;
    create table t (a double precision);
    insert into t values (1.7976931348623157e+308);
    analyze t;

printed the following message:

    ERROR: value out of range: overflow

The related commits were therefore cherry-picked from gpdb-archive.

get correlation from segments instead of calculating it on QD (#15357)

Consider oversized rows for stawidth computation

Send rows in binary mode for ANALYZE

Why are the changes needed?

  • Fixes ANALYZE and analyze-related functions.
  • Adds related tests.

Does this PR introduce any user-facing change?

no.

How was this patch tested?

ICW test: analyze.sql

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • List your communication in the GitHub Issues or Discussions (if any).
  • Document changes.
  • Add tests for the change
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

zxuejing and others added 2 commits August 26, 2024 18:24
We cannot use the same method as PostgreSQL to calculate the correlation on the QD.
Collecting data from the segments onto the QD changes the physical order of the data.
For example, segment 1 holds the data 1,3,5,7,9 and segment 2 holds 2,4,6,8,10.
Within each segment the data is ordered, so the correlation is 1 on each segment.
But after the data is collected onto the QD it may arrive as 1,3,5,2,4,7,9,6,8,10, and the
correlation becomes 0.3 or some other unstable value.
This inflates the cost of index scans, which shouldn't happen.

So fetch the per-segment correlations from the segments and then calculate the correlation on the QD from them.

We use a weighted mean to combine the correlations on the QD.
However, in some situations we cannot obtain the reltuples of a table,
such as a non-leaf partition of a partitioned table or the parent table of an
inherited table. For these tables we can only use a plain mean to calculate the correlation.
Send results of the `select pg_catalog.gp_acquire_sample_rows` query in binary mode.
This avoids an overflow for the maximum double value.

For example, running the following prior to this fix:

    set extra_float_digits to 0;
    create table t (a double precision);
    insert into t values (1.7976931348623157e+308);
    analyze t;

printed the following message:

    ERROR: value out of range: overflow

In text mode (the default), when ANALYZE is performed on a table, the
master calls the gp_acquire_sample_rows() helper function on each
segment. That eventually calls the float8out function on the segment to
convert a float8 number to a string with snprintf:

    snprintf(ascii, MAXDOUBLEWIDTH + 1, "%.*g", ndig, num);

When ndig is 15, the maximum float8 value 1.7976931348623157e+308 is
rounded to "1.79769313486232e+308", which has no float8 representation.

On the master, the acquire_sample_rows_dispatcher function processes the
gp_acquire_sample_rows result, and eventually the float8in function is
called to convert the string back to a float8 with strtold:

    val = strtold(num, &endptr);

This is where the overflow happens for "1.79769313486232e+308", although it works
fine for "1.7976931348623157e+308".

Transferring in binary mode avoids the conversion from double to
string on the segments and back to double on the master, and is also
much faster than before.

Using CdbDispatchPlan instead of CdbDispatchCommand allows the data to be
received in binary mode as MemTuples, which is much faster than before.
A tuplestore is used to hold the received tuples, to avoid using too much memory.

Co-authored-by: zxuejing <[email protected]>
@CLAassistant

CLAassistant commented Aug 26, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ zxuejing
❌ yanwr1

@Light-City force-pushed the fix/analyze_fix branch 2 times, most recently from 684e4e3 to d3948c6 on August 26, 2024 15:32
gp_bloat_diag uses the statistics (including stawidth) in
pg_statistic to compute the expected number of pages for a given
table.

To avoid consuming too much memory during analysis and/or too much
space in the resulting pg_statistic rows, ANALYZE ignores varlena
datums that are wider than WIDTH_THRESHOLD. As a result, the computed average
width of column values can be far smaller than the real average width, especially
for varlena datums that are larger than WIDTH_THRESHOLD but stored
uncompressed.

The wrong stawidth value causes the gp_bloat_diag view to report that tables
have bloat and require a "VACUUM" or "VACUUM FULL", although these tables
are not in fact bloated.

'oversized_cols_length' is returned as an array containing the oversized
attributes' lengths. As this array can also be used to distinguish real
NULLs from values that were too large to be included in the sample,
the 'oversized_cols_bitmap' results are removed from the sample row.

Include the total length and number of oversized rows when computing
stawidth.
@my-ship-it merged commit 87ec835 into apache:main on Aug 28, 2024
10 of 11 checks passed

Successfully merging this pull request may close these issues.

[Bug] Analyze Overflow
5 participants