
Send rows in binary mode for ANALYZE #601

Merged
merged 3 commits on Aug 28, 2024

Conversation

Light-City
Contributor

fix #569


Change logs

Send results of the `select pg_catalog.gp_acquire_sample_rows` query in binary mode.
This avoids an overflow for the maximum double value.
For example, running the following prior to this fix:

    set extra_float_digits to 0;
    create table t (a double precision);
    insert into t values (1.7976931348623157e+308);
    analyze t;

printed the following message:

    ERROR: value out of range: overflow

The related commits were therefore cherry-picked from gpdb-archive.

get correlation from segments instead of calculating it on QD (#15357)

Consider oversized rows for stawidth computation

Send rows in binary mode for ANALYZE

Why are the changes needed?

  • Fixes ANALYZE and analyze-related functions.
  • Adds related tests.

Does this PR introduce any user-facing change?

no.

How was this patch tested?

ICW test: analyze.sql

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • List your communication in the GitHub Issues or Discussions (if any).
  • Document changes.
  • Add tests for the change
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

zxuejing and others added 2 commits August 26, 2024 18:24
We cannot use the same method as PostgreSQL to calculate the correlation on the QD.
Collecting data from the segments onto the QD changes the physical order of the data.
For example, segment 1 holds the data 1,3,5,7,9 and segment 2 holds 2,4,6,8,10.
Within each segment the data is ordered, so the correlation is 1 on each segment.
But after the data is collected onto the QD it may arrive as 1,3,5,2,4,7,9,6,8,10, and the
correlation becomes 0.3 or some other unstable value.
This inflates the cost of index scans, which shouldn't happen.

So fetch the per-segment correlations from the segments and then calculate the correlation on the QD from them.

We use a weighted mean to combine the correlations on the QD.
However, in some situations we cannot obtain the reltuples of a table,
such as a non-leaf partition of a partitioned table or the parent table of an
inherited table. For these tables we can only use a plain mean to calculate the correlation.
Send results of the `select pg_catalog.gp_acquire_sample_rows` query in binary mode.
This avoids an overflow for the maximum double value.

For example, running the following prior to this fix:

    set extra_float_digits to 0;
    create table t (a double precision);
    insert into t values (1.7976931348623157e+308);
    analyze t;

printed the following message:

    ERROR: value out of range: overflow

In text mode (the default), when ANALYZE is performed on a table, the
master calls the gp_acquire_sample_rows() helper function on each
segment. That eventually calls the float8out function on the segment to
convert a float8 number to a string with snprintf:

    snprintf(ascii, MAXDOUBLEWIDTH + 1, "%.*g", ndig, num);

When ndig is 15, the maximum float8 value 1.7976931348623157e+308 is
rounded to "1.79769313486232e+308", which has no float8 representation.

On the master, the acquire_sample_rows_dispatcher function processes the
gp_acquire_sample_rows result, and eventually the float8in function is
called to convert the string back to a float8 with strtold:

    val = strtold(num, &endptr);

This is where the overflow happens for "1.79769313486232e+308", although it works
fine for "1.7976931348623157e+308".

Transferring in binary mode avoids the conversion from double to
string on the segments and back to double on the master, and is also
much faster than before.

Using CdbDispatchPlan instead of CdbDispatchCommand allows the data to be
received in binary mode as MemTuples, which is much faster than before.
A tuplestore is used to hold the received tuples, to avoid using too much memory.

Co-authored-by: zxuejing <[email protected]>
@CLAassistant

CLAassistant commented Aug 26, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ zxuejing
❌ yanwr1

@Light-City force-pushed the fix/analyze_fix branch 2 times, most recently from 684e4e3 to d3948c6 on August 26, 2024 15:32
gp_bloat_diag uses the statistics (including stawidth) in
pg_statistic to compute the expected number of pages for a given
table.

To avoid consuming too much memory during analysis and/or too much
space in the resulting pg_statistic rows, ANALYZE ignores varlena
datums that are wider than WIDTH_THRESHOLD. As a result, the computed average
width of column values can be far smaller than the real average width, especially
for varlena datums that are larger than WIDTH_THRESHOLD but stored
uncompressed.

The wrong stawidth value causes the gp_bloat_diag view to report that tables
have bloat and require a "VACUUM" or "VACUUM FULL", although these tables
are not in fact bloated.

'oversized_cols_length' is returned as an array containing the oversized
attributes' lengths. As this array can also be used to distinguish real
NULLs from values that were too large to be included in the sample,
the 'oversized_cols_bitmap' results are removed from the sample row.

Include the total length and number of oversized rows when computing
stawidth.
@my-ship-it merged commit 87ec835 into apache:main on Aug 28, 2024
10 of 11 checks passed

Successfully merging this pull request may close these issues.

[Bug] Analyze Overflow
5 participants