Send rows in binary mode for ANALYZE #601
Merged
Conversation
We cannot use the same method as PostgreSQL to calculate the correlation on the QD. Gathering the sample data from the segments to the QD changes the physical order of the data. For example, segment 1 holds 1,3,5,7,9 and segment 2 holds 2,4,6,8,10: within each segment the data is ordered and the correlation is 1, but after the rows are gathered on the QD they may arrive as 1,3,5,2,4,7,9,6,8,10, whose correlation is 0.3 or some other unstable value. That incorrectly inflates the estimated cost of an index scan. So instead we get the correlations from the segments and combine them on the QD with a weighted mean, weighting each segment's correlation by its tuple count (reltuples), as sketched below. However, in some situations we cannot obtain reltuples for a table, such as a non-leaf part of a partitioned table or the parent table of an inheritance hierarchy; for those tables we can only use a plain mean.
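A minimal sketch of that combination step, using hypothetical names (`seg_corr`, `seg_reltuples`, `combine_correlation`) rather than the actual Cloudberry code:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch (not the actual implementation): combine per-segment
 * correlations on the QD using a reltuples-weighted mean, falling back to a
 * plain mean when reltuples is unavailable.
 */
static double
combine_correlation(const double *seg_corr, const double *seg_reltuples,
					int nseg, bool have_reltuples)
{
	double		sum = 0.0;
	double		total_tuples = 0.0;

	if (nseg <= 0)
		return 0.0;

	if (have_reltuples)
	{
		for (int i = 0; i < nseg; i++)
		{
			sum += seg_corr[i] * seg_reltuples[i];
			total_tuples += seg_reltuples[i];
		}
		if (total_tuples > 0)
			return sum / total_tuples;	/* weighted mean */
	}

	/* reltuples unknown (e.g. non-leaf partition): plain mean */
	sum = 0.0;
	for (int i = 0; i < nseg; i++)
		sum += seg_corr[i];
	return sum / nseg;
}
```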
Send the results of the `select pg_catalog.gp_acquire_sample_rows(...)` query in binary mode. That avoids an overflow for the maximum double value. For example, running the following prior to this fix:

```sql
set extra_float_digits to 0;
create table t (a double precision);
insert into t values (1.7976931348623157e+308);
analyze t;
```

prints the following message:

```
ERROR: value out of range: overflow
```

In text mode (the default), when a table is analyzed the master calls the gp_acquire_sample_rows() helper function on each segment. That eventually calls float8out on the segment, which converts the float8 number to a string with snprintf:

```c
snprintf(ascii, MAXDOUBLEWIDTH + 1, "%.*g", ndig, num);
```

When ndig is 15, the maximum float8 value 1.7976931348623157e+308 is rounded to "1.79769313486232e+308", which has no float8 representation. On the master, the acquire_sample_rows_dispatcher function processes the gp_acquire_sample_rows result and eventually calls float8in, which converts the string back to float8 with strtold:

```c
val = strtold(num, &endptr);
```

This is where the overflow happens for "1.79769313486232e+308", whereas "1.7976931348623157e+308" would parse fine. Transferring the rows in binary mode avoids the conversion from double to string on the segments and back to double on the master, and it is also much faster than before.

Using CdbDispatchPlan instead of CdbDispatchCommand allows the data to be received in binary mode as MemTuples, which is much faster than before. The received tuples are stored in a tuplestore to avoid using too much memory.

Co-authored-by: zxuejing <[email protected]>
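For illustration, here is a small standalone C program reproducing the text-mode round trip described above; it uses plain libc strtod (instead of the strtold plus range check that float8in performs) and is not Cloudberry code:

```c
#include <errno.h>
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char	buf[64];
	double	max = DBL_MAX;	/* 1.7976931348623157e+308 */

	/* Segment side (float8out with ndig = 15): double -> string. */
	snprintf(buf, sizeof(buf), "%.*g", 15, max);
	printf("text form: %s\n", buf);		/* "1.79769313486232e+308" */

	/* Master side: parse the string back into a double. */
	errno = 0;
	double	back = strtod(buf, NULL);
	if (errno == ERANGE)
		printf("overflow: \"%s\" is out of range for double\n", buf);
	else
		printf("round-tripped value: %.17g\n", back);

	return 0;
}
```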
Light-City force-pushed the fix/analyze_fix branch 2 times, most recently from 684e4e3 to d3948c6 on August 26, 2024 15:32
gp_bloat_diag uses the statistics in pg_statistic (including stawidth) to compute the expected number of pages for a given table. To avoid consuming too much memory during analysis and/or too much space in the resulting pg_statistic rows, ANALYZE ignores varlena datums that are wider than WIDTH_THRESHOLD. As a result, the recorded average width of column values can be far smaller than the real average width, especially for varlena datums that are larger than WIDTH_THRESHOLD but stored uncompressed. The wrong stawidth value causes the gp_bloat_diag view to report that tables are bloated and require a VACUUM or VACUUM FULL, although these tables are not actually bloated.

'oversized_cols_length' is returned as an array containing the oversized attributes' lengths. Since this array can also be used to distinguish real NULLs from values that were too large to be included in the sample, the 'oversized_cols_bitmap' result is removed from the sample row. The total length and number of oversized rows are included when computing stawidth.
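To make that last point concrete, here is a rough sketch of such a computation; the function and variable names, and the assumption that per-column totals of sampled and oversized lengths are already at hand, are illustrative and not taken from the patch:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: fold oversized values back into the average width.
 * sample_len_sum / sample_cnt covers the values kept in the sample;
 * oversized_len_sum / oversized_cnt would come from the oversized
 * attributes' lengths reported alongside the sample rows.
 */
static int32_t
compute_stawidth(double sample_len_sum, uint64_t sample_cnt,
				 double oversized_len_sum, uint64_t oversized_cnt)
{
	double		total_len = sample_len_sum + oversized_len_sum;
	uint64_t	total_cnt = sample_cnt + oversized_cnt;

	if (total_cnt == 0)
		return 0;

	/* Rounded average width over sampled and oversized values together. */
	return (int32_t) (total_len / (double) total_cnt + 0.5);
}
```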
Light-City force-pushed the fix/analyze_fix branch from d42e4e8 to 860dab7 on August 26, 2024 23:30
my-ship-it approved these changes on Aug 28, 2024
fix #569
Change logs
Send the results of the `select pg_catalog.gp_acquire_sample_rows` query in binary mode. That avoids an overflow for the maximum double value (see the reproduction example in the commit message above), so cherry-pick the related commits from gpdb-archive:
- get correlation from segments instead of calculating it on QD (#15357)
- Consider oversized rows for stawidth computation
- Send rows in binary mode for ANALYZE
Why are the changes needed?
Does this PR introduce any user-facing change?
no.
How was this patch tested?
ICW test: analyze.sql
Contributor's Checklist
Here are some reminders and checklists for before/when submitting your pull request; please check them:
- make installcheck
- make -C src/test installcheck-cbdb-parallel
- Request the cloudberrydb/dev team for review and approval when your PR is ready 🥳