Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary #3085

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

raunaqmorarka
Copy link

Rationale for this change

The current default for V1 pages is PLAIN encoding. This encoding mixes string length with string data. This is inefficient for for skipping N values, as the encoding does not allow random access. It's also slow to decode as the interleaving of lengths with data does not allow efficient batched implementations and forces most implementations to make copies of the data to fit the usual representation of separate offsets and data for strings.

DELTA_LENGTH_BYTE_ARRAY has none of the above problems as it separates offsets and data. The parquet-format spec also seems to recommend this https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

The current default for V1 pages is PLAIN encoding. This encoding mixes string length with string data.
This is inefficient for for skipping N values, as the encoding does not allow random access.
It's also slow to decode as the interleaving of lengths with data does not allow efficient batched
implementations and forces most implementations to make copies of the data to fit the
usual representation of separate offsets and data for strings.

DELTA_LENGTH_BYTE_ARRAY has none of the above problems as it separates offsets and data.
The parquet-format spec also seems to recommend this
https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
@Fokko
Copy link
Contributor

Fokko commented Nov 28, 2024

Hey @raunaqmorarka thanks for raising this. I think we want to discuss on the devlist first if we want to change behavior. Would you be interested to raise this?

@raunaqmorarka
Copy link
Author

Hey @raunaqmorarka thanks for raising this. I think we want to discuss on the devlist first if we want to change behavior. Would you be interested to raise this?

I'm not sure how to start a discussion on the devlist, I don't have credentials to login there.
It would be nice to discuss on the GH issue #3083 if that's possible

@wgtmac
Copy link
Member

wgtmac commented Nov 28, 2024

@raunaqmorarka You can send an email to [email protected] to subscribe. If you don't want to subscribe, you may directly send an email to [email protected]. You can see https://lists.apache.org/[email protected] for reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants