GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary #3085

raunaqmorarka · 2024-11-28T13:47:31Z

Rationale for this change

The current default for V1 pages is PLAIN encoding, which interleaves each string's length with its data. This makes skipping N values inefficient, because the encoding does not allow random access. It is also slow to decode: interleaving lengths with data prevents efficient batched implementations and forces most implementations to copy the data into the usual representation of separate offsets and data for strings.

DELTA_LENGTH_BYTE_ARRAY has none of the above problems because it stores the lengths separately from the data. The parquet-format spec also seems to recommend this: https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
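To make the layout difference concrete, here is a minimal Python sketch. It is not parquet-java code, and all function names are hypothetical. `encode_plain` follows the PLAIN layout for byte arrays (a 4-byte little-endian length before each value). `encode_split` is a simplified stand-in for DELTA_LENGTH_BYTE_ARRAY: it keeps the "all lengths first, then all bytes" structure but leaves the lengths as a plain Python list, whereas the real encoding stores them with DELTA_BINARY_PACKED.

```python
import struct
from itertools import accumulate


def encode_plain(values):
    """PLAIN layout: each value is a 4-byte little-endian length, then its bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def skip_plain(buf, n):
    """Return the byte position of value n in a PLAIN buffer.

    Every preceding length prefix has to be read, one value at a time.
    """
    pos = 0
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + length
    return pos


def read_plain(buf, pos):
    """Read the single value whose length prefix starts at pos."""
    (length,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + length]


def encode_split(values):
    """Simplified stand-in for DELTA_LENGTH_BYTE_ARRAY: lengths, then all bytes.

    Assumption: lengths stay a plain list here instead of being
    DELTA_BINARY_PACKED as in the real encoding.
    """
    return [len(v) for v in values], b"".join(values)


def offsets_from_lengths(lengths):
    """Prefix-sum the lengths into offsets; the data buffer is used as-is."""
    return [0, *accumulate(lengths)]


values = [b"parquet", b"", b"delta", b"byte_array"]

# PLAIN: reaching value 2 means walking over values 0 and 1.
plain = encode_plain(values)
third_plain = read_plain(plain, skip_plain(plain, 2))

# Split layout: one pass over the lengths gives offsets for direct slicing.
lengths, data = encode_split(values)
offsets = offsets_from_lengths(lengths)
third_split = data[offsets[2] : offsets[3]]
```

In the split layout the concatenated bytes already match the offsets-plus-data representation mentioned above, so a reader can decode all the lengths in one batch and reference the data buffer without copying each value out individually.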

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?


Fokko · 2024-11-28T14:05:36Z

Hey @raunaqmorarka thanks for raising this. I think we want to discuss on the devlist first if we want to change behavior. Would you be interested to raise this?

raunaqmorarka · 2024-11-28T14:45:43Z

> Hey @raunaqmorarka thanks for raising this. I think we want to discuss on the devlist first if we want to change behavior. Would you be interested to raise this?

I'm not sure how to start a discussion on the devlist; I don't have credentials to log in there.
It would be nice to discuss this on the GH issue #3083 if that's possible.

wgtmac · 2024-11-28T15:16:02Z

@raunaqmorarka You can send an email to [email protected] to subscribe. If you don't want to subscribe, you may directly send an email to [email protected]. You can see https://lists.apache.org/[email protected] for reference.
