Fail to read google.protobuf.UInt32Value from parquet #3112

Open
0x26res opened this issue Jan 3, 2025 · 0 comments · May be fixed by #3113

Comments


0x26res commented Jan 3, 2025

Describe the bug, including details regarding any error messages, version, and platform.

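TL;DR: the parquet protobuf reader fails on a google.protobuf.UInt32Value field when the column is stored as a plain uint32 rather than as a nested struct.

I have a protobuf message that uses wrapped unsigned and signed integers: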

syntax = "proto3";

package org.apache.parquet.test;

import "google/protobuf/wrappers.proto";

message MyTestMessage {
    google.protobuf.UInt32Value uint32_field = 11;
    google.protobuf.Int32Value int32_field = 12;
}

I then generate a parquet file for that data with pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(
    pa.table(
        {
            "uint32_field": pa.array([None, None, 28], pa.uint32()),
            "int32_field": pa.array([None, 28, 28], pa.int32()),
        }
    ),
    "/tmp/my_test_messages.parquet",
)

Then I try to read it using parquet-java (the test is written in Kotlin, but that doesn't matter):

package org.apache.parquet.test

import org.apache.parquet.test.MyTestMessage
import com.google.protobuf.Int32Value
import io.kotest.matchers.shouldBe
import org.apache.hadoop.fs.Path
import org.apache.parquet.proto.ProtoConstants
import org.apache.parquet.proto.ProtoParquetReader
import org.apache.parquet.proto.ProtoReadSupport
import org.junit.jupiter.api.Test

class TestUInt32Value {
  @Test
  fun `test can not load UInt32Value`() {
    val reader =
      ProtoParquetReader.builder<MyTestMessage.Builder>(
          Path("file:///tmp/my_test_messages.parquet")
        )
        .set(ProtoReadSupport.PB_CLASS, MyTestMessage::class.java.canonicalName)
        .set(ProtoConstants.CONFIG_IGNORE_UNKNOWN_FIELDS, "true")
        .build()
    val firstMessage = reader.read().build()
    firstMessage shouldBe MyTestMessage.getDefaultInstance()

    val secondMessage = reader.read().build()
    secondMessage shouldBe
      MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build()

    // Fails here: the third row is the first one with a non-null uint32_field.
    val thirdMessage = reader.read()
  }
}

I get this error when reading the third message:

    org.apache.parquet.io.ParquetDecodingException: Can not read value at 3 in block 0 in file file:/tmp/my_test_messages.parquet
        at app//org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
        at app//org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
        at app//org.apache.parquet.test.TestUInt32Value.test can load bad not nested plain(TestUInt32Value.kt:29)
        Caused by:
        java.lang.UnsupportedOperationException: org.apache.parquet.proto.ProtoMessageConverter$ProtoUInt32ValueConverter
            at org.apache.parquet.io.api.PrimitiveConverter.addInt(PrimitiveConverter.java:101)
            at org.apache.parquet.column.impl.ColumnReaderBase$2$3.writeValue(ColumnReaderBase.java:321)
            at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:486)
            at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
            at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
            at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
            ... 2 more
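The trace shows the exception coming from the base PrimitiveConverter.addInt, with the converter's class name as the message, which suggests that ProtoUInt32ValueConverter does not override addInt. The toy Python model below illustrates that dispatch pattern. It is not parquet-java code: the class and function names are made up for illustration, and the idea that the unsigned converter handles only a 64-bit callback is an assumption, not something the trace confirms.

```python
class PrimitiveConverter:
    """Toy base class: every typed add_* callback is unsupported unless overridden."""

    def add_int(self, value):
        raise NotImplementedError(type(self).__name__)

    def add_long(self, value):
        raise NotImplementedError(type(self).__name__)


class Int32ValueConverter(PrimitiveConverter):
    """Accepts 32-bit values, like the signed case that works in the report."""

    def __init__(self):
        self.values = []

    def add_int(self, value):
        self.values.append(value)


class UInt32ValueConverter(PrimitiveConverter):
    """Assumption: only the 64-bit callback is overridden, so add_int falls through."""

    def __init__(self):
        self.values = []

    def add_long(self, value):
        self.values.append(value)


def deliver_int32(converter, value):
    """Mimic a column reader handing a 32-bit physical value to a converter."""
    converter.add_int(value)


signed = Int32ValueConverter()
deliver_int32(signed, 28)

unsigned = UInt32ValueConverter()
try:
    deliver_int32(unsigned, 28)
    error = None
except NotImplementedError as exc:
    error = str(exc)

print(signed.values, error)
```

Under this model the signed converter records the value, while the unsigned one raises from the base class with its own class name, which mirrors the shape of the error above.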

A few things to note:

  • Reading works for the second message, which means the flat layout is handled correctly for the (signed) Int32Value.
  • It works if you generate the data using the JVM, but only because the resulting parquet table has a different structure: each wrapped value is a nested struct ({"value": 28}), as this round trip shows:
  @Test
  fun `test jvm round trip`() {

    val path = Path("file:///tmp/my_test_messages_jvm.parquet")

    ProtoParquetWriter.builder<MyTestMessage>(path)
      .withMessage(MyTestMessage::class.java)
      .build()
      .use {
        it.write(MyTestMessage.getDefaultInstance())
        it.write(MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build())
        it.write(
          MyTestMessage.newBuilder()
            .setInt32Field(Int32Value.of(1))
            .setUint32Field(UInt32Value.of(32))
            .build()
        )
        // use {} closes the writer on exit, so no explicit close() is needed
      }

    val reader =
      ProtoParquetReader.builder<MyTestMessage.Builder>(path)
        .set(ProtoReadSupport.PB_CLASS, MyTestMessage::class.java.canonicalName)
        .set(ProtoConstants.CONFIG_IGNORE_UNKNOWN_FIELDS, "true")
        .build()
    val firstMessage = reader.read().build()
    firstMessage shouldBe MyTestMessage.getDefaultInstance()

    val secondMessage = reader.read().build()
    secondMessage shouldBe MyTestMessage.newBuilder().setInt32Field(Int32Value.of(28)).build()

    val thirdMessage = reader.read().build()
    thirdMessage shouldBe MyTestMessage.newBuilder()
      .setInt32Field(Int32Value.of(1))
      .setUint32Field(UInt32Value.of(32))
      .build()
  }

The JVM writer is effectively producing a table that looks like this:

import pyarrow as pa

pa.table(
    {
        "uint32_field": pa.array(
            [None, None, {"value": 28}], pa.struct([("value", pa.uint32())])
        ),
        "int32_field": pa.array(
            [None, {"value": 28}, {"value": 28}], pa.struct([("value", pa.int32())])
        ),
    }
)

Component(s)

Protobuf

0x26res linked a pull request Jan 6, 2025 that will close this issue