arrow-dev mailing list archives

From Brian Bowman <>
Subject Re: Need 64-bit Integer length for Parquet ByteArray Type
Date Fri, 05 Apr 2019 18:29:17 GMT
My hope is that these large ByteArray values will encode/compress to a fraction of their original
size.  FWIW, cpp/src/parquet/<> has int64_t
offset and length fields all over the place.

External file references to BLOBs are doable, but not the elegant, integrated solution I was
hoping for.


On Apr 5, 2019, at 1:53 PM, Ryan Blue <<>> wrote:


Looks like we will need a new encoding for this:

That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a
quick fix.
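For context on why the plain encoding is the sticking point: each BYTE_ARRAY value is written as a 4-byte little-endian length prefix followed by the raw bytes, so no single value can exceed UINT32_MAX in that encoding. A minimal sketch of the layout (function name is illustrative, not from the parquet-cpp sources; assumes a little-endian host):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Writes one BYTE_ARRAY value in Parquet's PLAIN encoding:
// a 4-byte little-endian length, then the raw bytes. The fixed
// 32-bit prefix is what caps a single value at ~4 GiB.
std::vector<uint8_t> PlainEncodeByteArray(const uint8_t* data, uint32_t len) {
  std::vector<uint8_t> out(sizeof(uint32_t) + len);
  std::memcpy(out.data(), &len, sizeof(uint32_t));  // little-endian host assumed
  std::memcpy(out.data() + sizeof(uint32_t), data, len);
  return out;
}
```

Moving to a 64-bit length would change this on-disk layout, which is why it cannot be a drop-in fix.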

Now that I'm thinking about this a bit more, does it make sense to support byte arrays that
are more than 2GB? That's far larger than the size of a row group, let alone a page. This
would completely break memory management in the JVM implementation.

Can you solve this problem using a BLOB type that references an external file with the gigantic
values? Seems to me that values this large should go in separate files, not in a Parquet file
where it would destroy any benefit from using the format.
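One possible shape for the external-reference approach suggested here: store a small, fixed-size descriptor as an ordinary Parquet column and keep the oversized payload in a side file. All names below are hypothetical, a sketch rather than anything in the Parquet format:

```cpp
#include <cstdint>
#include <string>

// Hypothetical descriptor stored in the Parquet file in place of the
// blob itself. The actual bytes live in a separate side file, so the
// length is a full 64-bit value with no 2^32 cap.
struct BlobRef {
  std::string path;  // side file holding the raw bytes
  uint64_t offset;   // byte offset of the blob within that file
  uint64_t length;   // blob size in bytes
};
```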

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <<>> wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};
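A small demonstration of why this struct is the limiting factor: a 64-bit length silently wraps modulo 2^32 when narrowed to the existing `uint32_t` len field, so sizes beyond 4 GiB cannot round-trip through it (helper name is illustrative only):

```cpp
#include <cstdint>

// Narrowing a 64-bit length into ByteArray's uint32_t len field
// wraps modulo 2^32; e.g. a 5 GiB length comes back as 1 GiB.
uint32_t NarrowLen(uint64_t len64) {
  return static_cast<uint32_t>(len64);
}
```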

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)


On 4/5/19, 1:32 PM, "Ryan Blue" <<>> wrote:


    Hi Brian,

    This seems like something we should allow. What imposes the current limit?
    Is it in the thrift format, or just the implementations?

    On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <<>> wrote:

    > All,
    > SAS requires support for storing varying-length character and binary blobs
    > with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
    > a uint32_t.   Looks like this will require incrementing the Parquet file
    > format version and changing ByteArray len to uint64_t.
    > Have there been any requests for this or other Parquet developments that
    > require file format versioning changes?
    > I realize this is a non-trivial ask.  Thanks for considering it.
    > -Brian

    Ryan Blue
    Software Engineer

Ryan Blue
Software Engineer