arrow-user mailing list archives

From Jason Sachs <jmsa...@gmail.com>
Subject Re: Best way to store ragged packet data in Parquet files
Date Tue, 03 Nov 2020 21:26:34 GMT


On 2020/11/03 20:49:46, Micah Kornfield <emkornfield@gmail.com> wrote: 
> Hi Jason,
> At least as a first pass I would try to avoid the padding and storing the
> length separately in Parquet.  Using one column for timestamp and one
> column of bytes for the data is what I would try first.  If there is any
> structure to the packets splitting them into the structure could also help.
> 
> -Micah

For the test cases I have, >99% of the packets are the same length, so there's little to no
benefit to removing the padding; the length field and zero padding barely add anything once
you factor compression into the mix.
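
As a rough illustration of that claim, here is a minimal sketch on synthetic data (not the
test cases mentioned above). The packet size, packet count, 99% ratio, random payloads, and
the use of zlib are all assumptions chosen for the example; it compares only the payload
bytes and leaves the length field out.

```python
import random
import zlib

FULL_LEN = 64        # assumed full packet size in bytes (hypothetical)
N_PACKETS = 10_000   # assumed number of packets (hypothetical)

rng = random.Random(0)
packets = []
for _ in range(N_PACKETS):
    # About 99% of packets are full length; the rest are shorter.
    n = FULL_LEN if rng.random() < 0.99 else rng.randrange(1, FULL_LEN)
    packets.append(rng.getrandbits(8 * n).to_bytes(n, "little"))

# Ragged layout: payloads back to back, no padding.
unpadded = b"".join(packets)
# Padded layout: every payload zero-filled up to FULL_LEN.
padded = b"".join(p.ljust(FULL_LEN, b"\0") for p in packets)

unpadded_c = len(zlib.compress(unpadded))
padded_c = len(zlib.compress(padded))
overhead = (padded_c - unpadded_c) / unpadded_c

print(f"raw:        unpadded={len(unpadded)}  padded={len(padded)}")
print(f"compressed: unpadded={unpadded_c}  padded={padded_c}")
print(f"relative overhead of padding after compression: {overhead:.3%}")
```

Random payloads are close to a worst case for the compressor, so the absolute sizes say
nothing about real packet data; only the difference between the two layouts is of interest.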

I've tried use_dictionary=False and that does help some.

But I'll post an updated example to back these statements up and see how much better I can
get.

I'm just surprised that HDF5 does a better job in this case; maybe I don't understand the
constraints the file format imposes on data compression.
