arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Best way to store ragged packet data in Parquet files
Date Fri, 13 Nov 2020 06:44:51 GMT
>
> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely add anything once you factor compression into the mix.


Are you writing the data out as fixed-size byte arrays or as variable-length
binary data?

On Tue, Nov 3, 2020 at 1:26 PM Jason Sachs <jmsachs@gmail.com> wrote:

>
>
> On 2020/11/03 20:49:46, Micah Kornfield <emkornfield@gmail.com> wrote:
> > Hi Jason,
> > At least as a first pass I would try to avoid the padding and storing the
> > length separately in Parquet.  Using one column for timestamp and one
> > column of bytes for the data is what I would try first.  If there is any
> > structure to the packets splitting them into the structure could also
> > help.
> >
> > -Micah
>
> For the test cases I have, >99% of the packets are the same length, so
> there's little-to-no benefit of removing the padding; the length field and
> zero padding barely add anything once you factor compression into the mix.
>
> I've tried use_dictionary=False and that does help some.
>
> But I'll post an updated example to back these statements up and see how
> much better I can get.
>
> I'm just surprised that hdf5 does a better job in this case; maybe I don't
> understand the constraints the file format imposes on data compression.
>
