arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Delta encoding in Apache Arrow or Parquet
Date Mon, 16 Nov 2020 23:06:49 GMT
I believe byte_stream_split encoding is supported now in C++ (at least
reading is; I would need to double-check the writing side).
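
For anyone unfamiliar with the encoding, here is a rough pure-Python sketch of the byte-splitting idea as the Parquet spec describes it: byte k of every value goes into stream k, and the streams are concatenated. This is not the Arrow C++ implementation, and the helper names byte_stream_split and byte_stream_join are made up for this illustration. The split does not shrink the data by itself; it only arranges similar bytes next to each other so that a later compressor has an easier job.

```python
import struct

def byte_stream_split(values, fmt="f"):
    """Scatter byte k of every value into stream k, then concatenate the streams.

    Hypothetical helper for illustration; fmt is a struct code ("f" or "d").
    """
    width = struct.calcsize("<" + fmt)
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    return b"".join(raw[k::width] for k in range(width))

def byte_stream_join(data, fmt="f"):
    """Inverse of byte_stream_split: interleave the streams back into values."""
    width = struct.calcsize("<" + fmt)
    n = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        # Stream k occupies bytes [k*n, (k+1)*n) of the encoded buffer.
        raw[k::width] = data[k * n:(k + 1) * n]
    return list(struct.unpack(f"<{n}{fmt}", bytes(raw)))

values = [1.0, 2.5, -3.25]          # all exactly representable as float32
encoded = byte_stream_split(values)
decoded = byte_stream_join(encoded)
print(encoded.hex(), decoded)
```

The encoded buffer has the same length as the input; any size reduction comes from the compression codec applied afterwards.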

On Mon, Nov 16, 2020 at 3:05 PM Andrew Lamb <alamb@influxdata.com> wrote:

> For what it is worth, when we were testing with timeseries data (which also
> has many sequential values that are very close in absolute value), the Parquet
> BYTE_STREAM_SPLIT[1] encoding was also quite effective (20% better
> compression). However, this wasn't supported in C++ (and thus not supported
> in Pandas) at that time.
>
> [1]
> https://github.com/apache/parquet-format/blob/ee02ef8c8f33bd3d5ed0582ded7e20439e12d933/Encodings.md#byte-stream-split-byte_stream_split--9
>
> On Mon, Nov 16, 2020 at 5:01 PM Jason Sachs <jmsachs@gmail.com> wrote:
>
>> ah .. got it.
>>
>> Thanks, I found
>> https://github.com/apache/parquet-format/blob/ee02ef8c8f33bd3d5ed0582ded7e20439e12d933/Encodings.md
>>
>> On 2020/11/16 20:33:38, Micah Kornfield <emkornfield@gmail.com> wrote:
>> > Delta encoding hasn't been implemented in the C++ code that pyarrow
>> > binds to.  It is supported in the Parquet specification.
>> >
>> > On Mon, Nov 16, 2020 at 12:30 PM Jason Sachs <jmsachs@gmail.com> wrote:
>> >
>> > > Does Arrow / Parquet have any support for delta encoding?
>> > >
>> > > Some data series compress better when their differences are stored
>> > > rather than the values themselves.
>> > >
>> > > Here's an example where the differences are mostly equal to 7 but
>> > > occasionally more:
>> > >
>> > > import numpy as np
>> > > import pyarrow as pa
>> > > import pyarrow.parquet as pq
>> > >
>> > > N = 500000
>> > > delta_r = np.full(N,7)
>> > > np.random.seed(123)
>> > > for _ in range(10):
>> > >     delta_r[np.random.randint(N,size=N//100)] += 1
>> > > r = np.cumsum(delta_r)
>> > > drcheck = np.diff(r,prepend=0)
>> > > assert (delta_r == drcheck).all()
>> > >
>> > > a = pa.array(r)
>> > > adiff = pa.array(delta_r)
>> > > t = pa.Table.from_arrays([a],['r'])
>> > > tdiff = pa.Table.from_arrays([adiff],['delta_r'])
>> > > pq.write_table(t,'t.pq')
>> > > pq.write_table(tdiff,'tdiff.pq')
>> > >
>> > > =====
>> > >
>> > > and when I look at the resulting files:
>> > >
>> > > -rw-rw-rw-   1 user     group     2591101 Nov 16 13:29 t.pq
>> > > -rw-rw-rw-   1 user     group       81049 Nov 16 13:29 tdiff.pq
>> > >
>> > >
>> >
>>
>
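
The effect described in the original question can be reproduced without pyarrow. The following is a minimal sketch using only the Python standard library, with zlib standing in for whatever codec a Parquet writer would apply. The helper names (delta_encode, delta_decode, compressed_size) and the choice of 64-bit little-endian packing are assumptions made for this illustration, not part of Arrow or Parquet, and this is plain differencing rather than the spec's DELTA_BINARY_PACKED encoding.

```python
import random
import struct
import zlib

def delta_encode(values):
    """Return the first value followed by successive differences."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Rebuild the original series with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def compressed_size(ints):
    """Pack as little-endian int64 and return the zlib-compressed length."""
    raw = struct.pack(f"<{len(ints)}q", *ints)
    return len(zlib.compress(raw))

random.seed(123)
N = 50_000
# Steps are mostly 7 and occasionally 8, similar in spirit to the example above.
steps = [7 + (1 if random.random() < 0.1 else 0) for _ in range(N)]
series = delta_decode(steps)        # cumulative sum of the steps
deltas = delta_encode(series)       # recovers the steps

size_values = compressed_size(series)
size_deltas = compressed_size(deltas)
print("values:", size_values, "bytes; deltas:", size_deltas, "bytes")
```

The deltas take only two distinct values, so their packed bytes are far more repetitive than those of the steadily growing series, which is why the compressor does much better on them.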
