arrow-user mailing list archives

From: Micah Kornfield <emkornfi...@gmail.com>
Subject: Re: Delta encoding in Apache Arrow or Parquet
Date: Mon, 16 Nov 2020 20:33:38 GMT

Delta encoding hasn't been implemented in the C++ code that pyarrow binds
to.  It is supported in the Parquet specification.
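
In the meantime, the differencing can be done by hand, as the example quoted
below already does with delta_r. Here is a minimal sketch of that round trip,
under some assumptions: delta_encode and delta_decode are hypothetical helper
names (not pyarrow or Parquet APIs), and zlib stands in for whatever
compression codec the file writer would apply, so the byte counts will not
match the Parquet file sizes.

```python
import zlib

import numpy as np


def delta_encode(values):
    """Return first differences; the first element is differenced against 0."""
    return np.diff(values, prepend=0)


def delta_decode(deltas):
    """Invert delta_encode with a cumulative sum."""
    return np.cumsum(deltas)


# Build a series whose steps are mostly 7 and occasionally 8.
rng = np.random.default_rng(123)
N = 500_000
deltas = np.full(N, 7, dtype=np.int64)
deltas[rng.integers(0, N, size=N // 100)] += 1
values = np.cumsum(deltas)

# Encoding recovers the steps, and decoding recovers the original series.
encoded = delta_encode(values)
assert np.array_equal(encoded, deltas)
assert np.array_equal(delta_decode(encoded), values)

# Compare how well a general-purpose compressor does on each representation.
raw_size = len(zlib.compress(values.tobytes()))
delta_size = len(zlib.compress(encoded.tobytes()))
print("compressed raw bytes:  ", raw_size)
print("compressed delta bytes:", delta_size)
```

The same idea applies to the Parquet files: write the delta column, then
apply np.cumsum to it after reading to get the original values back.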

On Mon, Nov 16, 2020 at 12:30 PM Jason Sachs <jmsachs@gmail.com> wrote:

> Does Arrow / Parquet have any support for delta encoding?
>
> Some data series compress better when their differences are stored rather
> than the values themselves.
>
> Here's an example where the differences are mostly equal to 7 but
> occasionally more:
>
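> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> N = 500000
> # Differences: 7 everywhere, then in each of 10 rounds add 1 at
> # about N/100 randomly chosen positions.
> delta_r = np.full(N, 7)
> np.random.seed(123)
> for _ in range(10):
>     delta_r[np.random.randint(N, size=N//100)] += 1
> # Running total: the series as it would normally be stored.
> r = np.cumsum(delta_r)
> # Sanity check: differencing r recovers delta_r.
> drcheck = np.diff(r, prepend=0)
> assert (delta_r == drcheck).all()
>
> # Write the raw series and its differences to separate Parquet files.
> a = pa.array(r)
> adiff = pa.array(delta_r)
> t = pa.Table.from_arrays([a], ['r'])
> tdiff = pa.Table.from_arrays([adiff], ['delta_r'])
> pq.write_table(t, 't.pq')
> pq.write_table(tdiff, 'tdiff.pq')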
>
> =====
>
> and when I look at the resulting files:
>
> -rw-rw-rw-   1 user     group     2591101 Nov 16 13:29 t.pq
> -rw-rw-rw-   1 user     group       81049 Nov 16 13:29 tdiff.pq
>
>