arrow-user mailing list archives

From: Jason Sachs <jmsa...@gmail.com>
Subject: Delta encoding in Apache Arrow or Parquet
Date: Mon, 16 Nov 2020 20:30:49 GMT

Does Arrow / Parquet have any support for delta encoding?

Some data series compress better when their differences are stored rather than the values
themselves.
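
To make the idea concrete, here is a minimal standard-library sketch of delta encoding and its effect on compressed size. The helper names (delta_encode, delta_decode, compressed_size) are hypothetical, and zlib is only a stand-in for whatever compression a file format applies; this is not how Arrow or Parquet implement anything.

```python
import random
import struct
import zlib
from itertools import accumulate


def delta_encode(values):
    """Return the first value followed by successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    return list(accumulate(deltas))


def compressed_size(ints):
    """Pack as little-endian int64 and return the zlib-compressed byte count."""
    raw = struct.pack(f"<{len(ints)}q", *ints)
    return len(zlib.compress(raw))


# Illustrative data (an assumption, not the series from this message):
# steps of 7, occasionally 8, accumulated into a steadily growing series.
random.seed(123)
steps = [7 + (1 if random.random() < 0.1 else 0) for _ in range(50_000)]
values = list(accumulate(steps))

deltas = delta_encode(values)
assert delta_decode(deltas) == values  # the encoding is lossless

print("raw values, compressed bytes:", compressed_size(values))
print("deltas, compressed bytes:    ", compressed_size(deltas))
```

The deltas take only two distinct values here, so a general-purpose compressor finds far more repetition in them than in the ever-increasing raw values.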

Here's an example where the differences are mostly equal to 7 but occasionally more:

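import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

N = 500000

# Start with a constant step of 7 between consecutive values.
delta_r = np.full(N, 7)
np.random.seed(123)
# Ten passes; each bumps about 1% of randomly chosen positions by 1
# (an index repeated within one pass is only incremented once).
for _ in range(10):
    delta_r[np.random.randint(N, size=N // 100)] += 1

# r is the series itself; delta_r holds its differences.
r = np.cumsum(delta_r)
# Sanity check: differencing r recovers delta_r exactly.
drcheck = np.diff(r, prepend=0)
assert (delta_r == drcheck).all()

# Write the series and its differences to separate Parquet files.
a = pa.array(r)
adiff = pa.array(delta_r)
t = pa.Table.from_arrays([a], ['r'])
tdiff = pa.Table.from_arrays([adiff], ['delta_r'])
pq.write_table(t, 't.pq')
pq.write_table(tdiff, 'tdiff.pq')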

=====

When I look at the resulting files, tdiff.pq is roughly 32 times smaller than t.pq:

-rw-rw-rw-   1 user     group     2591101 Nov 16 13:29 t.pq
-rw-rw-rw-   1 user     group       81049 Nov 16 13:29 tdiff.pq

