From: Micah Kornfield
Date: Mon, 16 Nov 2020 15:06:49 -0800
Subject: Re: Delta encoding in Apache Arrow or Parquet
To: user@arrow.apache.org

I believe byte_stream_split encoding is supported now in C++ (at least
reading is; I would need to double check on the writing).

On Mon, Nov 16, 2020 at 3:05 PM Andrew Lamb wrote:

> For what it is worth, when we were testing with timeseries data (that also
> has many sequential values that are very close in absolute value), the
> parquet BYTE_STREAM_SPLIT[1] encoding was also quite effective (20% better
> compression). However, this wasn't supported in C++ (and thus not supported
> in Pandas) at that time.
>
> [1]
> https://github.com/apache/parquet-format/blob/ee02ef8c8f33bd3d5ed0582ded7e20439e12d933/Encodings.md#byte-stream-split-byte_stream_split--9
>
> On Mon, Nov 16, 2020 at 5:01 PM Jason Sachs wrote:
>
>> ah .. got it.
>>
>> Thanks, I found
>> https://github.com/apache/parquet-format/blob/ee02ef8c8f33bd3d5ed0582ded7e20439e12d933/Encodings.md
>>
>> On 2020/11/16 20:33:38, Micah Kornfield wrote:
>> > Delta encoding hasn't been implemented in the C++ code that pyarrow
>> > binds to. It is supported in the Parquet specification.
>> >
>> > On Mon, Nov 16, 2020 at 12:30 PM Jason Sachs wrote:
>> >
>> > > Does Arrow / Parquet have any support for delta encoding?
>> > >
>> > > Some data series compress better when their differences are stored
>> > > rather than the values themselves.
>> > >
>> > > Here's an example where the differences are mostly equal to 7 but
>> > > occasionally more:
>> > >
>> > > import numpy as np
>> > > import pyarrow as pa
>> > > import pyarrow.parquet as pq
>> > >
>> > > N = 500000
>> > > delta_r = np.full(N,7)
>> > > np.random.seed(123)
>> > > for _ in range(10):
>> > >     delta_r[np.random.randint(N,size=N//100)] += 1
>> > > r = np.cumsum(delta_r)
>> > > drcheck = np.diff(r,prepend=0)
>> > > assert (delta_r == drcheck).all()
>> > >
>> > > a = pa.array(r)
>> > > adiff = pa.array(delta_r)
>> > > t = pa.Table.from_arrays([a],['r'])
>> > > tdiff = pa.Table.from_arrays([adiff],['delta_r'])
>> > > pq.write_table(t,'t.pq')
>> > > pq.write_table(tdiff,'tdiff.pq')
>> > >
>> > > =====
>> > >
>> > > and when I look at the resulting files:
>> > >
>> > > -rw-rw-rw-  1 user  group  2591101 Nov 16 13:29 t.pq
>> > > -rw-rw-rw-  1 user  group    81049 Nov 16 13:29 tdiff.pq
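For readers following the thread, the effect described in the original question can be reproduced without pyarrow. The block below is only a minimal standard-library sketch of plain delta encoding. It is not Parquet's DELTA_BINARY_PACKED encoding and not a pyarrow API. The helper names delta_encode, delta_decode and compressed_size are hypothetical, zlib stands in for whatever compression a Parquet writer might apply, and the series is a smaller one that resembles, but is not identical to, the data in the question.

```python
import random
import struct
import zlib


def delta_encode(values):
    """Return the differences between successive values (first delta is taken against 0)."""
    deltas = []
    prev = 0
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def compressed_size(ints):
    """Pack the integers as little-endian int64 and return the zlib-compressed size in bytes."""
    raw = struct.pack(f"<{len(ints)}q", *ints)
    return len(zlib.compress(raw))


# Assumed test data: steps that are mostly 7 and occasionally 8.
random.seed(123)
N = 50_000
steps = [7 + (1 if random.random() < 0.1 else 0) for _ in range(N)]

r = delta_decode(steps)      # cumulative series, analogous to np.cumsum(delta_r)
deltas = delta_encode(r)     # analogous to np.diff(r, prepend=0)

assert deltas == steps
assert delta_decode(deltas) == r

size_raw = compressed_size(r)
size_delta = compressed_size(deltas)
print("compressed bytes, raw values:", size_raw)
print("compressed bytes, deltas:    ", size_delta)
```

The deltas take only two distinct values, so a general-purpose compressor finds far more redundancy in them than in the steadily growing cumulative series, which is the same pattern as the t.pq versus tdiff.pq file sizes reported in the question.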