arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Selection of encoding scheme
Date Fri, 01 Nov 2019 15:41:44 GMT
hi Robin,

The only encodings supported currently in C++ via pyarrow are
dictionary encoding and plain encoding. If the dictionary grows too
large, then it "falls back" to plain encoding. More details here

http://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/
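The fallback behaviour described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Arrow C++ implementation: the function name `encode_column` and the entry limit `MAX_DICT_ENTRIES` are made up for this example, and a real writer bounds the dictionary by size in bytes and works page by page rather than on a whole column at once.

```python
# Hypothetical limit on distinct values, chosen only for this sketch.
MAX_DICT_ENTRIES = 4


def encode_column(values, max_entries=MAX_DICT_ENTRIES):
    """Dictionary-encode `values`, falling back to plain encoding.

    Returns ("dictionary", dictionary, indices) when the number of
    distinct values stays within `max_entries`, else ("plain", values).
    """
    dictionary = []   # distinct values in first-seen order
    positions = {}    # value -> index into `dictionary`
    indices = []      # one small integer per input value
    for v in values:
        idx = positions.get(v)
        if idx is None:
            if len(dictionary) >= max_entries:
                # Dictionary grew too large: store the values as-is.
                return ("plain", list(values))
            idx = len(dictionary)
            positions[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return ("dictionary", dictionary, indices)


# A low-cardinality column (like device_id within a chunk) fits the dictionary.
print(encode_column([0, 0, 0, 1, 1]))
# A column where every value is distinct exceeds the limit and falls back.
print(encode_column(range(10))[0])
```

Under these assumptions, a column with few distinct values is stored as a small dictionary plus integer indices, while a column with mostly unique values (such as timestamps or measurements) ends up plain-encoded.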

There are "V2" encodings that may be more efficient for your data, but
these need some implementation love to be made available. Note that
ParquetV2 files are not considered "production", so if you use these V2
encodings, your files may not be readable everywhere.
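To show why a delta-style encoding could suit timestamps that advance in fixed steps, here is a minimal sketch in plain Python. It is not the Parquet DELTA_BINARY_PACKED format, which packs deltas into blocks and miniblocks; the helper names `delta_encode`, `delta_decode`, and `run_length` are hypothetical and exist only for this illustration. The sample values assume the two-hour step seen in the quoted data.

```python
from datetime import datetime, timedelta, timezone


def delta_encode(values):
    """Return (first_value, deltas between consecutive values)."""
    if not values:
        return None, []
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original values from the first value and the deltas."""
    if first is None:
        return []
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def run_length(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


# Five timestamps for one device, two hours apart, as epoch seconds.
start = datetime(2016, 2, 18, 21, 1, 27, tzinfo=timezone.utc)
timestamps = [int((start + timedelta(hours=2 * i)).timestamp()) for i in range(5)]

first, deltas = delta_encode(timestamps)
print(run_length(deltas))                          # [(7200, 4)]
print(delta_decode(first, deltas) == timestamps)   # True
```

With a constant step, the whole series reduces to one starting value and a single repeated delta, which is the kind of regularity that plain encoding followed by general-purpose compression does not exploit directly.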

- Wes

On Fri, Nov 1, 2019 at 4:32 AM Robin Aly <robin.aly@nedap.com> wrote:
>
> Hi,
>
>
>
> I have a conceptual question about the selection of encoding schemes for parquet columns.
> Hopefully I didn’t miss this question in the archive.
>
>
>
> If I understand correctly, arrow implements “all” encoding schemes that parquet supports.
> But how are these selected for given data of a column/dataset? Is this selection data driven
> (test on a small subset)? Can I somehow influence the selection?
>
>
>
> Background: I am using python to store a pandas dataframe with relatively standard IoT
> data (device_id, timestamp, value).
>
>
>
> device_id           timestamp     value
>         0 2016-02-18 21:01:27  0.797649
>         0 2016-02-18 23:01:27  0.485878
>         0 2016-02-19 01:01:27  0.738183
>         0 2016-02-19 03:01:27  0.866196
>         0 2016-02-19 05:01:27  0.731805
>       ...                 ...       ...
>      9999 2016-04-17 08:49:21  0.794262
>      9999 2016-04-17 10:49:21  0.659690
>      9999 2016-04-17 12:49:21  0.885828
>      9999 2016-04-17 14:49:21  0.000009
>      9999 2016-04-17 16:49:21  0.805664
>
>
>
> I am surprised that pyarrow doesn’t choose the delta / rle encoding for timestamp as
> it is increasing in fixed deltas per device_id:
>
>
>
> row group 0
> --------------------------------------------------------------------------------
> device_id:  INT64 GZIP DO:0 FPO:4 SZ:156990/83620663/532.65 VC:10451833 [more]...
> timestamp:  INT64 GZIP DO:0 FPO:157081 SZ:54258488/83620743/1.54 VC:10451833 [more]...
> value:      DOUBLE GZIP DO:0 FPO:54415661 SZ:78769352/83620743/1.06 VC:10451833 [more]...
>
>
>
> Any help / pointers are welcome.
>
>
>
> Cheers
>
> Robin
