arrow-user mailing list archives

From Robin Aly <robin....@nedap.com>
Subject Selection of encoding scheme
Date Fri, 01 Nov 2019 09:31:55 GMT
Hi,

I have a conceptual question about the selection of encoding schemes for parquet columns.
Hopefully I didn’t miss this question in the archive.

If I understand correctly, Arrow implements “all” encoding schemes that Parquet supports.
But how is an encoding selected for the data of a given column/dataset? Is the selection data
driven (e.g. tested on a small subset)? Can I somehow influence the selection?
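
To make the question concrete, this is a minimal sketch of the kind of control I am hoping
for. It assumes pyarrow.parquet.write_table accepts a per-column encoding override
(I am not sure my version exposes anything like a column_encoding argument; the file name
and column names are just illustrative, taken from my data below):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pandas(df)  # df is the dataframe shown below

    # Hoped-for control, assuming write_table accepts these arguments:
    pq.write_table(
        table,
        "iot.parquet",           # illustrative path
        compression="gzip",
        use_dictionary=False,     # turn off dictionary encoding so other encodings can apply
        column_encoding={"timestamp": "DELTA_BINARY_PACKED"},  # assumption: per-column override
    )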

Background: I am using Python to store a pandas dataframe with relatively standard IoT data
(device_id, timestamp, value); a sketch of the writing code follows the preview below.

device_id           timestamp     value
        0 2016-02-18 21:01:27  0.797649
        0 2016-02-18 23:01:27  0.485878
        0 2016-02-19 01:01:27  0.738183
        0 2016-02-19 03:01:27  0.866196
        0 2016-02-19 05:01:27  0.731805
      ...                 ...       ...
     9999 2016-04-17 08:49:21  0.794262
     9999 2016-04-17 10:49:21  0.659690
     9999 2016-04-17 12:49:21  0.885828
     9999 2016-04-17 14:49:21  0.000009
     9999 2016-04-17 16:49:21  0.805664
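
The writing code is essentially just the following (the file name is illustrative, and the
gzip compression matches the metadata further down):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pandas(df)
    pq.write_table(table, "iot.parquet", compression="gzip")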

I am surprised that pyarrow doesn’t choose a delta / RLE encoding for timestamp, as it
increases in fixed deltas per device_id:

row group 0
--------------------------------------------------------------------------------
device_id:  INT64 GZIP DO:0 FPO:4 SZ:156990/83620663/532.65 VC:10451833 [more]...
timestamp:  INT64 GZIP DO:0 FPO:157081 SZ:54258488/83620743/1.54 VC:10451833 [more]...
value:      DOUBLE GZIP DO:0 FPO:54415661 SZ:78769352/83620743/1.06 VC:10451833 [more]...
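
For completeness, the chosen encodings can also be inspected from pyarrow itself via the
file metadata (path again illustrative):

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("iot.parquet").metadata
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.compression, col.encodings)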

Any help / pointers are welcome.

Cheers
Robin