arrow-user mailing list archives

From Jason Sachs <jmsa...@gmail.com>
Subject Best way to store ragged packet data in Parquet files
Date Tue, 03 Nov 2020 20:41:34 GMT
(reposted here; I had posted to dev@arrow by mistake)

Hi all--

I've been getting started with Parquet as a storage alternative to HDF5, and it has a lot of
attractive qualities, including flexible and efficient compression.

But I'm stumped on how to get good storage efficiency in Parquet with one type of data that I have.

This is a large series of "ragged" packets arriving as a stream, where each packet consists
of up to 255 bytes of binary data. The vast majority of the packets have lengths between 96
and 112 bytes. I need to store each of them with a 64-bit timestamp.

I can get good storage efficiency with HDF5 using the following table schema in PyTables:

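import tables as pt

class StoredPacket(pt.IsDescription):
    timetick = pt.UInt64Col(pos=0)              # 64-bit timestamp
    length = pt.UInt16Col(pos=1)                # number of valid bytes in data
    data = pt.UInt8Col(pos=2,shape=(255,))      # payload, zero-padded to 255 bytes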

This stores packet data as an array of uint8 with length 255. I zero-pad the packet to length
255 and store the length as well in a separate column.
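
The padding scheme described above could be sketched as follows. This is a minimal illustration using only the standard library, not the code from the gist; the names `MAX_LEN`, `pad_packet`, and `unpad_packet` are hypothetical.

```python
from typing import Tuple

MAX_LEN = 255  # fixed width of the "data" column in the schema above


def pad_packet(payload: bytes) -> Tuple[int, bytes]:
    """Return (length, data) where data is payload zero-padded to MAX_LEN bytes."""
    if len(payload) > MAX_LEN:
        raise ValueError("packet longer than %d bytes" % MAX_LEN)
    return len(payload), payload.ljust(MAX_LEN, b"\x00")


def unpad_packet(length: int, data: bytes) -> bytes:
    """Recover the original payload from a padded row."""
    return data[:length]


# A payload that itself ends in a zero byte survives the round trip,
# which is why the length is kept in a separate column.
length, data = pad_packet(b"\xca\x01\x00")
original = unpad_packet(length, data)
```

Keeping the length explicit means trailing zero bytes that belong to the packet are not confused with padding.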

I have created a sample file in a GitHub gist: https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc
(see test1.h5) along with a Python script convert_test1.py that converts it to a pandas DataFrame
and stores it via Parquet. But the Parquet files are almost twice as large as the .h5 file
no matter what storage technique I use; brotli is best but slow, and zstd is almost as good
as brotli but much faster.

Any suggestions on how I might improve storage efficiency in Parquet? I have a lot of flexibility
with how I can store the data; my only requirement is that I can retrieve the data packets
quickly from the storage file. I offer this sample file as a test case.

(py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
Table overview:
       timetick  length                                               data
0            16      99  b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
1            32      99  b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
2            48      99  b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
3            64      99  b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
4            80      99  b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
..         ...     ...                                                ...
16413    262080      99  b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
16414    262096      99  b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
16415    262112      99  b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
16416    262128      99  b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
16417    262144      99  b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...

[16418 rows x 3 columns]

Packets with tags >= 128:
       timetick  length                                               data
179        2864      36  b"\xca'Twas brillig, and the slithy toves\x00\...
307        4896      35  b'\xca  Did gyre and gimble in the wabe:\x00\x...
340        5408      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
362        5744      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...
651       10352       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
1403      22368      32  b'\xca"Beware the Jabberwock, my son!\x00\x00\...
1741      27760      44  b'\xca  The jaws that bite, the claws that cat...
2115      33728      33  b'\xcaBeware the Jubjub bird, and shun\x00\x00...
2162      34464      30  b'\xca  The frumious Bandersnatch!"\x00\x00\x0...
2278      36304       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
2405      38320      34  b'\xcaHe took his vorpal sword in hand:\x00\x0...
2675      42624      41  b'\xca  Long time the manxome foe he sought --...
2896      46144      33  b'\xcaSo rested he by the Tumtum tree,\x00\x00...
3611      57568      31  b'\xca  And stood awhile in thought.\x00\x00\x...
4089      65200       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
5231      83456      36  b'\xcaAnd, as in uffish thought he stood,\x00\...
5236      83520      38  b'\xca  The Jabberwock, with eyes of flame,\x0...
5427      86560      40  b'\xcaCame whiffling through the tulgey wood,\...
6904     110176      26  b'\xca  And burbled as it came!\x00\x00\x00\x0...
7003     111744       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
7286     116256      44  b'\xcaOne, two! One, two! And through and thro...
8226     131280      39  b'\xca  The vorpal blade went snicker-snack!\x...
8370     133568      35  b'\xcaHe left it dead, and with its head\x00\x...
8849     141216      27  b'\xca  He went galumphing back.\x00\x00\x00\x...
10326    164832       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
11867    189472      37  b'\xca"And, has thou slain the Jabberwock?\x00...
12392    197856      35  b'\xca  Come to my arms, my beamish boy!\x00\x...
12936    206544      34  b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
13794    220256      26  b'\xca  He chortled in his joy.\x00\x00\x00\x0...
13905    222016       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
14690    234560      36  b"\xca'Twas brillig, and the slithy toves\x00\...
15317    244576      35  b'\xca  Did gyre and gimble in the wabe;\x00\x...
15840    252928      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
16339    260896      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...

(py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
-rw-rw-rw-   1 user     group      908773 Nov  2 13:07 test1.h5
-rw-rw-rw-   1 user     group     1611025 Nov  2 13:35 test1.pq

(py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
Opened "test1.h5" with sec2 driver.
/                        Group
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "GROUP"
    Attribute: PYTABLES_FORMAT_VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "2.1"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "1.0"
    Location:  1:96
    Links:     1
/data                    Group
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "GROUP"
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "1.0"
    Location:  1:1024
    Links:     1
/data/packets            Dataset {16418/Inf}
    Attribute: CLASS scalar
        Type:      5-byte null-terminated UTF-8 string
        Data:  "TABLE"
    Attribute: FIELD_0_FILL scalar
        Type:      native unsigned long long
        Data:  0
    Attribute: FIELD_0_NAME scalar
        Type:      8-byte null-terminated UTF-8 string
        Data:  "timetick"
    Attribute: FIELD_1_FILL scalar
        Type:      native unsigned short
        Data:  0
    Attribute: FIELD_1_NAME scalar
        Type:      6-byte null-terminated UTF-8 string
        Data:  "length"
    Attribute: FIELD_2_FILL scalar
        Type:      native unsigned char
        Data:  0
    Attribute: FIELD_2_NAME scalar
        Type:      4-byte null-terminated UTF-8 string
        Data:  "data"
    Attribute: NROWS scalar
        Type:      native long long
        Data:  16418
    Attribute: TITLE null
        Type:      1-byte null-terminated UTF-8 string

    Attribute: VERSION scalar
        Type:      3-byte null-terminated UTF-8 string
        Data:  "2.7"
    Location:  1:2216
    Links:     1
    Chunks:    {247} 65455 bytes
    Storage:   4350770 logical bytes, 899061 allocated bytes, 483.92% utilization
    Filter-0:  shuffle-2 OPT {265}
    Filter-1:  deflate-1 OPT {5}
    Type:      struct {
                   "timetick"         +0    native unsigned long long
                   "length"           +8    native unsigned short
                   "data"             +10   [255] native unsigned char
               } 265 bytes
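
The figures in the listings above can be cross-checked with a little arithmetic. The sketch below only uses numbers copied from the `ls -l` and `h5ls` output; the variable names are illustrative.

```python
# Numbers copied from the ls -l and h5ls output above.
rows = 16418
record_bytes = 8 + 2 + 255        # timetick + length + data = 265 bytes per row
allocated_bytes = 899061          # compressed dataset size reported by h5ls
chunk_rows = 247
h5_file_bytes = 908773
pq_file_bytes = 1611025

logical_bytes = rows * record_bytes       # 4350770, matches "logical bytes"
chunk_bytes = chunk_rows * record_bytes   # 65455, matches the chunk size

utilization = 100.0 * logical_bytes / allocated_bytes   # percent, as h5ls reports it
bytes_per_row = allocated_bytes / rows                  # average compressed bytes per packet
pq_to_h5 = pq_file_bytes / h5_file_bytes                # Parquet size relative to HDF5

print("utilization: %.2f%%" % utilization)
print("compressed bytes per row: %.1f" % bytes_per_row)
print("parquet / hdf5 size ratio: %.2f" % pq_to_h5)
```

So the HDF5 table compresses each 265-byte record to a small fraction of its logical size, which is the baseline the Parquet file needs to approach.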

