arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Best way to store ragged packet data in Parquet files
Date Tue, 03 Nov 2020 20:49:46 GMT
Hi Jason,
At least as a first pass I would try to avoid the padding and storing the
length separately in Parquet.  Using one column for timestamp and one
column of bytes for the data is what I would try first.  If there is any
structure to the packets splitting them into the structure could also help.

-Micah

On Tue, Nov 3, 2020 at 12:41 PM Jason Sachs <jmsachs@gmail.com> wrote:

> (reposted here; I had posted to dev@arrow by mistake)
>
> Hi all--
>
> I've been getting started with Parquet as a storage alternative to HDF5
> and it has a lot of attractive quantities including compression flexibility
> efficiency.
>
> But I'm stumped for storage efficiency in Parquet with one type of data
> that I have.
>
> This is a large series of "ragged" packets arriving as a stream, where
> each packet consists of up to 255 bytes of binary data. The vast majority
> of the packets have lengths between 96 and 112 bytes. I need to store each
> of them with a 64-bit timestamp.
>
> I can get a good storage efficiency with HDF5 with the following table
> schema using pytables:
>
> class StoredPacket(pt.IsDescription):
>     timetick = pt.UInt64Col(pos=0)
>     length = pt.UInt16Col(pos=1)
>     data = pt.UInt8Col(pos=2,shape=(255,))
>
> This stores packet data as an array of uint8 with length 255. I zero-pad
> the packet to length 255 and store the length as well in a separate column.
>
> I have created a sample file in a Github gist:
> https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see
> test1.h5) along with a Python script convert_test1.py that converts it to a
> Pandas DataFrame and stores it via Parquet. But the Parquet files are
> almost twice as large as the .h5 file no matter what storage technique I
> use; brotli is best but slow, and zstd is almost as good as brotli but much
> faster.
>
> Any suggestions on how I might improve storage efficiency in Parquet? I
> have a lot of flexibility with how I can store the data; my only
> requirement is that I can retrieve the data packets quickly from the
> storage file. I offer this sample file as a test case.
>
> (py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
> Table overview:
>        timetick  length                                               data
> 0            16      99  b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
> 1            32      99  b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
> 2            48      99  b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
> 3            64      99  b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
> 4            80      99  b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
> ..         ...     ...                                                ...
> 16413    262080      99  b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
> 16414    262096      99  b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
> 16415    262112      99  b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
> 16416    262128      99  b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
> 16417    262144      99  b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...
>
> [16418 rows x 3 columns]
>
> Packets with tags >= 128:
>        timetick  length                                               data
> 179        2864      36  b"\xca'Twas brillig, and the slithy toves\x00\...
> 307        4896      35  b'\xca  Did gyre and gimble in the wabe:\x00\x...
> 340        5408      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 362        5744      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...
> 651       10352       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 1403      22368      32  b'\xca"Beware the Jabberwock, my son!\x00\x00\...
> 1741      27760      44  b'\xca  The jaws that bite, the claws that cat...
> 2115      33728      33  b'\xcaBeware the Jubjub bird, and shun\x00\x00...
> 2162      34464      30  b'\xca  The frumious Bandersnatch!"\x00\x00\x0...
> 2278      36304       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 2405      38320      34  b'\xcaHe took his vorpal sword in hand:\x00\x0...
> 2675      42624      41  b'\xca  Long time the manxome foe he sought --...
> 2896      46144      33  b'\xcaSo rested he by the Tumtum tree,\x00\x00...
> 3611      57568      31  b'\xca  And stood awhile in thought.\x00\x00\x...
> 4089      65200       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 5231      83456      36  b'\xcaAnd, as in uffish thought he stood,\x00\...
> 5236      83520      38  b'\xca  The Jabberwock, with eyes of flame,\x0...
> 5427      86560      40  b'\xcaCame whiffling through the tulgey wood,\...
> 6904     110176      26  b'\xca  And burbled as it came!\x00\x00\x00\x0...
> 7003     111744       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 7286     116256      44  b'\xcaOne, two! One, two! And through and thro...
> 8226     131280      39  b'\xca  The vorpal blade went snicker-snack!\x...
> 8370     133568      35  b'\xcaHe left it dead, and with its head\x00\x...
> 8849     141216      27  b'\xca  He went galumphing back.\x00\x00\x00\x...
> 10326    164832       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 11867    189472      37  b'\xca"And, has thou slain the Jabberwock?\x00...
> 12392    197856      35  b'\xca  Come to my arms, my beamish boy!\x00\x...
> 12936    206544      34  b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
> 13794    220256      26  b'\xca  He chortled in his joy.\x00\x00\x00\x0...
> 13905    222016       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 14690    234560      36  b"\xca'Twas brillig, and the slithy toves\x00\...
> 15317    244576      35  b'\xca  Did gyre and gimble in the wabe;\x00\x...
> 15840    252928      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 16339    260896      31  b'\xca  And the mome raths outgrabe.\x00\x00\x...
>
> (py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
> -rw-rw-rw-   1 user     group      908773 Nov  2 13:07 test1.h5
> -rw-rw-rw-   1 user     group     1611025 Nov  2 13:35 test1.pq
>
> (py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
> Opened "test1.h5" with sec2 driver.
> /                        Group
>     Attribute: CLASS scalar
>         Type:      5-byte null-terminated UTF-8 string
>         Data:  "GROUP"
>     Attribute: PYTABLES_FORMAT_VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "2.1"
>     Attribute: TITLE null
>         Type:      1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "1.0"
>     Location:  1:96
>     Links:     1
> /data                    Group
>     Attribute: CLASS scalar
>         Type:      5-byte null-terminated UTF-8 string
>         Data:  "GROUP"
>     Attribute: TITLE null
>         Type:      1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "1.0"
>     Location:  1:1024
>     Links:     1
> /data/packets            Dataset {16418/Inf}
>     Attribute: CLASS scalar
>         Type:      5-byte null-terminated UTF-8 string
>         Data:  "TABLE"
>     Attribute: FIELD_0_FILL scalar
>         Type:      native unsigned long long
>         Data:  0
>     Attribute: FIELD_0_NAME scalar
>         Type:      8-byte null-terminated UTF-8 string
>         Data:  "timetick"
>     Attribute: FIELD_1_FILL scalar
>         Type:      native unsigned short
>         Data:  0
>     Attribute: FIELD_1_NAME scalar
>         Type:      6-byte null-terminated UTF-8 string
>         Data:  "length"
>     Attribute: FIELD_2_FILL scalar
>         Type:      native unsigned char
>         Data:  0
>     Attribute: FIELD_2_NAME scalar
>         Type:      4-byte null-terminated UTF-8 string
>         Data:  "data"
>     Attribute: NROWS scalar
>         Type:      native long long
>         Data:  16418
>     Attribute: TITLE null
>         Type:      1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type:      3-byte null-terminated UTF-8 string
>         Data:  "2.7"
>     Location:  1:2216
>     Links:     1
>     Chunks:    {247} 65455 bytes
>     Storage:   4350770 logical bytes, 899061 allocated bytes, 483.92%
> utilization
>     Filter-0:  shuffle-2 OPT {265}
>     Filter-1:  deflate-1 OPT {5}
>     Type:      struct {
>                    "timetick"         +0    native unsigned long long
>                    "length"           +8    native unsigned short
>                    "data"             +10   [255] native unsigned char
>                } 265 bytes
>
>

Mime
View raw message