orc-user mailing list archives

From David Phillips <da...@acz.org>
Subject Compressed data format for LZ4
Date Thu, 08 Feb 2018 05:41:26 GMT
We previously reserved an LZ4 CompressionKind and plan to implement it in
the Presto reader and writer. Unlike Snappy, the LZ4 format does not
record the uncompressed length. Thus, when reading, we need to allocate an
output buffer that is the full compressionBlockSize. This can waste a lot
of memory when there are many streams and many open readers.

We propose to prefix the LZ4 block with the uncompressed size. I see a few
ways of doing it:

1) Variable length integer, the same as Snappy.
2) Fixed 3-byte integer, little-endian.
3) Fixed 4-byte integer, little-endian.

Option #1 is more complicated, uses more CPU to decode, and probably
doesn't save much space; buffers starting at 16kB will use 3 bytes.
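
To make the 3-byte claim concrete, here is a minimal sketch of a little-endian base-128 varint, the kind of encoding Snappy uses for its uncompressed length. The function names are hypothetical and are not taken from ORC or Presto.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint.

    Each byte carries 7 payload bits; the high bit is set on every byte
    except the last. (Hypothetical helper, for illustration only.)
    """
    if n < 0:
        raise ValueError("size must be non-negative")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)  # final byte, continuation bit clear
    return bytes(out)


def varint_len(n: int) -> int:
    """Number of bytes the varint encoding of n occupies."""
    return len(encode_varint(n))
```

Two bytes hold at most 14 bits, so any size below 16kB fits in 2 bytes, while 16kB itself (2^14) and the 256kB writer cap both need 3.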

Option #2 restricts the maximum size to 16MB - 1 byte. That is far larger
than any reasonable per-stream buffer, and current writers cap the buffer
size at 256kB, so the limit shouldn't matter in practice, but it's worth
calling out here.
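
As a rough sketch of what option #2 could look like, assuming the prefix sits directly in front of the LZ4 block bytes (the names below are hypothetical, not an existing ORC or Presto API):

```python
from typing import Tuple

# Largest value a 3-byte unsigned integer can hold: 16MB - 1 byte.
MAX_UNCOMPRESSED_SIZE = (1 << 24) - 1


def encode_size_prefix(uncompressed_size: int) -> bytes:
    """Return the 3-byte little-endian prefix for an uncompressed size."""
    if not 0 <= uncompressed_size <= MAX_UNCOMPRESSED_SIZE:
        raise ValueError("uncompressed size does not fit in 3 bytes")
    return uncompressed_size.to_bytes(3, "little")


def decode_size_prefix(block: bytes) -> Tuple[int, bytes]:
    """Split a prefixed block into (uncompressed size, LZ4 payload)."""
    if len(block) < 3:
        raise ValueError("block is too short to contain a size prefix")
    size = int.from_bytes(block[:3], "little")
    return size, block[3:]
```

With the size known up front, a reader could allocate an output buffer of exactly that many bytes rather than the full compressionBlockSize.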

Option #3 is flexible but in practice will waste a byte.

My vote is for option #2.
