orc-user mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: Compressed data format for LZ4
Date Thu, 08 Feb 2018 05:58:23 GMT
Hi David,
   In general this is probably better on dev@orc, but this works. ORC-77
(62fe9504b) implemented the LZ4 codec using airlift. The structure is the
same as the other codecs and it always uses a 3 byte header (#2).

.. Owen
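
For context, here is a minimal sketch of the generic 3-byte chunk header as I understand it from the ORC specification: the value (chunkLength * 2 + isOriginal) is stored little-endian, where isOriginal marks a chunk that was stored uncompressed. The function names are hypothetical and are not part of any ORC API.

```python
MAX_CHUNK_LENGTH = (1 << 23) - 1  # 23 bits remain once the low flag bit is used


def encode_chunk_header(chunk_length: int, is_original: bool) -> bytes:
    """Pack (chunk_length * 2 + is_original) into 3 little-endian bytes."""
    if not 0 <= chunk_length <= MAX_CHUNK_LENGTH:
        raise ValueError("chunk length does not fit in a 3-byte header")
    return (chunk_length * 2 + int(is_original)).to_bytes(3, "little")


def decode_chunk_header(header: bytes) -> tuple[int, bool]:
    """Return (chunk_length, is_original) from the first 3 bytes."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)
```

Under this reading, a chunk that compressed to 100,000 bytes gets the header 0x40 0x0d 0x03, and 5 bytes stored uncompressed get 0x0b 0x00 0x00.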

On Wed, Feb 7, 2018 at 9:41 PM, David Phillips <david@acz.org> wrote:

> We previously reserved an LZ4 CompressionKind and plan to implement it in
> the Presto reader and writer. Unlike Snappy, the LZ4 format does not
> record the uncompressed length. Thus, when reading, we need to allocate an
> output buffer that is the full compressionBlockSize. This can waste a lot
> of memory when there are many streams and many open readers.
> We propose to prefix the LZ4 block with the uncompressed size. I see a few
> ways of doing it:
> 1) Variable length integer, the same as Snappy.
> 2) Fixed 3-byte integer, little-endian.
> 3) Fixed 4-byte integer, little-endian.
> Option #1 is more complicated, uses more CPU to decode, and probably
> doesn't save much space; buffers starting at 16kB will use 3 bytes.
> Option #2 restricts the maximum size to 16MB - 1 byte. This is
> ridiculously large for a per-stream buffer, and current writers cap the
> buffer size at a reasonable 256kB, so it shouldn't be a problem in
> practice, but it's worth calling out here.
> Option #3 is flexible but in practice will waste a byte.
> My vote is for option #2.
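
The three prefix options can be compared with a small sketch. This is only an illustration of the encodings as described above, not code from Presto or ORC; all function names are hypothetical. The varint is the usual 7-bits-per-byte unsigned encoding that Snappy uses for its length preamble.

```python
def encode_varint(n: int) -> bytes:
    """Option 1: unsigned varint, 7 payload bits per byte, low bits first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed); needs a loop and a branch per byte."""
    value = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("truncated varint")


def encode_fixed_le(n: int, width: int) -> bytes:
    """Options 2 and 3: fixed-width little-endian (width 3 or 4).

    int.to_bytes raises OverflowError if n does not fit in `width` bytes.
    """
    return n.to_bytes(width, "little")


def decode_fixed_le(buf: bytes, width: int) -> int:
    """Decode a fixed-width little-endian prefix with no branching."""
    return int.from_bytes(buf[:width], "little")


if __name__ == "__main__":
    for size in (16 * 1024 - 1, 16 * 1024, 256 * 1024, (1 << 24) - 1):
        print(size, len(encode_varint(size)), "varint bytes")
```

The varint only saves a byte for buffers under 16kB, and the 3-byte fixed form tops out at 2^24 - 1 bytes (16MB - 1), which matches the limits called out for options #1 and #2.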
