arrow-dev mailing list archives

From "Julien Le Dem (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
Date Wed, 26 Oct 2016 16:18:58 GMT

    [ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15608905#comment-15608905
] 

Julien Le Dem commented on ARROW-300:
-------------------------------------

I'm thinking that we don't really need to compress each buffer independently and compression
could be just an encapsulation at the transport level. It sounds like we don't want to exchange
compressed buffers in memory (without sending them on the wire/disk).

In Parquet, columns can be decompressed independently because they can be retrieved independently.
In Arrow, the entire RecordBatch corresponds to a request and will be compressed and
decompressed in full every time, which means we can just compress the entire batch together.
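
A minimal sketch of what compressing the whole batch at the transport level could look like,
assuming a batch is just a list of byte buffers. The length-prefixed framing and the function
names (pack_batch, send_compressed, ...) are hypothetical and are not the Arrow IPC layout;
zlib is used only because it ships with Python.

```python
import struct
import zlib


def pack_batch(buffers):
    """Concatenate buffers, each preceded by a 4-byte little-endian length.

    Hypothetical framing for illustration only, not the Arrow IPC format.
    """
    return b"".join(struct.pack("<I", len(buf)) + buf for buf in buffers)


def unpack_batch(payload):
    """Split a payload produced by pack_batch back into its buffers."""
    buffers = []
    offset = 0
    while offset < len(payload):
        (size,) = struct.unpack_from("<I", payload, offset)
        offset += 4
        buffers.append(payload[offset:offset + size])
        offset += size
    return buffers


def send_compressed(buffers):
    """Compress the whole batch in one shot, right before it hits the wire/disk."""
    return zlib.compress(pack_batch(buffers))


def receive_compressed(wire_bytes):
    """Decompress the whole batch in one shot and recover the individual buffers."""
    return unpack_batch(zlib.decompress(wire_bytes))
```

With this shape the in-memory buffers are never compressed; only the bytes handed to the
transport are, so no compression setting has to be carried per buffer.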

For simplicity I'd vote to not have compression in the Schema metadata.
https://github.com/apache/arrow/blob/2f84493371bd8fae30b8e042984c9d6ba5419c5f/format/Message.fbs#L186
That's one less thing to worry about for implementors.

We can have compression at the transport level (RPC, file format, ...).
As for the supported compressors, I would vote for SNAPPY and GZIP (zlib) to start with, as
they provide the 2 options you describe (higher compression or higher throughput), and SNAPPY is easier
to use from Java than lz4.
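
A rough way to see the compression-versus-throughput trade-off, sketched with zlib compression
levels as a stand-in for the two codec choices (SNAPPY has no binding in the Python standard
library, so it is not used here). The function name compare_levels and the sample data are
hypothetical.

```python
import time
import zlib


def compare_levels(data, levels=(1, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib level.

    Level 1 favours speed and level 9 favours compression ratio; this is only
    an analogy for picking a fast codec versus a high-ratio codec.
    """
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Make sure the round trip is lossless before recording the result.
        assert zlib.decompress(compressed) == data
        results[level] = (len(compressed), elapsed)
    return results


if __name__ == "__main__":
    sample = b"arrow record batch " * 10000
    for level, (size, seconds) in compare_levels(sample).items():
        print(f"level={level} size={size} bytes time={seconds:.6f}s")
```

Actual sizes and timings depend on the data and the machine, so they should be measured on
representative batches rather than assumed.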

> [Format] Add buffer compression option to IPC file format
> ---------------------------------------------------------
>
>                 Key: ARROW-300
>                 URL: https://issues.apache.org/jira/browse/ARROW-300
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>
> It may be useful, if data is to be sent over the wire, to compress the data buffers themselves
> as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer compression setting
> in the file Footer. Probably the only two compressors worth supporting out of the box would be
> zlib (higher compression ratios) and lz4 (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
