Date: Tue, 15 Nov 2016 19:16:58 +0000 (UTC)
From: "Uwe L. Korn (JIRA)"
To: dev@arrow.apache.org
Reply-To: dev@arrow.apache.org
Subject: [jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format
archived-at: Tue, 15 Nov 2016 19:17:05 -0000

[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667983#comment-15667983 ]

Uwe L.
Korn edited comment on ARROW-300 at 11/15/16 7:16 PM:
-------------------------------------------------------------

Given my latest (sadly internal) performance tests, I'm not so sure about the benefit of a compressed Arrow file format. For me the main distinction is that Parquet provides efficient storage (with the tradeoff of not being able to randomly access a single row) while Arrow provides random access, both for columnar data. The one point where I see an Arrow file format as beneficial is where you need random access to its data but cannot load it fully into RAM and instead use a memory-mapped file. If you add compression (either column-wise or at the whole-file level), you cannot memory-map it anymore.

The only point where I can see that columnar compression for Arrow batches is better than compression on the whole batch layer is that it actually produces better compression behaviour. This means that compression on a per-column basis can be parallelised independently of the underlying algorithm, thus leading to better CPU usage. Furthermore, the compression may be better if done on a column level (with a sufficient number of rows), as the data inside a column is very similar, thus leading to smaller compression dictionaries and better compression ratios at the end. Both things mentioned are just assumptions that should be tested before being implemented.

was (Author: xhochy):

Given my latest (sadly internal) performance tests, I'm not so sure about the benefit of a compressed Arrow file format. For me the main distinction is that Parquet provides efficient storage (with the tradeoff of not being able to randomly access a single row) while Arrow provides random access, both for columnar data. The one point where I see an Arrow file format as beneficial is where you need random access to its data but cannot load it fully into RAM and instead use a memory-mapped file. If you add compression (either column-wise or at the whole-file level), you cannot memory-map it anymore.
The only point where I can see that columnar compression for Arrow batches is better than compression on the whole file layer is that it actually produces better compression behaviour. This means that compression on a per-column basis can be parallelised independently of the underlying algorithm, thus leading to better CPU usage. Furthermore, the compression may be better if done on a column level (with a sufficient number of rows), as the data inside a column is very similar, thus leading to smaller compression dictionaries and better compression ratios at the end. Both things mentioned are just assumptions that should be tested before being implemented.

> [Format] Add buffer compression option to IPC file format
> ---------------------------------------------------------
>
>                 Key: ARROW-300
>                 URL: https://issues.apache.org/jira/browse/ARROW-300
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>
> It may be useful if data is to be sent over the wire to compress the data buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer compression setting in the file Footer. Probably the only two compressors worth supporting out of the box would be zlib (higher compression ratios) and lz4 (better performance).
> What does everyone think?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
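
The comment above flags both the parallelism and the compression-ratio claims as assumptions that should be tested. A minimal sketch of how such a test might be set up is shown here, using only the Python standard library. The synthetic columns, the helper names (`make_columns`, `compress_whole_batch`, `compress_per_column`) and the choice of zlib are illustrative assumptions and not part of Arrow or of the proposal; real measurements would need actual Arrow buffers and representative data.

```python
import random
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_columns(num_rows=10_000, seed=0):
    """Build three synthetic column buffers (hypothetical stand-ins for Arrow buffers)."""
    rng = random.Random(seed)
    # 64-bit integers counting upwards
    ints = struct.pack(f"<{num_rows}q", *range(num_rows))
    # 64-bit floats with random values
    floats = struct.pack(f"<{num_rows}d", *(rng.random() for _ in range(num_rows)))
    # fixed-width labels drawn from a small set of values
    labels = b"".join(
        rng.choice([b"red  ", b"green", b"blue "]) for _ in range(num_rows)
    )
    return [ints, floats, labels]


def compress_whole_batch(columns):
    """Compress all column buffers as one concatenated blob."""
    return zlib.compress(b"".join(columns))


def compress_per_column(columns, max_workers=4):
    """Compress each column buffer independently, one task per column."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.compress, columns))


columns = make_columns()
whole = compress_whole_batch(columns)
per_column = compress_per_column(columns)

# Report sizes; which variant wins depends on the data, so no result is assumed here.
print("raw bytes:        ", sum(len(c) for c in columns))
print("whole-batch bytes:", len(whole))
print("per-column bytes: ", sum(len(c) for c in per_column))
```

Timing the two functions on realistic data, and repeating the run with other codecs, would be the natural next step before drawing any conclusion about CPU usage or ratios.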