Date: Wed, 25 Jan 2017 14:13:20 -0800
Subject: Re: Parquet tables with snappy compression
From: Gopal Vijayaraghavan
To: "user@hive.apache.org"

> Has there been any study of how much compressing Hive Parquet tables with snappy reduces storage space or simply the table size in quantitative terms?

http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet/20

Since SNAPPY is just LZ77, I would assume it would be useful in cases of Parquet leaves containing text with large common sub-chunks (like URLs or log data).

If you want to experiment with that corner case, the L_COMMENT field from TPC-H lineitem is a good compression-thrasher.

Cheers,
Gopal
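To make the "large common sub-chunks" point concrete, here is a rough sketch of a greedy LZ77-style pass with no entropy coding, used only to estimate output size. It is a toy model and not Snappy's actual format: MIN_MATCH, MATCH_COST, the example.com URLs and the random-text generator are all assumptions made up for illustration, and none of the data comes from TPC-H.

```python
import random
import string

MIN_MATCH = 4    # shortest repeat treated as a back-reference (toy assumption)
MATCH_COST = 3   # assumed bytes per (offset, length) pair; not Snappy's real encoding


def lz77_estimate(data: bytes) -> int:
    """Estimate the output size of a greedy LZ77 pass with no entropy coding."""
    last_seen = {}  # 4-byte sequence -> most recent position where it started
    out = 0
    i = 0
    n = len(data)
    while i < n:
        key = data[i:i + MIN_MATCH]
        cand = None
        if len(key) == MIN_MATCH:
            cand = last_seen.get(key)
            last_seen[key] = i
        if cand is None:
            out += 1          # no earlier copy found: emit one literal byte
            i += 1
            continue
        # Extend the match as far as the earlier copy keeps agreeing.
        length = MIN_MATCH
        while i + length < n and data[cand + length] == data[i + length]:
            length += 1
        out += MATCH_COST     # emit one back-reference for the whole match
        i += length
    return out


def compression_ratio(data: bytes) -> float:
    """Input size divided by estimated output size (data must be non-empty)."""
    return len(data) / lz77_estimate(data)


def make_url_lines(count: int) -> bytes:
    """Hypothetical URL column: long shared prefix, short varying suffix."""
    lines = [
        f"http://www.example.com/products/category/{i % 50}/item?id={i}\n"
        for i in range(count)
    ]
    return "".join(lines).encode("ascii")


def make_random_text(size: int, seed: int = 0) -> bytes:
    """Hypothetical low-repetition text: random lowercase letters and spaces."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choice(alphabet) for _ in range(size)).encode("ascii")


urls = make_url_lines(2000)
noise = make_random_text(len(urls))
print(f"URL-like text: {compression_ratio(urls):.2f}x")
print(f"random text:   {compression_ratio(noise):.2f}x")
```

Under this model the URL-like input should shrink far more than the random text, because nearly every line is covered by back-references to an earlier line. For real numbers, measure with the actual Snappy codec on real Parquet column data.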