Date: Wed, 6 Apr 2011 17:29:05 +0000 (UTC)
From: "Krishna Kumar (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Message-ID: <1515128625.38301.1302110945717.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1357645560.38297.1302110825842.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

[
https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016451#comment-13016451 ]

Krishna Kumar commented on HIVE-2097:
-------------------------------------

Comment hijacked from HIVE-2065:

He Yongqiang added a comment - 31/Mar/11 23:13

We examined column groups, sorting the data internally based on one column within a column group. (We did not try different compression codecs across column groups.) We tried this with 3-4 tables and saw ~20% storage savings on one table compared to the previous RCFile.

The main problem with this approach is that it is hard to find the correct, most efficient column group definitions.

For example, table tbl_1 has 20 columns, and the user can define:

col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;

This puts col_1, col_2, col_11 and col_13 into one column group and reorders that group by sorting on col_1 (0 is the first column in this column group). It puts col_3, col_4, col_15 and col_16 into another column group and reorders that group by sorting on col_4 (index 1 in that group). Finally, it puts all other columns into the default column group in their original order.

It should also be easy to allow different compression codecs for different column groups.

The main blocking issue for this approach is the need for a full set of utils to find the best column group definition.

> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
> Key: HIVE-2097
> URL: https://issues.apache.org/jira/browse/HIVE-2097
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor, Serializers/Deserializers
> Reporter: Krishna Kumar
> Assignee: Krishna Kumar
> Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas
>
> 1. More efficient serialization/deserialization based on type-specific and storage-specific knowledge.
>    For instance, storing sorted numeric values efficiently using some delta coding techniques
> 2. More efficient compression based on type-specific and storage-specific knowledge
>    Enable compression codecs to be specified based on types or individual columns
> 3. Reordering the on-disk storage for better compression efficiency.
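
The column group definition string quoted in the comment (groups separated by ';', each ending in ':<index of the sort column>') could be interpreted roughly as in the sketch below. This is only an illustration of the format as described in the comment, not Hive code: the names ColumnGroup, parse_column_groups and default_group are hypothetical, and the handling of the trailing ';' and of malformed input is an assumption.

```python
from typing import List, NamedTuple


class ColumnGroup(NamedTuple):
    """Hypothetical container for one parsed column group."""
    columns: List[str]
    sort_column: str  # column whose values drive the reordering of the group


def parse_column_groups(spec: str) -> List[ColumnGroup]:
    """Parse a spec such as 'col_1,col_2:0;col_3,col_4:1;'.

    Each ';'-separated part is '<comma-separated columns>:<sort index>',
    where the index is 0-based within that group (as in the comment).
    """
    groups = []
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # assumption: tolerate the trailing ';' seen in the example
        cols_text, sep, index_text = part.rpartition(":")
        if not sep:
            raise ValueError(f"missing ':<sort index>' in {part!r}")
        columns = [c.strip() for c in cols_text.split(",") if c.strip()]
        index = int(index_text)
        if not 0 <= index < len(columns):
            raise ValueError(f"sort index {index} out of range in {part!r}")
        groups.append(ColumnGroup(columns, columns[index]))
    return groups


def default_group(all_columns: List[str], groups: List[ColumnGroup]) -> List[str]:
    """Columns not named in any group, kept in their original order."""
    assigned = {c for g in groups for c in g.columns}
    return [c for c in all_columns if c not in assigned]


if __name__ == "__main__":
    spec = "col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;"
    parsed = parse_column_groups(spec)
    table_columns = [f"col_{i}" for i in range(1, 21)]  # tbl_1 has 20 columns
    for group in parsed:
        print(group.columns, "sorted by", group.sort_column)
    print("default group:", default_group(table_columns, parsed))
```

For the example string, the first group is sorted by col_1 and the second by col_4, which matches the description in the comment; the remaining twelve columns fall into the default group.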