Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C0D0AF53C for ; Thu, 28 Mar 2013 21:11:16 +0000 (UTC) Received: (qmail 98184 invoked by uid 500); 28 Mar 2013 21:11:15 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 98106 invoked by uid 500); 28 Mar 2013 21:11:15 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 97903 invoked by uid 500); 28 Mar 2013 21:11:15 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 97863 invoked by uid 99); 28 Mar 2013 21:11:15 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 28 Mar 2013 21:11:15 +0000 Date: Thu, 28 Mar 2013 21:11:15 +0000 (UTC) From: "Owen O'Malley (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-4244) Make string dictionaries adaptive in ORC MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657 ] Owen O'Malley commented on HIVE-4244: ------------------------------------- We should play with different values, but I was guessing the right cutover point for the heuristic was at a loading of 2 to 3 (50% to 33% distinct values). We aren't really going to know whether the heuristic is right or wrong unless we compare both encodings, which is much too expensive. By taking a good guess after looking at the start of the stripe, we can get good performance most of the time. > Make string dictionaries adaptive in ORC > ---------------------------------------- > > Key: HIVE-4244 > URL: https://issues.apache.org/jira/browse/HIVE-4244 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers > Reporter: Owen O'Malley > Assignee: Kevin Wilfong > > The ORC writer should adaptively switch between dictionary and direct encoding. I'd propose looking at the first 100,000 values in each column and decide whether there is sufficient loading in the dictionary to use dictionary encoding. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira