hive-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4244) Make string dictionaries adaptive in ORC
Date Thu, 28 Mar 2013 21:11:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657 ]

Owen O'Malley commented on HIVE-4244:
-------------------------------------

We should play with different values, but my guess is that the right cutover point for the heuristic
is at a loading of 2 to 3 values per dictionary entry (50% to 33% distinct values).

We aren't really going to know whether the heuristic is right or wrong unless we compare both
encodings, which is much too expensive. By taking a good guess after looking at the start
of the stripe, we can get good performance most of the time.
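
To make the heuristic concrete, here is a minimal sketch in Python. It is not the ORC writer's code. The function names, the "dictionary"/"direct" labels, and the default threshold of 2.0 (the low end of the 2 to 3 range guessed above) are assumptions for illustration. The 100,000-value sample size comes from the issue description. Loading is taken to mean values seen per distinct value.

```python
from itertools import islice
from typing import Hashable, Iterable


def dictionary_loading(sample: list) -> float:
    """Average number of values per distinct value in the sample."""
    if not sample:
        return 0.0
    return len(sample) / len(set(sample))


def choose_encoding(
    values: Iterable[Hashable],
    sample_size: int = 100_000,   # from the issue description
    min_loading: float = 2.0,     # assumed cutover; the comment suggests 2 to 3
) -> str:
    """Guess an encoding from the first `sample_size` values only."""
    sample = list(islice(values, sample_size))
    if dictionary_loading(sample) >= min_loading:
        return "dictionary"
    return "direct"


if __name__ == "__main__":
    repeated = ["us", "uk", "de"] * 1000            # loading = 1000
    unique = [f"user-{i}" for i in range(3000)]     # loading = 1
    print(choose_encoding(repeated))
    print(choose_encoding(unique))
```

The decision reads only a prefix of the column, so it costs far less than encoding the stripe both ways, at the price of guessing wrong when the start of the stripe is unrepresentative.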
                
> Make string dictionaries adaptive in ORC
> ----------------------------------------
>
>                 Key: HIVE-4244
>                 URL: https://issues.apache.org/jira/browse/HIVE-4244
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Kevin Wilfong
>
> The ORC writer should adaptively switch between dictionary and direct encoding. I'd propose
> looking at the first 100,000 values in each column and deciding whether there is sufficient
> loading in the dictionary to use dictionary encoding.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
