hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-7074) The reducer parallelism should be a prime number for better stride protection
Date Fri, 16 May 2014 11:15:51 GMT
Gopal V created HIVE-7074:
-----------------------------

             Summary: The reducer parallelism should be a prime number for better stride protection
                 Key: HIVE-7074
                 URL: https://issues.apache.org/jira/browse/HIVE-7074
             Project: Hive
          Issue Type: Improvement
          Components: Statistics
            Reporter: Gopal V
            Assignee: Gopal V
         Attachments: HIVE-7074.1.patch

The current Hive reducer parallelism estimate can cause stride issues in key distribution.

For example, a JOIN that generates only even key values gets strided onto only some of the
reducers when the reducer count is also even.

The probability of distribution skew is controlled by the number of common factors shared
by the hash code of the key and the number of buckets (reducers).
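
The effect can be illustrated with a small sketch. It assumes hash-modulo partitioning of
the form (hash & Integer.MAX_VALUE) % reducers, as Hadoop's HashPartitioner does; the class
and method names (StrideDemo, reducersUsed, stridedHashes) are hypothetical and are not part
of Hive or of the attached patch.

```java
import java.util.HashSet;
import java.util.Set;

final class StrideDemo {
    // Counts how many of `reducers` partitions receive at least one key,
    // assuming non-negative hash modulo reducer count as the partition function.
    static int reducersUsed(int[] keyHashes, int reducers) {
        Set<Integer> used = new HashSet<>();
        for (int h : keyHashes) {
            used.add((h & Integer.MAX_VALUE) % reducers);
        }
        return used.size();
    }

    // Hypothetical input: `count` hash codes that are all multiples of `stride`.
    static int[] stridedHashes(int stride, int count) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = i * stride;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] evens = stridedHashes(2, 1000);
        // 10 shares the factor 2 with every key, so only half the reducers get data.
        System.out.println(reducersUsed(evens, 10)); // prints 5
        // 11 is prime and shares no factor with the stride, so all reducers get data.
        System.out.println(reducersUsed(evens, 11)); // prints 11
    }
}
```

With enough distinct keys, the number of reducers that receive data is
reducers / gcd(stride, reducers), which is why a shared factor leaves reducers idle.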

Using a prime number for the estimated reducer count significantly reduces that probability,
because a prime shares a factor only with hash codes that are multiples of that prime.
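
One way to apply this is to round the estimated reducer count to a nearby prime. The sketch
below is an illustration only and is not taken from HIVE-7074.1.patch; the names
(PrimeReducers, isPrime, primeAtOrBelow) are hypothetical, and rounding down rather than up
is an assumption made so that the result never exceeds the original estimate.

```java
final class PrimeReducers {
    // Trial division; adequate for reducer counts, which are small integers.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Returns the largest prime <= estimate, or 1 when the estimate is below 2.
    static int primeAtOrBelow(int estimate) {
        for (int n = estimate; n >= 2; n--) {
            if (isPrime(n)) {
                return n;
            }
        }
        return 1;
    }

    public static void main(String[] args) {
        System.out.println(primeAtOrBelow(10));   // prints 7
        System.out.println(primeAtOrBelow(1000)); // prints 997
    }
}
```

A prime count does not remove skew entirely: keys whose hash codes are all multiples of the
chosen prime would still collapse onto a single reducer.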



--
This message was sent by Atlassian JIRA
(v6.2#6252)
