hive-dev mailing list archives

From "alex gemini (JIRA)" <>
Subject [jira] [Commented] (HIVE-3086) Skewed Join Optimization
Date Wed, 27 Jun 2012 03:37:44 GMT


alex gemini commented on HIVE-3086:

Consider a big table logs(userid,region,timestamps,url) with more than 10 billion records and a
medium-sized table users(userid,age) with 10 million records, and this query:
 select count(a.userid) from logs a, users b where a.userid = b.userid group by b.age;
Let's say ages 18-25 account for more than 50% of the total records, ages 40-60 for only 5%,
and ages 25-50 for the rest.
What counts as skewed always depends on the query. In this case the skewed key is age, so we can't
always assume the two tables are skewed on the join key, right?
Another example: select count(userid), to_date(timestamps,'YYYYMMDD'), age from logs where
timestamps > '2011-12-01' and timestamps < '2011-12-31' and age < 25 and age > 18.
Because of Christmas, the days from 2011-12-25 to 2011-12-31 may have more records than the other
days in this month (for the purpose of this discussion, this query assumes age is not skewed).
Since Hive uses hash partitioning, with, say, 6 reducers, 2011-12-24 and 2011-12-30 will go
to the same reducer, which makes one reducer process many more records than the others.
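
To make the reducer collision concrete, here is a minimal sketch in Python. It is only an illustration: the hash (day ordinal modulo the reducer count) is a stand-in chosen for readability, not Hive's actual hash function, and the per-day record weights are hypothetical. The names reducer_for and reducer_loads are made up for this example.

```python
from collections import Counter
from datetime import date, timedelta

NUM_REDUCERS = 6  # the reducer count used in the example above


def reducer_for(day, num_reducers=NUM_REDUCERS):
    """Assign a day to a reducer with a toy hash (NOT Hive's real hash)."""
    return day.toordinal() % num_reducers


def reducer_loads(records_per_day, num_reducers=NUM_REDUCERS):
    """Sum the records each reducer receives when days are hash partitioned."""
    loads = Counter()
    for day, count in records_per_day.items():
        loads[reducer_for(day, num_reducers)] += count
    return loads


# Hypothetical weights: 1 unit per ordinary day, 3 units per day from
# 2011-12-25 to 2011-12-31. These numbers are assumptions, not measurements.
december = [date(2011, 12, 1) + timedelta(days=i) for i in range(31)]
records_per_day = {d: (3 if d.day >= 25 else 1) for d in december}

loads = reducer_loads(records_per_day)
for reducer in sorted(loads):
    print("reducer", reducer, "load", loads[reducer])
```

With this toy hash, any two days that are a multiple of 6 days apart land on the same reducer, so 2011-12-24 and 2011-12-30 collide, as do 2011-12-25 and 2011-12-31. When two heavy days collide, that reducer carries more load than the rest even though the hash spreads the distinct keys evenly.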
> Skewed Join Optimization
> ------------------------
>                 Key: HIVE-3086
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
> During a join operation, if one of the columns has a skewed key, it can cause that particular
> reducer to become the bottleneck. The following feature will address it:


