hive-user mailing list archives

From Bejoy Ks <bejoy...@yahoo.com>
Subject Re: Hive Queries Performance Tuning - Map side joins, Map side aggregations, Partitioning/Clustering
Date Sun, 01 Apr 2012 21:34:38 GMT
Anand
     You can optimize pretty much all Hive queries, but which optimizations to apply depends
on the queries themselves. For example, Group By has its own specific optimizations. Sometimes
Distribute By comes in handy for optimizing a query, and skew joins are good for balancing
the reducer loads.
     Map joins are used if one of the tables involved in the join is small. For medium-sized
bucketed tables you can go for a bucketed map join (subject to some conditions on the number
of buckets and on how the bucketing columns relate to the join columns).
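
As a rough sketch of the points above, these are the Hive settings usually associated with each
optimization. The table names big_fact and small_dim and the column emp_name are hypothetical,
chosen only for illustration; defaults and exact behaviour vary by Hive version, so treat this
as a starting point rather than a recommended configuration.

```sql
-- Group By: aggregate partially in the mappers; the second setting adds an
-- extra stage that spreads skewed group-by keys across reducers.
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- Skew joins: process heavily skewed join keys separately to balance reducer load.
set hive.optimize.skewjoin=true;

-- Map join: let Hive convert a join automatically when one table is small enough...
set hive.auto.convert.join=true;

-- ...or request it explicitly with a hint (hypothetical tables and columns).
select /*+ MAPJOIN(d) */ f.emp_id, d.emp_name
from big_fact f join small_dim d on (f.emp_id = d.emp_id);

-- Bucketed map join: assumes both tables are bucketed on the join column and
-- that one table's bucket count is a multiple of the other's.
set hive.optimize.bucketmapjoin=true;
```

Running EXPLAIN on a query before and after changing a setting is one way to check whether the
plan actually changed.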

Regards
Bejoy KS


________________________________
 From: "Ladda, Anand" <lanand@microstrategy.com>
To: "user@hive.apache.org" <user@hive.apache.org> 
Sent: Sunday, April 1, 2012 11:59 PM
Subject: Hive Queries Performance Tuning - Map side joins, Map side aggregations, Partitioning/Clustering
 

 
I am trying to understand what options and settings are available to tune the performance
of Hive queries. I have seen the benefits of map-side joins and partitioning/clustering. However,
I have yet to see the impact map-side aggregation has on query performance. I tried running
the query below with and without map-side aggregation turned on and did not see much difference
in the execution times. The raw data in this partition is about 5.5 million. I am looking for some
pointers on what types of queries benefit from map-side aggregation.
 
 set hive.auto.convert.join=false;
 set hive.map.aggr=false;
Non-partitioned, non-clustered single table with a where clause on date and no map-side aggregation (400 secs):
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailrcfile a11 where order_date = '01-01-2008' group by a11.emp_id;

 set hive.map.aggr=true;
Non-partitioned, non-clustered single table with a where clause on date and map-side aggregation (390 secs):
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailrcfile a11 where order_date = '01-01-2008' group by a11.emp_id;
 
Also, is there any reason not to turn on map-side joins all the time? In my tests I have always
seen the performance either stay the same or improve with map-side joins turned on. Are there
any other parameters or Hive features that can help improve the performance of Hive queries?

Thanks
Anand