hive-user mailing list archives

From Saurabh Mishra <saurabhmishra.i...@outlook.com>
Subject RE: Hive Query Unable to distribute load evenly in reducers
Date Tue, 16 Oct 2012 05:53:29 GMT
If by using MapJoin you mean setting
set hive.auto.convert.join=true;
then I am already using this configuration, but to no avail... :(

Date: Tue, 16 Oct 2012 14:17:47 +0900
Subject: Re: Hive Query Unable to distribute load evenly in reducers
From: navis.ryu@nexr.com
To: user@hive.apache.org

How about using MapJoin?
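
For readers following the thread, the suggestion above can be sketched in HiveQL as shown below. This is only an illustration under assumptions: the aliases `a` and `b` and the two-table shape simplify the query quoted further down (its remaining joins are not shown in the thread), the SELECT list is a placeholder, and the size threshold is an example value rather than a recommendation.

```sql
-- Option 1: let Hive convert joins against small tables into map joins automatically.
set hive.auto.convert.join=true;
-- Tables below this size (in bytes) count as "small"; the value here is only an example.
set hive.mapjoin.smalltable.filesize=25000000;

-- Option 2: request a map join explicitly with a hint.
-- Only tableA, tableB, a.x and b.y come from the thread; everything else is a placeholder.
SELECT /*+ MAPJOIN(b) */ a.x, COUNT(*) AS cnt
FROM tableA a
JOIN tableB b ON (a.x = b.y)
GROUP BY a.x;
```

A map join performs the join inside the mappers, so the small table does not need to be shuffled to reducers. The GROUP BY still runs in reducers, so any skew in the grouping keys would remain.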

2012/10/16 Saurabh Mishra <saurabhmishra.iitg@outlook.com>

No, there is apparently no heavy skew. One more statistic I wanted to point out: the following
are the approximate table sizes in this 4 table join query:
tableA: 170 million (actual number; I am also exploding these records, so the number could
be much, much higher)
tableB: 15
tableC: 45
tableD: 45
tableE: 45
tableF: 14000

Also, I cannot put any filter condition on tableA; the situation does not permit it. :(
Kindly suggest some alternative solution or some Hive configuration to distribute the load
more evenly across the reducers.
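
One way to test the "no heavy skew" observation and to try Hive's skew-related settings is sketched below. The properties are standard Hive configuration properties, but the thread does not establish whether they help for this workload. The threshold value and the choice of `x` as the column to inspect (taken from the `a.x=b.y` join condition quoted below) are assumptions for illustration.

```sql
-- Check how tableA's rows are distributed over the join key.
SELECT x, COUNT(*) AS cnt
FROM tableA
GROUP BY x
ORDER BY cnt DESC
LIMIT 20;

-- Process heavily repeated join keys in a separate follow-up map join job.
set hive.optimize.skewjoin=true;
-- Keys with more rows than this are treated as skewed (example value).
set hive.skewjoin.key=100000;

-- Split GROUP BY into two MapReduce jobs, the first distributing rows randomly.
set hive.groupby.skewindata=true;
-- Partial aggregation on the map side reduces the data sent to reducers.
set hive.map.aggr=true;
```

Because tableA is exploded before the join, it may be worth repeating the distribution check on the exploded rows, since the key distribution can differ from that of the base table.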


> Date: Mon, 15 Oct 2012 16:29:56 +0100
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> From: philip.j.tromans@gmail.com
> To: user@hive.apache.org
> 
> Is your data heavily skewed towards certain values of a.x etc?
> 
> On 15 October 2012 15:23, Saurabh Mishra <saurabhmishra.iitg@outlook.com> wrote:
> > The queries are simple joins, something along the lines of
> > select a, b, c, count(D) from tableA join tableB on a.x=b.y join.... group
> > by a, b,c;
> >
> >
> >> From: liy099@gmail.com
> >> Date: Mon, 15 Oct 2012 21:10:39 +0800
> >> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> >> To: user@hive.apache.org
> >>
> >> And your queries were?
> >>
> >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
> >> <saurabhmishra.iitg@outlook.com> wrote:
> >> > Hi,
> >> > I am running some Hive queries that join tables containing up to 30 million
> >> > records each. Since the load on the reducers is very significant in these
> >> > cases, I specifically set the following parameters before executing the
> >> > queries:
> >> >
> >> > set mapred.reduce.tasks=100;
> >> > set hive.exec.reducers.bytes.per.reducer=500000000;
> >> > set hive.optimize.cp=true;
> >> >
> >> > The number of reducers the job spawns is now 160, but despite the high
> >> > number, most of the load remains on 1 or 2 reducers. Hence, in the final
> >> > statistics, 158 reducers completed within 2-3 minutes of starting and 2
> >> > reducers took 2 hrs to run.
> >> > Is there any way to overcome this load distribution disparity?
> >> > Any help in this regard will be highly appreciated.
> >> >
> >> > Sincerely
> >> > Saurabh Mishra