MIME-Version: 1.0
In-Reply-To: <BAY002-W1668D4255B67997907D4E1693710@phx.gbl>
References: <BAY002-W2333FF4F4BA5F9BE37E56FF93710@phx.gbl>
	<CAMwPWvn6oLKao86m6kNYn9MMzZF+Zh88Aa-UY6MreCji9MZ=4A@mail.gmail.com>
	<BAY002-W139F8F64118B40A49A7641A93710@phx.gbl>
	<CAJAzPKSAYpuHpizqgsH4feCv-Bvrg71Br8rtfz=updRvTu6yuQ@mail.gmail.com>
	<BAY002-W1668D4255B67997907D4E1693710@phx.gbl>
Date: Tue, 16 Oct 2012 14:17:47 +0900
Message-ID: 
 <CACQ46vHHqCViQisZ2oOuTgXqLjDuOtpMgifU+4gDgXe5ZpMQhw@mail.gmail.com>
Subject: Re: Hive Query Unable to distribute load evenly in reducers
From: =?EUC-KR?B?TmF2aXO3+b3Cv+w=?= <navis.ryu@nexr.com>
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=047d7b10c8c9519c6504cc264649

--047d7b10c8c9519c6504cc264649
Content-Type: text/plain; charset=ISO-8859-1

How about using MapJoin?
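
For example, here is a rough sketch of how that could look (not tested against your schema). Since tableB through tableF are tiny compared to tableA, they should fit in memory on each mapper, so the joins can be done map-side instead of shuffling all of tableA's rows to the reducers by join key. Only a.x=b.y comes from your query; the aliases, the other join columns and the selected columns below are placeholders I made up, so substitute your own.

```sql
-- Option 1: ask Hive to convert joins against small tables into map joins
-- automatically (whether this is already on by default depends on your version).
set hive.auto.convert.join=true;

-- Option 2: name the small tables explicitly with a MAPJOIN hint.
-- col_b, col_c and c_key are hypothetical column names used for illustration.
SELECT /*+ MAPJOIN(b, c) */
       a.x, b.col_b, c.col_c, count(*)
FROM tableA a
JOIN tableB b ON a.x = b.y
JOIN tableC c ON a.c_key = c.c_key
GROUP BY a.x, b.col_b, c.col_c;
```

Note that the GROUP BY still runs in a reduce phase, so this only removes the shuffle caused by the joins.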

2012/10/16 Saurabh Mishra <saurabhmishra.iitg@outlook.com>

> No, there is apparently no heavy skew. One more statistic I wanted to
> point out: the approximate table sizes in this 4-table join query are
> as follows:
> tableA: 170 million (actual number; I am also exploding these records,
> so the effective number could be much higher)
> tableB: 15
> tableC: 45
> tableD: 45
> tableE: 45
> tableF: 14000
>
> Also, I cannot put any filter condition on tableA; the situation does not
> permit it. :(
> Kindly suggest an alternative solution or a Hive configuration that
> distributes the load more evenly across the reducers.
>
> > Date: Mon, 15 Oct 2012 16:29:56 +0100
> > Subject: Re: Hive Query Unable to distribute load evenly in reducers
> > From: philip.j.tromans@gmail.com
> > To: user@hive.apache.org
> >
> > Is your data heavily skewed towards certain values of a.x etc?
> >
> > On 15 October 2012 15:23, Saurabh Mishra
> > <saurabhmishra.iitg@outlook.com> wrote:
> > > The queries are simple joins, something along the lines of
> > > select a, b, c, count(D) from tableA join tableB on a.x=b.y join....
> > > group by a, b, c;
> > >
> > >
> > >> From: liy099@gmail.com
> > >> Date: Mon, 15 Oct 2012 21:10:39 +0800
> > >> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> > >> To: user@hive.apache.org
> > >
> > >>
> > >> And your queries were?
> > >>
> > >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
> > >> <saurabhmishra.iitg@outlook.com> wrote:
> > >> > Hi,
> > >> > I am running some Hive queries that join tables containing up to
> > >> > 30 million records each. Since the load on the reducers is very
> > >> > significant in these cases, I set the following parameters before
> > >> > executing the queries:
> > >> >
> > >> > set mapred.reduce.tasks=100;
> > >> > set hive.exec.reducers.bytes.per.reducer=500000000;
> > >> > set hive.optimize.cp=true;
> > >> >
> > >> > The job now spawns 160 reducers, but despite the high number, most
> > >> > of the load remains on 1 or 2 reducers. In the final statistics,
> > >> > 158 reducers completed within 2-3 minutes of starting and 2
> > >> > reducers took 2 hours to run.
> > >> > Is there any way to overcome this load distribution disparity?
> > >> > Any help in this regard will be highly appreciated.
> > >> >
> > >> > Sincerely
> > >> > Saurabh Mishra
>

--047d7b10c8c9519c6504cc264649--