From: Namit Jain <njain@facebook.com>
To: hive-user@hadoop.apache.org
CC: Ryan LeCompte <lecompte@gmail.com>
Date: Tue, 10 Nov 2009 21:07:07 -0800
Subject: Re: Self join problem

I think you missed the attachment.


Which job is taking more time – the join or the group by?

Can you send the data characteristics for m1 and foo1 – is it possible that there is a large skew on aid and dt which is forcing the data to be sent to a single reducer?
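For example, a rough skew check (just a sketch, assuming the m1 schema quoted below) would be:

select aid, dt, count(1) as cnt
from m1
group by aid, dt
order by cnt desc
limit 20;

If a few (aid, dt) pairs account for most of the rows, the join on those keys will funnel almost everything through one reducer.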



-namit



On 11/10/09 6:35 PM, "Defenestrator" <defenestrationism@gmail.com> wrote:

I would definitely appreciate any insights on this from the list. I tried to reduce the query down to something that is easily understood, and hive still shows pretty poor join performance on a three-node hadoop cluster.

drop table m1;
drop table foo1;

create table m1 (
mid int,
aid int,
dt string);

LOAD DATA LOCAL INPATH 'm1' OVERWRITE INTO TABLE m1;

create table foo1 (
aid_1 int,
aid_2 int,
mid bigint,
dt bigint
);

set mapred.reduce.tasks=32;

insert overwrite table foo1
select m1.aid as aid_1, m2.aid as aid_2, count(1), m1.dt as dt
from m1 m1 join m1 m2 on m1.aid = m2.aid and m1.dt = m2.dt group by m1.aid, m2.aid, m1.dt;

Attached is the file I'm using, which only has 100k rows. I've looked at the benchmark (http://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf) and hive seems to be able to join much bigger data sets. I also tried running the same query on a single-node dbms on my desktop, and it returns results in less than 3 minutes, while hive has been running for at least 20 minutes now.

Thanks.

On Tue, Nov 10, 2009 at 3:53 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
Any thoughts on this? I've only had luck by reducing the data on each side of the join. Is this something Hive might be able to improve in a future release of the query plan optimizer?

Thanks,
Ryan



On Nov 3, 2009, at 10:55 PM, Ryan LeCompte <lecompte@gmail.com> wrote:

I've had a similar issue with a small cluster. Is there any way that you can reduce the size of the data being joined on both sides? If you search the forums for "join issue", you will see the thread for my issue and get some tips.

Thanks,
Ryan



On Nov 3, 2009, at 10:45 PM, Defenestrator <defenestrationism@gmail.com> wrote:

I was able to increase the number of reduce jobs manually to 32. However, it finishes 28 of them, and the other 4 have the same behavior of using 100% cpu and consuming a lot of memory. I'm suspecting that it might be an issue with the reduce job itself - is there a way to figure out what these jobs are doing exactly?

Thanks.

On Tue, Nov 3, 2009 at 6:53 PM, Namit Jain <njain@facebook.com> wrote:
The number of reducers is inferred from the input data size, but you can always override it by setting mapred.reduce.tasks.
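For example (an arbitrary value, purely to illustrate the setting):

set mapred.reduce.tasks=32;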

From: Defenestrator [mailto:defenestrationism@gmail.com]
Sent: Tuesday, November 03, 2009 6:46 PM
To: hive-user@hadoop.apache.org
Subject: Re: Self join problem

Hi Namit,


Thanks for your suggestion.


I tried changing the query as you had suggested by moving the m1.dt = m2.dt to the on clause. It increased the number of reduce jobs to 2. So now there are two processes running on two nodes at 100% cpu, consuming a lot of memory. Is there a reason why hive doesn't spawn more reduce jobs for this query?


On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain <njain@facebook.com> wrote:

Put the join condition in the on clause (Hive uses only the equality predicates in the on clause as join keys; the remaining predicates stay in the where clause):

insert overwrite table foo1 select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt where m1.id <> m2.id and m1.id < m2.id group by m1.id, m2.id, m1.dt;
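A quick way to sanity-check what Hive plans to do here (a sketch against the same tables) is to prefix the statement with explain and see whether the join and the group by land in separate map-reduce stages:

explain
insert overwrite table foo1 select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt where m1.id <> m2.id and m1.id < m2.id group by m1.id, m2.id, m1.dt;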

From: Defenestrator [mailto:defenestrationism@gmail.com]
Sent: Tuesday, November 03, 2009 4:44 PM
To: hive-user@hadoop.apache.org
Subject: Self join problem

Hello,


I'm trying to run the following query where m1 and m2 have the same data (>29M rows) on a 3-node hadoop cluster. I'm essentially trying to do a self join. It ends up running 269 map jobs and 1 reduce job. The map jobs complete, but the reduce job just runs in one process on one of the hadoop nodes at 100% cpu utilization and slowly increases in memory consumption. The reduce job never goes beyond 82% complete despite letting it run for a day.


I am running on 0.5.0 based on this morning's trunk.


insert overwrite table foo1

select m1.id as id_1, m2.id as id_2, count(1), m1.dt

from m1 join m2 where m1.id <> m2.id and m1.id < m2.id and m1.dt = m2.dt group by m1.id, m2.id, m1.dt;


Any input would be appreciated.


