hive-user mailing list archives

From Namit Jain <>
Subject Re: Self join problem
Date Wed, 11 Nov 2009 05:07:07 GMT
I think you missed the attachment.

Which job is taking more time - the join or the group by?

Can you send the data characteristics for m1 and foo1? Is it possible that there is a large
skew on aid and dt which is forcing the data to be sent to a single reducer?
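A quick way to check (just a sketch, assuming the raw data is already loaded into m1) is to count rows per (aid, dt) pair and look at the heaviest groups:

```sql
-- Diagnostic sketch: the heaviest (aid, dt) pairs dominate whichever
-- reducer receives them during the self join.
SELECT aid, dt, COUNT(1) AS cnt
FROM m1
GROUP BY aid, dt
ORDER BY cnt DESC
LIMIT 10;
```

If the top counts are much larger than the rest, a handful of reducers will do almost all of the work regardless of mapred.reduce.tasks.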


On 11/10/09 6:35 PM, "Defenestrator" <> wrote:

I would definitely appreciate any insights on this from the list.  I reduced the query
to something that is easily understood, and hive still demonstrates pretty poor join
performance on a three-node hadoop cluster.

drop table m1;
drop table foo1;

create table m1 (
mid int,
aid int,
dt string);


create table foo1 (
aid_1 int,
aid_2 int,
mid bigint,
dt bigint);

set mapred.reduce.tasks=32;

insert overwrite table foo1
select m1.aid as aid_1, m2.aid as aid_2, count(1), m1.dt as dt
from m1 m1 join m1 m2 on m1.aid = m2.aid and m1.dt = m2.dt group by m1.aid, m2.aid, m1.dt;
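Since the self join produces one output row for every pair of matching rows, each (aid, dt) group of size c contributes c * c rows to the join's intermediate result. A rough estimate of that size (a sketch against the m1 table above) is:

```sql
-- Sum of squared group counts = number of rows the self join on
-- (aid, dt) will generate before the group by collapses them.
SELECT SUM(c * c) AS estimated_join_rows
FROM (SELECT COUNT(1) AS c FROM m1 GROUP BY aid, dt) t;
```

If a single key has, say, 100k matching rows, it alone produces 10 billion intermediate rows on one reducer, which would match the symptoms described below.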

Attached is the file I'm using, which only has 100k rows.  I've looked at the benchmark,
and hive seems to be able to join much bigger data sets.  I also tried running the same query
on a single-node DBMS on my desktop, and it returns results in less than 3 minutes,
while hive has been running for at least 20 minutes now.


On Tue, Nov 10, 2009 at 3:53 PM, Ryan LeCompte <> wrote:
Any thoughts on this? I've only had luck by reducing the data on each side of the join. Is
this something Hive might be able to improve in a future release of the query plan optimizer?


On Nov 3, 2009, at 10:55 PM, Ryan LeCompte <> wrote:

I've had a similar issue with a small cluster. Is there any way you can reduce the size
of the data being joined on both sides? If you search the archives for my join issue, you will
find the thread and get some tips.


On Nov 3, 2009, at 10:45 PM, Defenestrator <> wrote:

I was able to increase the number of reduce jobs manually to 32.  However, it finishes 28
of them, and the other 4 have the same behavior of using 100% cpu and consuming a lot of memory.
 I suspect it might be an issue with the reduce job itself - is there a way to figure
out what these jobs are doing exactly?


On Tue, Nov 3, 2009 at 6:53 PM, Namit Jain <> wrote:
The number of reducers is inferred from the input data size, but you can always override
it by setting mapred.reduce.tasks.

From: Defenestrator [mailto:<>]
Sent: Tuesday, November 03, 2009 6:46 PM

To: <>
Subject: Re: Self join problem

Hi Namit,

Thanks for your suggestion.

I tried changing the query as you suggested by moving m1.dt = m2.dt to the on clause.
 It increased the number of reduce jobs to 2, so now there are two processes running on two
nodes at 100% cpu, consuming a lot of memory.  Is there a reason why hive doesn't spawn more
reduce jobs for this query?

On Tue, Nov 3, 2009 at 4:47 PM, Namit Jain <> wrote:

Put the join condition in the ON clause:

insert overwrite table foo1
select <><> as id_1, <><> as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt
where <><> <> <><> and <><> < <><>
group by <><>, <><>, m1.dt;
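One caveat (a sketch of the reasoning, not something verified against this data): only the equality predicates in the ON clause become shuffle keys, so with m1.dt = m2.dt alone, the number of reducers that receive any join input is bounded by the number of distinct dt values. A quick check, assuming m1 holds the data:

```sql
-- If this count is small, only that many reducers can get join input,
-- no matter how high mapred.reduce.tasks is set.
SELECT COUNT(DISTINCT dt) FROM m1;
```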

From: Defenestrator [mailto:<>]
Sent: Tuesday, November 03, 2009 4:44 PM
To: <>
Subject: Self join problem


I'm trying to run the following query, where m1 and m2 have the same data (>29M rows), on
a 3-node hadoop cluster.  I'm essentially trying to do a self join.  It ends up running 269
map jobs and 1 reduce job.  The map jobs complete, but the reduce job just runs as one process
on one of the hadoop nodes at 100% cpu utilization and slowly increases in memory consumption.
 The reduce job never goes beyond 82% complete, despite letting it run for a day.

I am running on 0.5.0 based on this morning's trunk.

insert overwrite table foo1
select <><> as id_1, <><> as id_2, count(1), m1.dt
from m1 join m2
where <><> <> <><> and <><> < <><> and m1.dt = m2.dt
group by <><>, <><>, m1.dt;
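A hedged guess at the cause, since I am reading the plan rather than the code: with no ON clause, Hive treats this as a cross join, and a cross join has no key to hash-partition on, so everything goes to a single reducer; the m1.dt = m2.dt equality in the WHERE clause is applied after the join rather than used as a shuffle key. A sketch of the rewrite (the column name "id" here is hypothetical, since the archive elided the originals):

```sql
-- Hypothetical column name "id": moving the equality predicate into
-- the ON clause gives Hive a key to hash-partition the shuffle on.
select m1.id as id_1, m2.id as id_2, count(1), m1.dt
from m1 join m2 on m1.dt = m2.dt
where m1.id <> m2.id
group by m1.id, m2.id, m1.dt;
```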

Any input would be appreciated.
