hive-user mailing list archives

From Mark Grover <mgro...@oanda.com>
Subject Re: Hive Reducers hanging - interesting problem - skew ?
Date Sun, 11 Dec 2011 21:53:38 GMT
Hi jS,
Sorry about the delayed response. Did you make any progress?

Based on my understanding, you are right. If Stage 2 has to process 2 big collections of
records, a map join may not work.

At this point, I will say 2 things though:
1) If you rely heavily on joins of big (and potentially skewed) collections, you may want
to rethink your design.
2) If you are happy with the design you presently have, you could look into bucketing and
sorting the data in the tables being joined, and then leverage something like a bucketed
map join or a sort-merge bucket (SMB) join.
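
A minimal HiveQL sketch of option 2, not a tested recipe. The table names A_bucketed and
B_bucketed, the column types, the payload columns, and the bucket count of 32 are all
assumptions made for illustration; only A, B, A.a and B.b come from this thread. The SET
properties are standard Hive settings for bucketed and sort-merge bucket joins.

```sql
-- Hypothetical bucketed, sorted copies of the two tables (schemas are assumed).
CREATE TABLE A_bucketed (a STRING, payload STRING)
CLUSTERED BY (a) SORTED BY (a ASC) INTO 32 BUCKETS;

CREATE TABLE B_bucketed (b STRING, payload STRING)
CLUSTERED BY (b) SORTED BY (b ASC) INTO 32 BUCKETS;

-- Make Hive honor the bucketing and sorting declared above when loading.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Assumes A and B expose the same columns as the bucketed copies.
INSERT OVERWRITE TABLE A_bucketed SELECT a, payload FROM A;
INSERT OVERWRITE TABLE B_bucketed SELECT b, payload FROM B;

-- Enable the bucketed map join and the sort-merge bucket variant.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Each mapper joins one bucket of x against the matching bucket of y.
SELECT /*+ MAPJOIN(y) */ x.*, y.*
FROM A_bucketed x JOIN B_bucketed y ON (x.a = y.b);
```

Both tables are bucketed on the join key with the same bucket count so that matching keys
land in matching buckets. Note that this changes how rows are brought together, not how
many rows the join produces.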

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgrover@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "john smith" <js1987.smith@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, December 6, 2011 7:01:26 AM
Subject: Re: Hive Reducers hanging - interesting problem - skew ?

Hi Mark, 


Thanks for your response. I tried skew optimization and I also watched the video by Lin and Namit.
From what I understand about skew join, instead of doing the join in a single pass, they divide it into 2 stages.



Stage 1 
Join the non-skew pairs and write the skew pairs into temporary files on HDFS. 


Stage 2 
Do a map join of those files by copying the smaller file into the mappers of the larger file. 
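
For reference, a sketch of how the two-stage skew join described above is typically switched
on in Hive. The threshold shown is Hive's documented default for hive.skewjoin.key, used here
purely for illustration; the thread does not say which value was actually used.

```sql
-- Turn on the runtime skew join optimization (off by default).
SET hive.optimize.skewjoin=true;

-- A join key seen in more than this many rows is treated as skewed and its
-- rows are set aside on HDFS for the follow-up map join stage.
-- 100000 is the Hive default, shown for illustration only.
SET hive.skewjoin.key=100000;

-- The query from the original post, unchanged.
SELECT * FROM A JOIN B ON (A.a = B.b);
```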


I have a doubt here. How can they be so sure that the map join works in stage 2? The files can
be so large that they do not fit into memory, making the join impossible. Am I wrong? 


I also ran the query with skew optimization enabled and, as expected, none of the pairs got joined
in stage 1 and all of them got written to HDFS. (They are huge.) 


Now in stage 2, Hive is trying to perform a map join on these large tables. My map
phase in stage 2 was stuck at 0.13% after 6 hours and 2 of my machines went down. I finally
had to kill the job. 


The size of each table is just 2GB, which is far smaller than what the Hadoop ecosystem can handle.



So is there any way I can join these tables in Hive? Any thoughts? 




Thanks, 
jS 





On Tue, Dec 6, 2011 at 3:39 AM, Mark Grover <mgrover@oanda.com> wrote: 


jS, 
Check out if this helps: 
http://search-hadoop.com/m/l1usr1MAHX32&subj=Re+Severely+hit+by+curse+of+last+reducer+








----- Original Message ----- 
From: "john smith" <js1987.smith@gmail.com> 
To: user@hive.apache.org 
Sent: Monday, December 5, 2011 4:38:14 PM 
Subject: Hive Reducers hanging - interesting problem - skew ? 

Hi list, 

I am trying to run a join query on my 10-node cluster. My query looks as follows: 

select * from A JOIN B on (A.a = B.b) 

size of A = 15 million rows 
size of B = 1 million rows 

The problem is that A.a and B.b each have only around 25-30 distinct values, which implies that
they have high selectivities and the reducers are bulky. 
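
To make the effect of so few distinct keys concrete, here is a hedged HiveQL sketch that counts
the rows per join key on each side and sums the per-key products, which is the number of rows
the join would emit. The aliases ca, cb, k and cnt are made up for illustration. As a rough
figure, if the 15 million and 1 million rows were spread evenly over about 30 keys, the total
would be about 15,000,000 x 1,000,000 / 30 = 5 x 10^11 output rows.

```sql
-- Rows per join key on each side; each key k contributes
-- (rows in A with a = k) * (rows in B with b = k) rows to the join output.
SELECT SUM(ca.cnt * cb.cnt) AS estimated_join_rows
FROM (SELECT a AS k, COUNT(*) AS cnt FROM A GROUP BY a) ca
JOIN (SELECT b AS k, COUNT(*) AS cnt FROM B GROUP BY b) cb
  ON (ca.k = cb.k);
```

This aggregate is cheap because each side collapses to at most 25-30 rows before the join.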

However, the performance hit is so horrible that ALL my reducers hang at 75% for 6 hours and
don't move any further. 

The only thing the log shows is lines like "Join operator - forwarding rows ---------------<Huge
number>" for all this time. What does this mean? 
There is no swapping happening and CPU usage stays constantly around 40% the whole time (observed
through Ganglia). 

Any way I can solve this problem? Can anyone help me with this? 

Thanks, 
jS 



