hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bejoy Ks <bejoy...@yahoo.com>
Subject Re: Hadoop error 2 while joining two large tables
Date Wed, 16 Mar 2011 18:43:18 GMT
Hey hadoop n00b
    I second Mark's thought. But definitely you can try out re framing your 
query to get things rolling. I'm not sure on your hive Query.But still, from my 
experience with joins on huge tables (record counts in the range of hundreds of 
millions) you should give join conditions with JOIN ON clause rather than 
specifying all conditions in WHERE.

Say if you have a query this way
SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b WHERE
a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2;

You can definitely re frame this query as
SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b 
ON (a.Column4=b.Column1 AND a.Column2=b.Column4)  WHERE a.Column3 > b.Column2;

From my understanding Hive supports equijoins so you can't have the inequality 
conditions there within JOIN ON, inequality should come to WHERE. This approach 
has worked for me when I encountered a similar situation as yours some time ago. 
Try this out,hope it helps.

Regards
Bejoy.K.S






________________________________
From: "Sunderlin, Mark" <mark.sunderlin@teamaol.com>
To: "user@hive.apache.org" <user@hive.apache.org>
Sent: Wed, March 16, 2011 11:22:09 PM
Subject: RE: Hadoop error 2 while joining two large tables

  
hadoop n00b asks, “Is adding more nodes the solution to such problem?”
 
Whatever else answers you get, you should append “ … and add more nodes.” More 
nodes is never a bad thing ;-)
 
---
Mark E. Sunderlin
Solutions Architect |AOL Data Warehouse
P: 703-256-6935 | C: 540-327-6222
AIM: MESunderlin
22000 AOL Way | Dulles, VA | 20166
 
 
From:hadoop n00b [mailto:new2hive@gmail.com] 
Sent: Wednesday, March 16, 2011 3:33 AM
To: user@hive.apache.org
Subject: Fwd: Hadoop error 2 while joining two large tables
 
Hello,
 
I am trying to execute a query that joins two large tables (3 million and 20 
million records). I am getting the Hadoop error code 2 during execution. This 
happens mainly while the reducers are running. Sometimes the reducers complete 
100%  and then the error comes. The logs talk about running out of Heap space 
and GC overhead limit exceeding.
 
I am running a 6 node cluster with child JVM memory of 1GB.
 
Are there any parameters I could tweak to make them run? Is adding more nodes 
the solution to such problem?
 
Thanks!


      
Mime
View raw message