hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From bejoy...@yahoo.com
Subject Re: Hadoop error 2 while joining two large tables
Date Thu, 17 Mar 2011 14:34:11 GMT
Try out CDH3b4 it has hive 0.7 and the latest of other hadoop tools. When you work with open
source it is definitely a good practice to upgrade those with latest versions. With newer
versions bugs would be minimal , performance would be better and you get more functionalities.
Your query looks fine an upgrade of hive could sort things out. 
Regards
Bejoy K S

-----Original Message-----
From: Edward Capriolo <edlinuxguru@gmail.com>
Date: Thu, 17 Mar 2011 08:51:05 
To: user@hive.apache.org<user@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: Re: Hadoop error 2 while joining two large tables

I am pretty sure the cloudera distro has an upgrade path to a more recent hive.

On Thursday, March 17, 2011, hadoop n00b <new2hive@gmail.com> wrote:
> Hello All,
>
> Thanks a lot for your response. To clarify a few points -
>
> I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of Hive as we
have to use Cloudera distro only.
>
> All records in the smaller table have at least one record in the larger table (of course
a few exceptions could be there but only a few).
>
> The join is using ON clause. The query is something like -
>
> select ...
> from
> (
>   (select ... from smaller_table)
>   join
>   (select from larger_table)
>   on (smaller_table.col = larger_table.col)
> )
>
> I will try out setting mapred.child.java.opts -Xmx to a higher value and let you know.
>
> Is there a pattern or rule of thumb to follow on when to add more nodes?
>
> Thanks again!
>
> On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong <swong@netflix.com> wrote:
>
>
>
> In addition, put the smaller table on the left-hand side of a JOIN:
>
> SELECT ... FROM small_table JOIN large_table ON ...
>
>
>
>
> From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
> Sent: Wednesday, March 16, 2011 11:43 AM
>
> To: user@hive.apache.org
> Subject: Re: Hadoop error 2 while joining two large tables
>
>
>
>
>
>
> Hey hadoop n00b
>     I second Mark's thought. But definitely you can try out re framing your query
to get things rolling. I'm not sure on your hive Query.But still, from my experience with
joins on huge tables (record counts in the range of hundreds of millions) you should give
join conditions with JOIN ON clause rather than specifying all conditions in WHERE.
>
> Say if you have a query this way
> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b WHERE
> a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2;
>
> You can definitely re frame this query as
> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b
> ON (a.Column4=b.Column1 AND a.Column2=b.Column4)  WHERE a.Column3 > b.Column2;
>
> From my understanding Hive supports equijoins so you can't have the inequality conditions
there within JOIN ON, inequality should come to WHERE. This approach has worked for me when
I encountered a similar situation as yours some time ago. Try this out,hope it helps.
>
> Regards
> Bejoy.K.S
>
>
>
>
>
>
>
>
> From: "Sunderlin, Mark" <mark.sunderlin@teamaol.com>
> To: "user@hive.apache.org" <user@hive.apache.org>
> Sent: Wed, March 16, 2011 11:22:09 PM
> Subject: RE: Hadoop error 2 while joining two large tables
>
>
>
>
> hadoop n00b asks, “Is adding more nodes the solution to such problem?”
>
> Whatever else answers you get, you should append “ … and add more nodes.” More
nodes is never a bad thing ;-)
>
>
> ---
> Mark E. Sunderlin
> Solutions Architect |AOL Data Warehouse
>
> P: 703-256-6935 | C: 540-327-6222
>
> AIM: MESunderlin
> 22000 AOL Way
Mime
View raw message