hadoop-common-user mailing list archives

From "Bejoy KS" <bejoy.had...@gmail.com>
Subject Re: Data locality of map-side join
Date Tue, 23 Oct 2012 06:21:56 GMT
Hi Sigurd

Map-side joins are implemented efficiently in Hive and Pig. What follows describes how Hive
implements them.

In a map-side join, the smaller dataset is first loaded into the DistributedCache. The larger
dataset is streamed as usual, while the smaller dataset is held in memory. For every record in
the larger dataset, a lookup is made in memory against the smaller dataset, and that is how the
join is performed.
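
To illustrate the idea only (this is not Hive's actual code), here is a minimal Python sketch
of the lookup-per-record pattern described above. The function names and the sample records are
made up for the example, and it assumes an inner join on a single key with the smaller dataset
fitting in memory.

```python
from collections import defaultdict


def build_lookup(small_records):
    """Index the smaller dataset by join key; done once, before streaming."""
    lookup = defaultdict(list)
    for key, value in small_records:
        lookup[key].append(value)
    return lookup


def map_side_join(large_records, lookup):
    """Stream the larger dataset and probe the in-memory index per record."""
    for key, value in large_records:
        # .get() avoids inserting empty entries for keys with no match
        for small_value in lookup.get(key, ()):
            yield key, value, small_value


# Hypothetical sample data: (join key, payload)
small = [(1, "dept-A"), (2, "dept-B")]
large = [(1, "alice"), (2, "bob"), (3, "carol"), (1, "dave")]

joined = list(map_side_join(large, build_lookup(small)))
```

Records of the larger dataset whose key has no match in the smaller one (key 3 here) are
dropped, which is inner-join behaviour. No shuffle or reduce step is needed because each
mapper holds the whole smaller dataset.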

In later versions of Hive, the framework itself determines which dataset is the smaller one.
In older versions, you can specify the smaller dataset using hints in the query.

Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
Date: Mon, 22 Oct 2012 22:29:15 
To: <user@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Data locality of map-side join

Hi guys,

I've been trying to figure out whether a map-side join using the 
join-package does anything clever regarding data locality with respect 
to at least one of the partitions to join. To be more specific, if I 
want to join two datasets and some partition of dataset A is larger than 
the corresponding partition of dataset B, does Hadoop account for this 
and try to ensure that the map task is executed on the datanode storing 
the bigger partition thus reducing data transfer (if the other partition 
does not happen to be located on that same datanode)? I couldn't 
determine either behavior from the source code, and I 
couldn't find any documentation about this detail.

Thanks for clarifying!