hadoop-common-user mailing list archives

From Ricky Ho <rickyphyl...@yahoo.com>
Subject RE: Reduce side join
Date Mon, 18 Oct 2010 15:49:10 GMT
It doesn't require the separation of data at the file level; it only requires that 
each record carry enough information to distinguish which source it came from.

You don't need "separate mappers" either; the same mapper can contain logic that 
distinguishes the record type.

To reduce the memory footprint, the data source with fewer records per key should 
arrive at the reducer first.  What you describe is a special case (1-to-M); there, 
the data source with non-duplicate keys (i.e., the primary-key side) should arrive 
first.

The reducer then only needs to hold the first set of records (the smaller one) in 
memory and can stream out the joined output as the rest arrive.

Compared to a non-optimized version, this reduces the memory footprint from 
O(M + N) to O(min(M, N)), where M and N are the numbers of duplicate-key records 
in each data source.  In your special 1-to-M case, it reduces O(N) to O(1).

This technique works even with a single reducer.  Of course, you should always 
challenge your design and ask why there is only one reducer.

Reduce-side join is the most straightforward approach and requires the least 
organization of your data, but it is not the most efficient.  There are other, 
more efficient joining techniques (map-side partition join, map-side partition 
merge join, semi-join, memcache join, etc.).

I wrote a blog post on Map/Reduce algorithms that includes various joining 
techniques here at ...


-----Original Message-----
From: Matthew John [mailto:tmatthewjohn1988@gmail.com] 
Sent: Monday, October 18, 2010 1:17 AM
To: common-user@hadoop.apache.org
Subject: Reduce side join
Hi all,
   I am working on a join operation using Hadoop. I came across the reduce-side
join in Hadoop: The Definitive Guide. As far as I understand, this technique
works as follows:
1) Read the two inputs using separate mappers and tag the records from each input
with different values, such that in the sort/shuffle phase the primary-key record
(the one with only a single instance per key) comes before the records carrying
the same foreign key.
2) In the reduce phase, read the required portion of the first record into a
variable and keep appending it to each of the records that follow.
My doubt is: is it fine if I have more than one set of input records (a primary
record followed by its foreign records) in the same reduce phase?
For example, will this technique work if I have just one reducer running?
Matthew John

