Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
MIME-Version: 1.0
Date: Mon, 18 Oct 2010 13:46:43 +0530
Message-ID: <AANLkTik2Z6YPurvKiSgER0_ZZYg5K=ev8gW87aiqqszT@mail.gmail.com>
Subject: Reduce side join
From: Matthew John <tmatthewjohn1988@gmail.com>
To: common-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=001636310761ef1d280492dfcb8e

--001636310761ef1d280492dfcb8e
Content-Type: text/plain; charset=ISO-8859-1

Hi all,

   I am working on a join operation using Hadoop. I came across the reduce-side
join in Hadoop: The Definitive Guide. As far as I understand, the technique
works as follows:

1) Read the two inputs using separate mappers and tag each input with a
different value, so that in the sort/shuffle phase the primary-key record
(only one record has a given key as its primary key) comes before the
records that carry the same key as a foreign key.

2) In the reduce phase, read the required portion of the first record into a
variable and append it to each of the records that follow.
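
To make the two steps above concrete, here is a minimal sketch of what I have
in mind. It is plain Java with no Hadoop classes, so it only simulates the
sort/shuffle ordering and the reducer's single pass. The class name, the tag
values 0 and 1, and the sample records are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
class ReduceSideJoinSketch {

    // Hypothetical tags: 0 marks the primary-key record, 1 marks a foreign-key record.
    static final int PRIMARY = 0;
    static final int FOREIGN = 1;

    // One tagged map output record: join key, source tag, payload.
    static final class Tagged {
        final String key;
        final int tag;
        final String value;

        Tagged(String key, int tag, String value) {
            this.key = key;
            this.tag = tag;
            this.value = value;
        }
    }

    // Step 1: sort by (key, tag) so the primary record leads each key group.
    // Step 2: walk the sorted records once, remembering the current primary value
    // and appending it to every foreign record that follows with the same key.
    static List<String> join(List<Tagged> mapOutput) {
        List<Tagged> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Comparator.comparing((Tagged t) -> t.key)
                              .thenComparingInt(t -> t.tag));

        List<String> joined = new ArrayList<>();
        String currentKey = null;
        String primaryValue = null;
        for (Tagged t : sorted) {
            if (!t.key.equals(currentKey)) {
                // New key group: forget the primary value of the previous group.
                currentKey = t.key;
                primaryValue = null;
            }
            if (t.tag == PRIMARY) {
                primaryValue = t.value;
            } else if (primaryValue != null) {
                joined.add(t.key + "\t" + primaryValue + "\t" + t.value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Two key groups (k1 and k2) arrive interleaved, as if from two mappers.
        List<Tagged> mapOutput = new ArrayList<>();
        mapOutput.add(new Tagged("k2", FOREIGN, "b1"));
        mapOutput.add(new Tagged("k1", PRIMARY, "A"));
        mapOutput.add(new Tagged("k1", FOREIGN, "a1"));
        mapOutput.add(new Tagged("k2", PRIMARY, "B"));
        mapOutput.add(new Tagged("k1", FOREIGN, "a2"));

        for (String line : join(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

In this sketch the remembered primary value is reset whenever the key changes,
which is the only thing that keeps two key groups apart when both pass through
the same loop.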

My question is:
Is it fine if more than one set of input records (a primary record
followed by its foreign records) is processed by the same reducer?
For example, will this technique work if I have just one reducer running?

Regards,

Matthew John

--001636310761ef1d280492dfcb8e--