Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-dev@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of samir.helpdoc@gmail.com
 designates 209.85.216.176 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <33849704.post@talk.nabble.com>
References: <32876784.post@talk.nabble.com>
	<33849704.post@talk.nabble.com>
Date: Wed, 16 May 2012 00:05:44 +0530
Message-ID: 
 <CAG4QEv4+SVGw=4mO2NV_R4A2sawG2-XOX1mVCDczJ=zcSktJqQ@mail.gmail.com>
Subject: Re: Hadoop - Distributed sorting
From: samir das mohapatra <samir.helpdoc@gmail.com>
To: common-dev@hadoop.apache.org
Content-Type: multipart/alternative; boundary=20cf3074b37e730d7304c0177837

--20cf3074b37e730d7304c0177837
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi,
  Steps to do this:
 1) Map: emit each number as the key (with a count of 1 as the value).
 2) Combiner: combine the counts locally over each chunk of the dataset.
 3) Reducer: combine the counts globally over the whole dataset. Hadoop
    sorts the keys before they reach the reducer, so with a single reducer
    the output comes out sorted.

Note: set the combiner and the reducer to the same class.
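
The steps above can be sketched in plain Java. This is only a simulation under stated assumptions: it does not use the Hadoop API, the class and method names (SortStepsSketch, map, sumByKey, run) are made up for illustration, and a TreeMap stands in for the sort-by-key that Hadoop performs between map and reduce. The same sumByKey method plays both the combiner and the reducer role, which is safe here because summing counts is associative and commutative.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the three steps; NOT the Hadoop API.
final class SortStepsSketch {

    // Step 1 (map): emit each number as a key with a count of 1.
    static List<Map.Entry<Integer, Integer>> map(int[] chunk) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        for (int n : chunk) {
            pairs.add(new AbstractMap.SimpleEntry<Integer, Integer>(n, 1));
        }
        return pairs;
    }

    // Steps 2 and 3: sum the counts per key. Used for both the combiner
    // and the reducer role. The TreeMap keeps keys in ascending order,
    // standing in for the framework's sort by key.
    static TreeMap<Integer, Integer> sumByKey(List<Map.Entry<Integer, Integer>> pairs) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run map + combine on every chunk, then reduce over all partial results.
    static TreeMap<Integer, Integer> run(int[][] chunks) {
        List<Map.Entry<Integer, Integer>> shuffled = new ArrayList<>();
        for (int[] chunk : chunks) {
            shuffled.addAll(sumByKey(map(chunk)).entrySet()); // combiner: local
        }
        return sumByKey(shuffled); // reducer: global
    }

    public static void main(String[] args) {
        int[][] chunks = { {150, 101, 150}, {200, 101, 100} };
        System.out.println(run(chunks)); // prints {100=1, 101=2, 150=2, 200=1}
    }
}
```

With more than one reducer, each reducer's output is sorted only within itself, so a globally sorted result would also need a partitioner that sends ordered key ranges to the reducers.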

Example:
  Let us assume that our data set (integers) is constrained between 100
and 200, and we have 5 files, each containing 1000 random integers in that
range (so a total of 5000 integers between 100 and 200). We read each file
into a Map, and then in the Reduce phase we produce a final Map which
contains the count of every integer. If we then walk the final Map in key
order and output it into a list data structure in the form of
<Integer, Count>, we have sorted all the data (see figure below).

Aside: in Java, you don't even have to come up with the data structure that
I am talking about. If you just use a TreeMap
<http://java.sun.com/javase/6/docs/api/index.html?java/util/TreeMap.html>
in the final Reduce phase, then all the keys (i.e. the data) are already
sorted, as long as the key type (e.g. String, Integer, etc.) implements the
Comparable
<http://java.sun.com/javase/6/docs/api/index.html?java/lang/Comparable.html>
interface. (Hadoop <http://hadoop.apache.org/> has something similar called
WritableComparable
<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableComparable.html>,
and I am using a TreeMap that takes Strings as keys in Reducer
<http://code.google.com/p/dalalstreet/source/browse/trunk/MapReduce/src/org/karticks/mapreduce/Reducer.java>.)
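
As a rough illustration of the example, the sketch below generates 5 in-memory "files" of 1000 random integers between 100 and 200, counts them in a TreeMap, and expands the <Integer, Count> pairs back into a sorted list. It is plain Java rather than Hadoop code, and everything in it is an assumption made for the sketch: the class name TreeMapSortSketch, the helper names, the fixed seed, and the use of arrays in place of real input files. Integer keys are used so that the ordering is numeric.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Counting sort with a TreeMap; a stand-alone sketch, not Hadoop code.
final class TreeMapSortSketch {

    // Build one hypothetical "file" of random integers in [100, 200].
    static int[] randomFile(Random rng, int size) {
        int[] data = new int[size];
        for (int i = 0; i < size; i++) {
            data[i] = 100 + rng.nextInt(101); // nextInt(101) yields 0..100
        }
        return data;
    }

    // Count every integer. TreeMap orders its keys through Integer's
    // Comparable implementation, so no explicit sort call is needed.
    static TreeMap<Integer, Integer> count(List<int[]> files) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int[] file : files) {
            for (int n : file) {
                counts.merge(n, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Expand the <Integer, Count> pairs into the fully sorted data.
    static List<Integer> expand(TreeMap<Integer, Integer> counts) {
        List<Integer> sorted = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sorted.add(e.getKey());
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed, chosen arbitrarily
        List<int[]> files = new ArrayList<>();
        for (int f = 0; f < 5; f++) {
            files.add(randomFile(rng, 1000));
        }
        TreeMap<Integer, Integer> counts = count(files);
        List<Integer> sorted = expand(counts);
        System.out.println("distinct keys: " + counts.size());
        System.out.println("total integers: " + sorted.size());
    }
}
```

Because the keys span at most 101 distinct values, the final map stays tiny no matter how many integers are read, which is what makes this counting approach cheap for a narrow key range.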


Thanks
   Samir
On Tue, May 15, 2012 at 11:31 PM, @dataElGrande <markydaley88@gmail.com> wrote:

>
> Check out Pentaho's how-tos when dealing with Hadoop, NoSQL, or anything
> big-data related: http://wiki.pentaho.com/display/BAD/How+To%27s
>
>
> madhu_sushmi wrote:
> >
> > Hi,
> > I need to implement distributed sorting using Hadoop. I am quite new to
> > Hadoop and I am getting confused. If I want to implement merge sort,
> > what should my map and reduce be doing? Should all the sorting happen
> > on the reduce side?
> >
> > Please help. This is an urgent requirement. Please guide me.
> >
> > Thanks,
> > Madhu
> >
>
> --
> View this message in context:
> http://old.nabble.com/Hadoop---Distributed-sorting-tp32876784p33849704.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
>
>

--20cf3074b37e730d7304c0177837--