hadoop-common-dev mailing list archives

From samir das mohapatra <samir.help...@gmail.com>
Subject Re: Hadoop - Distributed sorting
Date Tue, 15 May 2012 18:35:44 GMT
Steps to do this:
1) Map: only defines the key/value pair for each number.
2) Combiner: sorts locally over each chunk of the dataset.
3) Reducer: sorts globally over the whole dataset and writes the sorted
output.

Note: set the combiner and the reducer to the same class.
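
The three steps above can be sketched in plain Java, without the Hadoop API,
to show why one class can serve as both combiner and reducer. This is only a
minimal sketch under assumptions: the class and method names
(SortPhasesSketch, map, sumCounts, run) are hypothetical, the value emitted
by the map step is assumed to be a count of 1, and the in-memory list stands
in for Hadoop's shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map -> combine -> reduce steps (not Hadoop code).
class SortPhasesSketch {

    // Map step: emit each number as a key with an assumed value of 1.
    static List<Map.Entry<Integer, Integer>> map(int[] chunk) {
        List<Map.Entry<Integer, Integer>> out = new ArrayList<>();
        for (int n : chunk) {
            out.add(new AbstractMap.SimpleEntry<Integer, Integer>(n, 1));
        }
        return out;
    }

    // Combine and reduce step: sum the counts per key. Summing is associative
    // and commutative, so the same logic works locally and globally.
    static TreeMap<Integer, Integer> sumCounts(List<Map.Entry<Integer, Integer>> pairs) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    static TreeMap<Integer, Integer> run(int[][] chunks) {
        List<Map.Entry<Integer, Integer>> shuffled = new ArrayList<>();
        for (int[] chunk : chunks) {
            // Combiner: applied to the output of one chunk only.
            shuffled.addAll(sumCounts(map(chunk)).entrySet());
        }
        // Reducer: the same function, applied to all combined output.
        return sumCounts(shuffled);
    }

    public static void main(String[] args) {
        int[][] chunks = { {150, 101, 150}, {200, 101}, {120} };
        System.out.println(run(chunks)); // prints {101=2, 120=1, 150=2, 200=1}
    }
}
```

In a real job the framework sorts the keys between map and reduce, so the
reducer sees them in order; the TreeMap here only imitates that ordering.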

  Let us assume that our data set (integers) is constrained to the range 100
to 200, and that we have 5 files, each containing 1000 random integers in
that range (so a total of 5000 integers between 100 and 200). We read each
file into a Map, and then in the Reduce phase we produce a final Map which
contains the count of each integer. If we then sort the integers (the keys)
of the final Map and output them into a list data structure in the form of
<Integer, Count>, we have sorted all the data (see figure below).
Aside: in Java, you don’t even have to come up with the data structure that
I am talking about. If you just use a
TreeMap <http://java.sun.com/javase/6/docs/api/index.html?java/util/TreeMap.html>
in the final Reduce phase, then all the keys (i.e. data) are already sorted,
as long as the key type (e.g. String, Integer, etc.) implements the
Comparable interface.
Hadoop <http://hadoop.apache.org/> has something similar called
I am using a TreeMap that takes Strings as keys in
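
The counting idea in this paragraph can be tried out with a small plain-Java
sketch. It is an illustration under assumptions, not Hadoop code: the names
(TreeMapCountSort, countChunk, expand, sort) are hypothetical, the 5 files
are simulated as in-memory arrays, and the random seed is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Sorting by counting: one TreeMap per "file", merged into a final TreeMap.
class TreeMapCountSort {

    // Count the occurrences of each integer in one file (chunk).
    static TreeMap<Integer, Integer> countChunk(int[] chunk) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int n : chunk) {
            counts.merge(n, 1, Integer::sum);
        }
        return counts;
    }

    // Expand <Integer, Count> pairs into a list; TreeMap iterates its keys
    // in ascending order, so the list comes out sorted.
    static List<Integer> expand(TreeMap<Integer, Integer> counts) {
        List<Integer> sorted = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sorted.add(e.getKey());
            }
        }
        return sorted;
    }

    // Merge the per-file counts into the final map, then expand it.
    static List<Integer> sort(int[][] files) {
        TreeMap<Integer, Integer> total = new TreeMap<>();
        for (int[] file : files) {
            countChunk(file).forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return expand(total);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // arbitrary seed, for repeatability only
        int[][] files = new int[5][1000];
        for (int[] file : files) {
            for (int i = 0; i < file.length; i++) {
                file[i] = 100 + rnd.nextInt(101); // integer in 100..200
            }
        }
        List<Integer> sorted = sort(files);
        System.out.println(sorted.size() + " integers, from "
                + sorted.get(0) + " to " + sorted.get(sorted.size() - 1));
    }
}
```

This only pays off when the key range is small compared with the amount of
data, as in the 100 to 200 example, because the final map holds at most one
entry per distinct integer.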

On Tue, May 15, 2012 at 11:31 PM, @dataElGrande <markydaley88@gmail.com> wrote:

> Check out Pentaho's how-tos when dealing with Hadoop, NoSQL, or anything
> big-data related. http://wiki.pentaho.com/display/BAD/How+To%27s
> madhu_sushmi wrote:
> >
> > Hi,
> > I need to implement distributed sorting using Hadoop. I am quite new to
> > Hadoop and I am getting confused. If I want to implement merge sort, what
> > should my Map and Reduce be doing? Should all the sorting happen on the
> > reduce side?
> >
> > Please help. This is an urgent requirement. Please guide me.
> >
> > Thanks,
> > Madhu
> >
