hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Reduce Sort
Date Tue, 08 Apr 2008 15:53:17 GMT

There are two ways to do this.  Both of them assume that you have counted
the addresses using map-reduce and the results are in HDFS.

First, since the number of unique IP addresses is likely to be relatively
small, simply sorting the results with a conventional sort is probably as
good as it gets.  This will take just a few lines of scripting code:

   wget $base/your-results-directory/part-00000 --output-document=a
   wget $base/your-results-directory/part-00001 --output-document=b
   sort -k2,2nr a b > where-you-want-the-output

It would be convenient if there were a URL that would allow you to retrieve
the concatenation of a wild-carded list of files, but the method I show
above isn't bad.
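
For more than two part files, a shell glob can stand in for the missing
wild-card URL once the files are on local disk.  The following is only a
sketch under assumptions: the directory, the file contents, and the
addresses are made up for illustration, and it assumes each line is
"IP<TAB>count" as wordcount-style output usually is, so the count is
field 2.

```shell
# Hypothetical sample data standing in for the downloaded part files.
workdir=$(mktemp -d)
printf '192.0.2.1\t27\n192.0.2.2\t24\n' > "$workdir/part-00000"
printf '192.0.2.3\t8\n192.0.2.4\t201\n' > "$workdir/part-00001"

# Sort all part files together on field 2 (the count):
# n = numeric comparison, r = reversed, i.e. largest count first.
sort -k2,2nr "$workdir"/part-* > "$workdir/sorted.txt"

cat "$workdir/sorted.txt"
```

Passing every part file to a single sort invocation both merges and orders
them, so no separate concatenation step is needed.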

You are likely to be unhappy at the perceived impurity of this approach, but
I would ask you to think about why one might use hadoop at all.  The best reason
is to get high performance on large problems.  The sorting part of this
problem is not all that big a deal and using a conventional sort is probably
the most effective approach here.

You can also do the sorting using hadoop.  Just use a mapper that moves the
count to the key and keeps the IP as the value.  I think that if you use an
IntWritable or LongWritable as the key then the default sorting would give
you ascending order.  You can also define the sort order yourself, by
supplying your own key comparator, so that you get descending order.  Make
sure you set the number of reducers to 1 so that you only get a single
output file.
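
The data flow of such a job can be imitated locally to see what each stage
contributes.  This is not Hadoop code, just a rough shell analogy under
assumptions: the sample counts and addresses are invented, awk plays the
part of the mapper, and a single sort plays the part of the framework's
sort feeding one reducer.

```shell
# Hypothetical counts in "IP<TAB>count" form (made-up addresses).
counts=$(printf '192.0.2.1\t27\n192.0.2.2\t24\n192.0.2.3\t8\n192.0.2.4\t201\n')

# "Map" step: swap the columns so the count becomes the key
# and the IP becomes the value.
swapped=$(printf '%s\n' "$counts" | awk -F'\t' '{ print $2 "\t" $1 }')

# "Sort" step: order by the key, numerically and in descending order.
# One sort process stands in for a single reducer, giving one output.
result=$(printf '%s\n' "$swapped" | sort -k1,1nr)

printf '%s\n' "$result"
```

The output has the count in the first column, which is also what the real
job would write unless the reducer swaps key and value back.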

If you have fewer than 10 million values, the conventional sort is likely to
be faster simply because of hadoop's startup time.

On 4/8/08 8:37 AM, "Natarajan, Senthil" <senthil@pitt.edu> wrote:

> Hi,
> I am new to MapReduce.
> After slightly modifying the example wordcount, to count the IP Address.
> I have two files part-00000 and part-00001 with the contents something like.
> IP Add       Count
> 1.2. 5. 42   27
> 2.8. 6. 6   24
>   8
> 7.9. 6. 9    201
> I want to sort it by IP Address count in descending order(i.e.) I would expect
> to see
> 7.9. 6. 9    201
> 1.2. 5. 42   27
> 2.8. 6. 6   24
>   8
> Could you please suggest how to do this.
> And to merge both the partitions (part-00000 and part-00001) in to one output
> file, is there any functions already available in MapReduce Framework.
> Or we need to use Java IO to do this.
> Thanks,
> Senthil
