hadoop-mapreduce-user mailing list archives

From Steve Lewis <lordjoe2...@gmail.com>
Subject Newbie - question - how do I use Hadoop to sort a very large file
Date Wed, 23 Jun 2010 17:15:12 GMT
Assume I have a large file called *BigData.unsorted* (say 500 GB)
consisting of lines of text, and that these lines are in random order.
I understand how to assign a key to each line, and that Hadoop will
pass the lines to my reducers in order of that key.
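To make sure I have the model right: locally, the ordering effect I am relying on would look like the plain-Java sketch below, where sorting by key stands in for the shuffle phase (extractKey is a hypothetical stand-in for whatever key I choose):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ShuffleSketch {
    // Hypothetical key function: here the key is the whole line, so
    // the shuffle's key ordering is exactly lexicographic line order.
    static String extractKey(String line) {
        return line;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(
                Arrays.asList("pear", "apple", "mango"));
        // Sorting by key locally stands in for what the shuffle does
        // across the cluster before the reducers run.
        lines.sort(Comparator.comparing(ShuffleSketch::extractKey));
        for (String line : lines) {
            System.out.println(line); // apple, mango, pear
        }
    }
}
```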

Now assume I want a single file called *BigData.sorted*  with the lines in
the order of the keys.

I think I understand how to get files part-00000, part-00001, ... but not:
1) How do I get just the lines from the reducer, without the keys?
2) How do I make the reducer generate a file with the name I want,
"*BigData.sorted*"?
3) How do I get a single output file without using a single reducer
instance, or is a single reducer the right choice for this task?

Also, it would be very nice if the output of the reducer were
compressed, say *BigData.sorted.gz*.
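For reference, the kind of driver configuration I imagine is sketched below (new mapreduce API; the paths and class name are placeholders, and the commented guesses are exactly what I am unsure about, hence the questions above):

```java
// Sketch only: a guess at a single-reducer, gzip-compressed sort job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BigSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sort BigData");
        job.setJarByClass(BigSortDriver.class);
        // Question 3: is forcing one reducer the only way to get a
        // single output file?
        job.setNumReduceTasks(1);
        // Question 1: my guess is a NullWritable value leaves only
        // the line (the Text key) in the output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("BigData.unsorted"));
        // Question 2: this names an output *directory*; the data
        // lands in a part-00000 file inside it, not in a file
        // literally named BigData.sorted.
        FileOutputFormat.setOutputPath(job, new Path("BigData.sorted"));
        // Gzip-compress the reducer output (the .gz wish above).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        job.waitForCompletion(true);
    }
}
```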
Any suggestions?
--
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA
