hadoop-common-user mailing list archives

From John Conwell <j...@iamjohn.me>
Subject Re: Sorting text data
Date Mon, 30 Jan 2012 16:40:53 GMT
If you use TextInputFormat as your MapReduce job's input format, then
Hadoop doesn't need your input data to be in a sequence file.  It will read
your text file and call the mapper once for each line (\n delimited), where
the key is the byte offset of that line from the beginning of the file, and
the value is the text of that line.
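
To make the key/value shape concrete, here is a rough plain-Java model of
the records described above. It does not use the Hadoop API; LineRecords
and toRecords are made-up names for illustration, and it only handles \n
line endings in UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of line-oriented input records.
// LineRecords and toRecords are hypothetical names, not Hadoop classes.
class LineRecords {

    // Returns one (byte offset of line start, line text) pair per \n-delimited line.
    static List<Map.Entry<Long, String>> toRecords(byte[] fileBytes) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= fileBytes.length; i++) {
            boolean atEnd = (i == fileBytes.length);
            // Emit at each newline, and at end of data if a final
            // unterminated line is still pending.
            if (atEnd ? lineStart < i : fileBytes[i] == '\n') {
                String line = new String(fileBytes, lineStart, i - lineStart,
                        StandardCharsets.UTF_8);
                records.add(new SimpleEntry<>(Long.valueOf(lineStart), line));
                lineStart = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "pear\napple\nfig\n".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<Long, String> r : toRecords(data)) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
    }
}
```

For the sample data the pairs are (0, pear), (5, apple) and (11, fig). The
offsets count bytes, including each newline, not line numbers.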

In the mapper, if you set the output key to the mapper's input value (the
text you want sorted), then Hadoop will automatically sort the text as it
works out which mapper output key/value pairs to send to which reducers as
input.  You can then just dump the reducer input straight to the reducer
output without any data manipulation.  Make sure your reducer output format
is set to TextOutputFormat.
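
As a rough illustration of why this works, the sketch below models the
map, shuffle and reduce steps in plain Java, without the Hadoop API.
ShuffleSortSketch, mapAndShuffle and reduce are made-up names, and a
TreeMap stands in for the framework's sort-and-group step. It assumes the
keys sort like Java strings; Hadoop's own key comparison may order
non-ASCII text differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of "sort by making the line the map output key".
// All names here are hypothetical; this is not Hadoop code.
class ShuffleSortSketch {

    // Map step: drop the byte-offset key and emit the line as the output key.
    // Shuffle step: group equal keys and order them (modeled by a TreeMap
    // that counts how many values arrived for each key).
    static TreeMap<String, Integer> mapAndShuffle(List<String> lines) {
        TreeMap<String, Integer> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.merge(line, 1, Integer::sum);
        }
        return grouped;
    }

    // Identity reduce: write the key once per value received, so that
    // duplicate input lines are not collapsed into one.
    static List<String> reduce(TreeMap<String, Integer> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : grouped.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("pear", "apple", "fig", "apple");
        for (String line : reduce(mapAndShuffle(lines))) {
            System.out.println(line);
        }
    }
}
```

Note that the reducer sees each distinct key once, with all of its values,
so it has to write the key once per value to keep duplicate lines. Also,
with more than one reducer each output file is sorted only within itself;
a single reducer or a total-order partitioner is needed if you want one
globally sorted result.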

On Mon, Jan 30, 2012 at 7:11 AM, sangroya <sangroyaamit@gmail.com> wrote:

> Hello,
> I have a large text file (1 GB) that I want to sort. So far, I only know
> of Hadoop examples that take a sequence file as input to the sort program.
> Does anyone know of any implementation that uses text data as input?
> Thanks,
> Amit
> -----
> Sangroya
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Sorting-text-data-tp3700231p3700231.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


John C
