hadoop-common-user mailing list archives

From David Rio <driodei...@gmail.com>
Subject Re: sort example
Date Sun, 17 May 2009 15:30:46 GMT
Thanks for the reply, Peter, but that's not it.
I use the comparator class to pass the -n flag but the shuffling does not
sort the keys numerically.
Tell me if this is wrong:
1. input (text file):
1324
212
123123
2332
145455
.....
2. The mapper job will spawn a process that runs my Ruby code, passing it
each line on stdin. My script will generate <key,value> where key =
value = line
3. Hadoop will sort the keys prior to passing them to the reducer. It will sort
them numerically because I pass the option -n to the comparator class.
4. The reducer feeds the lines into my reducer script, which behaves like
the identity class.
From what I am seeing, everything works like this except that the sorting is
not done numerically.
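To make the difference concrete, here is a minimal Ruby sketch (not part of the job itself, just an illustration) that sorts the sample input above both ways. It assumes the default key comparison behaves like a plain string sort, which matches the output I am getting:

```ruby
# Sample keys taken from the input listed above.
keys = %w[1324 212 123123 2332 145455]

# Keys compared as text, character by character (what I am seeing).
lexicographic = keys.sort

# Keys compared by numeric value (what `sort -n` would give).
numeric = keys.sort_by { |k| k.to_i }

puts "text order:    #{lexicographic.join(' ')}"
puts "numeric order: #{numeric.join(' ')}"
```

In text order 123123 comes before 212 because '1' sorts before '2', regardless of length. The part files quoted below show the same pattern (32485 lands after 1467363).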
BTW, this is my latest command to submit the job:
hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
-D mapred.text.key.comparator.options=-n \
-input /input \
-output /output \
-mapper sort_mapper.rb \
-file `pwd`/scripts_sort/sort_mapper.rb \
-reducer sort_reducer.rb \
-file `pwd`/scripts_sort/sort_reducer.rb

I know I could use the identity classes and get rid of the scripts. I have
tried that but I get an exception (I'll deal with it once I figure this
out first).
-drd


On Sat, May 16, 2009 at 11:42 PM, Peter Skomoroch <peter.skomoroch@gmail.com> wrote:

> I just copy and pasted that comparator option from the docs, the -n part is
> what you want in this case.
>
> On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch <peter.skomoroch@gmail.com> wrote:
>
> > 1) It is doing alphabetical sort by default, you can force Hadoop streaming
> > to sort numerically with:
> >
> > -D mapred.text.key.comparator.options=-k2,2nr\
> >
> > see the section "A Useful Comparator Class" in the streaming docs:
> >
> > http://hadoop.apache.org/core/docs/current/streaming.html
> > and https://issues.apache.org/jira/browse/HADOOP-2302
> >
> > 2) For the second issue, I think you will need to use 1 reducer to
> > guarantee global sort order or use another MR pass.
> >
> >
> >
> > On Sun, May 17, 2009 at 12:14 AM, David Rio <driodeiros@gmail.com> wrote:
> > >
> > > BTW,
> > > Basically, this is the unix equivalent to what I am trying to do:
> > > $ cat input_file.txt | sort -n
> > > -drd
> > >
> > > On Sat, May 16, 2009 at 11:10 PM, David Rio <driodeiros@gmail.com> wrote:
> > >
> > > > Hi,
> > > > > I am trying to sort some data with hadoop (streaming mode). The input looks
> > > > > like:
> > > >  $ cat small_numbers.txt
> > > > 9971681
> > > > 9686036
> > > > 2592322
> > > > 4518219
> > > > 1467363
> > > >
> > > > To send my job to the cluster I use:
> > > > > hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> > > > -D "mapred.reduce.tasks=2" \
> > > > -D "stream.num.map.output.key.fields=1" \
> > > > -D mapred.text.key.comparator.options=-k1,1n \
> > > > -input /input \
> > > > -output /output \
> > > > -mapper sort_mapper.rb \
> > > > -file `pwd`/scripts_sort/sort_mapper.rb \
> > > > -reducer sort_reducer.rb \
> > > > -file `pwd`/scripts_sort/sort_reducer.rb
> > > >
> > > > The mapper code basically writes key, value = input_line, input_line.
> > > > The reducer just prints the keys from the standard input.
> > > > > In case you care:
> > > >  $ cat scripts_sort/sort_*
> > > > #!/usr/bin/ruby
> > > >
> > > > STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
> > > > ---------------------------------------------------------------------
> > > > #!/usr/bin/ruby
> > > >
> > > > STDIN.each_line { |line| puts line.split[0] }
> > > > > I run the job and it completes without problems, the output looks like:
> > > > drio@milhouse:~/tmp $ cat output/part-00001
> > > > 1380664
> > > > 1467363
> > > > 32485
> > > > 3857847
> > > > 422538
> > > > 4354952
> > > > 4518219
> > > > 5719091
> > > > 7838358
> > > > 9686036
> > > > drio@milhouse:~/tmp $ cat output/part-00000
> > > > 1453024
> > > > 2592322
> > > > 3875994
> > > > 4689583
> > > > 5340522
> > > > 607354
> > > > 6447778
> > > > 6535495
> > > > 8647464
> > > > 9971681
> > > > These are my questions:
> > > > > 1. It seems the sorting (per reducer) is working but I don't know why, for
> > > > > example, 607354 is not the first number in the output.
> > > >
> > > > > 2. How can I tell hadoop to send data to the reducers in such a way that
> > > > > inputReduce1keys < inputReduce2keys < ..... < inputReduceNkeys. In that
> > > > > way I would ensure the data is fully sorted once the job is done.
> > > > > I've tried also using the identity classes for the mapper and reducer but
> > > > > the job dies generating exceptions about the input format.
> > > > > Can anyone show me or point me to some code showing how to properly perform
> > > > > sorting.
> > > > Thanks in advance,
> > > > -drd
> > > >
> > > >
> >
> >
> >
> > --
> > Peter N. Skomoroch
> > 617.285.8348
> > http://www.datawrangling.com
> > http://delicious.com/pskomoroch
> > http://twitter.com/peteskomoroch
> >
>
>
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch
>
