hadoop-common-user mailing list archives

From Peter Skomoroch <peter.skomor...@gmail.com>
Subject Re: sort example
Date Sun, 17 May 2009 04:40:07 GMT
1) It is doing an alphabetical (lexicographic) sort by default; you can force
Hadoop streaming to sort numerically with:

-D mapred.text.key.comparator.options=-k2,2nr\

see the section "A Useful Comparator Class" in the streaming docs:

http://hadoop.apache.org/core/docs/current/streaming.html
and https://issues.apache.org/jira/browse/HADOOP-2302
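Note that the comparator options only take effect if you also set the
comparator class itself. A sketch adapting your own command (the jar and
script paths are the ones from your post; adjust for your setup):

```shell
# Sketch: numeric sort on the first key field via KeyFieldBasedComparator.
# Jar/script paths are assumptions taken from the original post.
hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
  -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapred.text.key.comparator.options=-k1,1n \
  -D stream.num.map.output.key.fields=1 \
  -input /input \
  -output /output \
  -mapper sort_mapper.rb -file sort_mapper.rb \
  -reducer sort_reducer.rb -file sort_reducer.rb
```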

2) For the second issue, I think you will need to use a single reducer to
guarantee global sort order, or add another MR pass.
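You can also sanity-check the sort logic locally before touching the cluster:
a streaming job is roughly map | sort | reduce, so a Unix pipeline with
awk/cut standing in for your Ruby mapper and reducer shows what the
comparator has to achieve (a local sketch only, not Hadoop itself):

```shell
# mapper emits key<TAB>value, sort orders numerically on the key,
# and the stand-in "reducer" prints the key back out
printf '9971681\n9686036\n2592322\n4518219\n1467363\n' \
  | awk '{print $0 "\t" $0}' \
  | sort -k1,1n \
  | cut -f1
# prints 1467363, 2592322, 4518219, 9686036, 9971681 (one per line)
```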


On Sun, May 17, 2009 at 12:14 AM, David Rio <driodeiros@gmail.com> wrote:
>
> BTW,
> Basically, this is the unix equivalent to what I am trying to do:
> $ cat input_file.txt | sort -n
> -drd
>
> On Sat, May 16, 2009 at 11:10 PM, David Rio <driodeiros@gmail.com> wrote:
>
> > Hi,
> > I am trying to sort some data with Hadoop (streaming mode). The input
> > looks like:
> >  $ cat small_numbers.txt
> > 9971681
> > 9686036
> > 2592322
> > 4518219
> > 1467363
> >
> > To send my job to the cluster I use:
> > hadoop jar
> > /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> > -D "mapred.reduce.tasks=2" \
> > -D "stream.num.map.output.key.fields=1" \
> > -D mapred.text.key.comparator.options=-k1,1n \
> > -input /input \
> > -output /output \
> > -mapper sort_mapper.rb \
> > -file `pwd`/scripts_sort/sort_mapper.rb \
> > -reducer sort_reducer.rb \
> > -file `pwd`/scripts_sort/sort_reducer.rb
> >
> > The mapper code basically writes key, value = input_line, input_line.
> > The reducer just prints the keys from the standard input.
> > In case you care:
> >  $ cat scripts_sort/sort_*
> > #!/usr/bin/ruby
> >
> > STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
> > ---------------------------------------------------------------------
> > #!/usr/bin/ruby
> >
> > STDIN.each_line { |line| puts line.split[0] }
> > I run the job and it completes without problems; the output looks like:
> > drio@milhouse:~/tmp $ cat output/part-00001
> > 1380664
> > 1467363
> > 32485
> > 3857847
> > 422538
> > 4354952
> > 4518219
> > 5719091
> > 7838358
> > 9686036
> > drio@milhouse:~/tmp $ cat output/part-00000
> > 1453024
> > 2592322
> > 3875994
> > 4689583
> > 5340522
> > 607354
> > 6447778
> > 6535495
> > 8647464
> > 9971681
> > These are my questions:
> > 1. It seems the sorting (per reducer) is working but I don't know why,
> > for example, 607354 is not the first number in the output.
> >
> > 2. How can I tell hadoop to send data to the reducers in such a way that
> > inputReduce1keys < inputReduce2keys < ..... < inputReduceNkeys? That way
> > I would ensure the data is fully sorted once the job is done.
> > I've tried also using the identity classes for the mapper and reducer
> > but the job dies generating exceptions about the input format.
> > Can anyone show me or point me to some code showing how to properly
> > perform sorting?
> > Thanks in advance,
> > -drd
> >
> >



--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
