hadoop-common-user mailing list archives

From Subir S <subir.sasiku...@gmail.com>
Subject Re: hadoop streaming : need help in using custom key value separator
Date Tue, 28 Feb 2012 09:06:40 GMT
http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs

Read the link above; the options you are using below are wrong. The streaming property is stream.map.output.field.separator ("map", not "mapred"), so Hadoop is ignoring your setting and falling back to the default tab separator.
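For reference, a corrected invocation might look like the following. This is only a sketch: it keeps the paths and reducer count from the quoted command, quotes the '*' so the shell does not glob-expand it, and assumes the key is the single field before the separator (hence stream.num.map.output.key.fields=1).

```shell
hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
    -D stream.map.output.field.separator='*' \
    -D stream.num.map.output.key.fields=1 \
    -D mapred.reduce.tasks=2 \
    -mapper ./map.py \
    -reducer ./reducer.py \
    -file ./map.py \
    -file ./reducer.py \
    -input /user/inputdata \
    -output /user/outputdata
```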



On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath <austincv@gmail.com> wrote:

> When I use more than one reducer in Hadoop streaming with a custom
> separator rather than the tab, the Hadoop shuffle does not appear to
> happen as it should.
>
> This is the reducer output when I am using '\t' to separate my key value
> pair that is output from the mapper.
>
> *output from reducer 1:*
> 10321,22
> 23644,37
> 41231,42
> 23448,20
> 12325,39
> 71234,20
> *output from reducer 2:*
> 24123,43
> 33213,46
> 11321,29
> 21232,32
>
> The above output is as expected: the first column is the key and the
> second is the count. There are 10 unique keys; 6 of them appear in the
> output of the first reducer and the remaining 4 in the second.
>
> But now I use a custom separator for the key/value pairs output from my
> mapper. Here I am using '*' as the separator:
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
>
> *output from reducer 1:*
> 10321,5
> 21232,19
> 24123,16
> 33213,28
> 23644,21
> 41231,12
> 23448,18
> 11321,29
> 12325,24
> 71234,9
>
> *output from reducer 2:*
> 10321,17
> 21232,13
> 33213,18
> 23644,16
> 41231,30
> 23448,2
> 24123,27
> 12325,15
> 71234,11
>
> Now both reducers are getting all the keys: part of the values for each
> key goes to reducer 1 and the rest go to reducer 2.
> Why does it behave like this when I use a custom separator? Shouldn't
> each unique key go to exactly one reducer after shuffling?
> I am using Hadoop 0.20.205.0 and below is the command that I am using to
> run Hadoop streaming. Are there more options I should specify for
> Hadoop streaming to work properly with a custom separator?
>
> hadoop jar
> $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
> -mapper ./map.py
> -reducer ./reducer.py
> -file ./map.py
> -file ./reducer.py
> -input /user/inputdata
> -output /user/outputdata
> -verbose
>
>
> Any help is much appreciated,
> Thanks,
> Austin
>
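The quoted output is consistent with the partitioner hashing the whole line rather than just the key: with the property misspelled, streaming looks for the default tab, finds none, and treats the entire "key*value" line as the key. A minimal Python sketch below illustrates the effect; zlib.crc32 is a hypothetical deterministic stand-in for Hadoop's HashPartitioner hash (the exact hash function differs, but the key-extraction behavior is the point).

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # Stand-in for a hash partitioner: deterministic hash of the key,
    # modulo the number of reducers.
    return zlib.crc32(key.encode()) % num_reducers

lines = ["10321*5", "10321*17", "23644*21", "23644*16"]

# Misconfigured case: no tab found, so the WHOLE line is the key and
# records with the same logical key can land on different reducers.
whole_line = [partition(line, 2) for line in lines]

# Correct separator in effect: only the field before '*' is the key,
# so every record for a given key goes to the same reducer.
by_key = [partition(line.split("*", 1)[0], 2) for line in lines]

assert by_key[0] == by_key[1]   # both "10321" records, same reducer
assert by_key[2] == by_key[3]   # both "23644" records, same reducer
```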
