hadoop-common-user mailing list archives

From Kelly Burkhart <kelly.burkh...@gmail.com>
Subject Re: Map reduce streaming unable to partition
Date Thu, 10 Feb 2011 20:48:06 GMT
OK, I think I stumbled upon the correct incantation:

time hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D map.output.key.field.separator=: \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D mapred.reduce.tasks=16 \
  -input /tmp/krb/part \
  -output /tmp/krb/mp \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
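
For anyone following along, the effect of the two partitioner options
can be sketched in Python: the key is split on ":" and only the first
field (-k1,1) decides which reduce task receives the record.  The hash
used here (multiply by 31, add each byte) is an assumption made for
illustration, and key_field_hash and partition_for are hypothetical
names; the real KeyFieldBasedPartitioner may compute its hash
differently, so the partition numbers are illustrative only.

```python
def key_field_hash(field: str) -> int:
    """Illustrative 32-bit hash over the bytes of one key field.

    ASSUMPTION: a simple multiply-by-31 hash, not necessarily what
    Hadoop's KeyFieldBasedPartitioner does internally.
    """
    h = 0
    for byte in field.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def partition_for(key: str, num_reducers: int, separator: str = ":") -> int:
    """Pick a reduce task from the first key field only (like -k1,1)."""
    first_field = key.split(separator)[0]
    # Clear the sign bit so the modulo result is never negative.
    return (key_field_hash(first_field) & 0x7FFFFFFF) % num_reducers


keys = ["01:01:01", "01:02:02", "02:01:01", "02:02:02"]

# Keys sharing a first field always land in the same partition.
for key in keys:
    print(key, "->", partition_for(key, 16))

# Two distinct first fields can fill at most two of the 16 partitions.
print(len({partition_for(key, 16) for key in keys}), "partitions used")

# With more distinct first fields than reducers, some must share one.
prefixes = [f"{i:02d}" for i in range(1, 21)]
print(len({partition_for(p, 16) for p in prefixes}), "partitions for 20 prefixes")
```

Whatever hash is actually used, the partition is a function of the
first field alone, so at most as many reduce tasks receive data as
there are distinct first fields.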

Compared with my earlier command, this one drops the two stream.*
options and sets mapred.reduce.tasks=16.  It partitions and sorts the
files as I expect, leaving me with 16 output files, 14 of which are
empty and 2 non-empty.  If the data contains more partitions than
there are reduce tasks, multiple partitions will be written to some or
all of the output files.  I believe I can deal with that now that I
understand it, but it would be nice if the number of output files were
equal to the number of partitions in the data.
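
If one file per key prefix is wanted, one option is to regroup the
reducer output after the job.  The sketch below is a hypothetical
post-processing step, not part of Hadoop: it assumes the part files
have been copied locally (for example with hadoop fs -get) and that
each line is key<TAB>value with ":" separating the key fields.
group_by_first_field is an illustrative name.

```python
from collections import defaultdict


def group_by_first_field(lines, separator=":"):
    """Group 'key<TAB>value' lines by the first field of the key."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key = line.split("\t", 1)[0]
        groups[key.split(separator)[0]].append(line)
    return dict(groups)


# Sample lines standing in for the contents of one reducer output file
# that ended up holding two partitions.
sample = [
    "01:01:01\ta,a,a,a,a,1\n",
    "01:01:01\tb,b,b,b,b,1\n",
    "02:01:01\ta,a,a,a,a,5\n",
    "02:01:01\tb,b,b,b,b,5\n",
]

for prefix, rows in sorted(group_by_first_field(sample).items()):
    # In real use each group would be written to its own file.
    print(prefix, len(rows))
```

Since every record with a given first field goes to the same reduce
task, each group comes from a single output file and keeps the order
that file already had.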

-K

On Thu, Feb 10, 2011 at 11:45 AM, Kelly Burkhart
<kelly.burkhart@gmail.com> wrote:
> Hi,
>
> I'm trying to get partitioning working from a streaming map/reduce
> job.  I'm using hadoop r0.20.2.
>
> Consider the following files, both in the same hdfs directory:
>
> f1:
> 01:01:01<TAB>a,a,a,a,a,1
> 01:01:02<TAB>a,a,a,a,a,2
> 01:02:01<TAB>a,a,a,a,a,3
> 01:02:02<TAB>a,a,a,a,a,4
> 02:01:01<TAB>a,a,a,a,a,5
> 02:01:02<TAB>a,a,a,a,a,6
> 02:02:01<TAB>a,a,a,a,a,7
> 02:02:02<TAB>a,a,a,a,a,8
>
> f2:
> 01:01:01<TAB>b,b,b,b,b,1
> 01:01:02<TAB>b,b,b,b,b,2
> 01:02:01<TAB>b,b,b,b,b,3
> 01:02:02<TAB>b,b,b,b,b,4
> 02:01:01<TAB>b,b,b,b,b,5
> 02:01:02<TAB>b,b,b,b,b,6
> 02:02:01<TAB>b,b,b,b,b,7
> 02:02:02<TAB>b,b,b,b,b,8
>
> I execute the following command:
>
> hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
>  -D stream.map.output.field.separator=: \
>  -D stream.num.map.output.key.fields=3 \
>  -D map.output.key.field.separator=: \
>  -D mapred.text.key.partitioner.options=-k1,1 \
>  -input /tmp/krb/part \
>  -output /tmp/krb/mp \
>  -mapper /bin/cat \
>  -reducer /bin/cat \
>  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
>
> (actually I've executed about a zillion permutations of various -D arguments...)
>
> I end up with a single file sorted by the entire key, exactly what I
> expect if no partitioning at all is going on.  What I'm hoping to end
> up with is two output files, each file has the first component of the
> key in common:
>
> 01:01:01<TAB>a,a,a,a,a,1
> 01:01:01<TAB>b,b,b,b,b,1
> 01:01:02<TAB>a,a,a,a,a,2
> 01:01:02<TAB>b,b,b,b,b,2
> 01:02:01<TAB>a,a,a,a,a,3
> 01:02:01<TAB>b,b,b,b,b,3
> 01:02:02<TAB>a,a,a,a,a,4
> 01:02:02<TAB>b,b,b,b,b,4
>
> Can anyone suggest a command that may partition files as I describe?
>
> Also, it seems that the API has changed considerably from my version
> 0.20.x to the latest version r0.21.  Is 0.20 expected to work?  Or were
> there fatal issues that forced the major rework that resulted in
> release 0.21?
>
> Thanks,
>
> -Kelly
>
