hadoop-mapreduce-user mailing list archives

From Piyush Kansal <piyush.kan...@gmail.com>
Subject Re: Query regarding Hadoop Partitioning
Date Mon, 20 Feb 2012 08:27:20 GMT
Thanks for the immediate reply Harsh. I will try using it.

By the way, can't we achieve the same goal with Hadoop Streaming (using
Python)?

On Mon, Feb 20, 2012 at 2:59 AM, Harsh J <harsh@cloudera.com> wrote:

> Piyush,
>
> Yes. Currently the partitioned data is always sorted by (and then
> grouped by) keys before the reduce() calls begin.
>
> On Mon, Feb 20, 2012 at 12:51 PM, Piyush Kansal <piyush.kansal@gmail.com>
> wrote:
> > Thanks Harsh.
> >
> > But will it also sort the data, as the Partitioner does?
> >
> >
> > On Sun, Feb 19, 2012 at 10:54 PM, Harsh J <harsh@cloudera.com> wrote:
> >>
> >> Hi,
> >>
> >> You would find it easier to use the Java API's MultipleOutputs (and/or
> >> MultipleOutputFormat, which directly works on a configured key field),
> >> to write each key-partition out in its own file.
> >>
> >> On Mon, Feb 20, 2012 at 7:38 AM, Piyush Kansal <piyush.kansal@gmail.com>
> >> wrote:
> >> > Hi Friends,
> >> >
> >> > I have to sort a huge amount of data in the minimum possible time,
> >> > probably using partitioning. The key is composed of 3 fields
> >> > (partition, text and number).
> >> > This is how the partition is defined:
> >> >
> >> > Partition "1" for range 1-10
> >> > Partition "2" for range 11-20
> >> > Partition "3" for range 21-30
> >> >
> >> > I/P file format: partition[tab]text[tab]range-start[tab]range-end
> >> >
> >> > [cloudera@localhost kMer2]$ cat input1
> >> >
> >> > 1 chr1 1 10
> >> > 1 chr1 2 8
> >> > 2 chr1 11 18
> >> >
> >> > [cloudera@localhost kMer2]$ cat input2
> >> >
> >> > 1 chr1 3 7
> >> > 2 chr1 12 19
> >> >
> >> > [cloudera@localhost kMer2]$ cat input3
> >> >
> >> > 3 chr1 22 30
> >> >
> >> > [cloudera@localhost kMer2]$ cat input4
> >> >
> >> > 3 chr1 22 30
> >> > 1 chr1 9 10
> >> > 2 chr1 15 16
> >> >
> >> > Then I ran the following command:
> >> >
> >> > hadoop jar \
> >> > /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
> >> > -D stream.map.output.field.separator='\t' \
> >> > -D stream.num.map.output.key.fields=3 \
> >> > -D map.output.key.field.separator='\t' \
> >> > -D mapred.text.key.partitioner.options=-k1 \
> >> > -D mapred.reduce.tasks=3 \
> >> > -input /usr/pkansal/kMer2/ip \
> >> > -output /usr/pkansal/kMer2/op \
> >> > -mapper /home/cloudera/kMer2/kMer2Map.py \
> >> > -file /home/cloudera/kMer2/kMer2Map.py \
> >> > -reducer /home/cloudera/kMer2/kMer2Red.py \
> >> > -file /home/cloudera/kMer2/kMer2Red.py
> >> >
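> >> > Both mapper and reducer scripts contain just the same few lines of code:
> >> >
> >> > import sys
> >> >
> >> > # Pass-through: echo each input line with surrounding whitespace stripped
> >> > for line in sys.stdin:
> >> >     line = line.strip()
> >> >     print("%s" % line)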
> >> >
> >> > Following is the o/p:
> >> >
> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
> >> >
> >> > 2 chr1 12 19
> >> > 2 chr1 15 16
> >> > 3 chr1 22 30
> >> > 3 chr1 22 30
> >> >
> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
> >> >
> >> > 1 chr1 2 8
> >> > 1 chr1 3 7
> >> > 1 chr1 9 10
> >> > 2 chr1 11 18
> >> >
> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
> >> >
> >> > 1 chr1 1 10
> >> > 3 chr1 22 29
> >> >
> >> > This is not the o/p which I expected. I expected all records with:
> >> >
> >> > partition 1 in one single file eg part-m-00000
> >> > partition 2 in one single file eg part-m-00001
> >> > partition 3 in one single file eg part-m-00002
> >> >
> >> > Can you please suggest if I am doing it the right way?
> >> >
> >> > --
> >> > Regards,
> >> > Piyush Kansal
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >> Customer Ops. Engineer
> >> Cloudera | http://tiny.cloudera.com/about
> >
> >
> >
> >
> > --
> > Regards,
> > Piyush Kansal
> >
>
>
>
> --
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about
>



-- 
Regards,
Piyush Kansal
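
For reference, the layout expected in the quoted message (every record of partition 1 in one file, partition 2 in the next, and so on, each sorted) can be sketched in plain Python. This is only a local simulation under stated assumptions, not Hadoop code: `reducer_index`, `sort_key` and `partition_and_sort` are hypothetical helper names, the partition label is assumed to map directly to a reducer number, and range-start is compared numerically, whereas Streaming compares keys as text by default.

```python
from collections import defaultdict

# Sample records from input1..input4 in the quoted message (tab-separated).
RECORDS = [
    "1\tchr1\t1\t10", "1\tchr1\t2\t8", "2\tchr1\t11\t18",
    "1\tchr1\t3\t7", "2\tchr1\t12\t19",
    "3\tchr1\t22\t30",
    "3\tchr1\t22\t30", "1\tchr1\t9\t10", "2\tchr1\t15\t16",
]


def reducer_index(line, num_reducers=3):
    """Hypothetical partitioner: partition label 1..N maps to reducer 0..N-1."""
    partition = int(line.split("\t")[0])
    return (partition - 1) % num_reducers


def sort_key(line):
    """Order by partition, then text, then numeric range-start (an assumption)."""
    partition, text, start, _end = line.split("\t")
    return (int(partition), text, int(start))


def partition_and_sort(records, num_reducers=3):
    """Bucket records by the first key field only, then sort each bucket."""
    buckets = defaultdict(list)
    for line in records:
        buckets[reducer_index(line, num_reducers)].append(line)
    return {idx: sorted(lines, key=sort_key) for idx, lines in buckets.items()}


if __name__ == "__main__":
    for idx, lines in sorted(partition_and_sort(RECORDS).items()):
        print("part-%05d" % idx)
        for line in lines:
            print("  " + line)
```

The quoted command sets mapred.text.key.partitioner.options but passes no -partitioner option, so as far as I can tell the default hash partitioner is applied to the whole three-field key, which would explain records of one partition landing in different files. The Streaming documentation pairs that option with -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, and -k1,1 rather than -k1 restricts it to the first field. Even then the assignment is hash-based, so two partition labels may still share a reducer.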
