hadoop-mapreduce-user mailing list archives

From Piyush Kansal <piyush.kan...@gmail.com>
Subject Re: Query regarding Hadoop Partitioning
Date Fri, 24 Feb 2012 20:48:53 GMT
Hi,

I am currently stuck on an issue while extending the Partitioner class:

    public class MOPartition extends Partitioner<Text, Text>
        public MOPartition() {}

java.lang.RuntimeException: java.lang.NoSuchMethodException:
globalSort$MOPartition.<init>()

I tried defining an empty constructor, but it still didn't help. My JRE
version is 1.6.0.26.

Can you please suggest what the issue could be?
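
A possible cause, sketched under an assumption: the name globalSort$MOPartition in the trace suggests MOPartition is nested inside globalSort. If it is declared without the static modifier, it is an inner class, and its constructor implicitly takes the enclosing globalSort instance as a parameter, so a reflective lookup of a zero-argument constructor fails even though the source shows an empty one. Hadoop instantiates the configured partitioner by reflection, which would surface as the NoSuchMethodException above. The class names below (NestedCtorDemo, InnerPartition, StaticPartition) are hypothetical and stand in for globalSort and MOPartition; no Hadoop classes are used.

```java
public class NestedCtorDemo {
    // Non-static inner class: the compiled constructor takes the
    // enclosing NestedCtorDemo instance as a hidden first parameter.
    public class InnerPartition {
        public InnerPartition() {}
    }

    // Static nested class: has a genuine no-argument constructor.
    public static class StaticPartition {
        public StaticPartition() {}
    }

    // Returns true if a zero-argument constructor can be found by reflection.
    public static boolean hasNoArgConstructor(Class<?> cls) {
        try {
            cls.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("inner:  " + hasNoArgConstructor(InnerPartition.class));
        System.out.println("static: " + hasNoArgConstructor(StaticPartition.class));
    }
}
```

If this assumption holds, declaring the class as "public static class MOPartition extends Partitioner<Text, Text>" (or moving it to its own top-level file) would be the thing to try first.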

On Mon, Feb 20, 2012 at 4:12 AM, Piyush Kansal <piyush.kansal@gmail.com> wrote:

> Thanks Harsh. I will try it and will get back to you.
>
>
> On Mon, Feb 20, 2012 at 3:55 AM, Harsh J <harsh@cloudera.com> wrote:
>
>> I do not think you can do it out of the box with streaming, but
>> last.fm's Dumbo (highly recommended if you use Python M/R) and its
>> add-on Feathers libraries can do it apparently.
>>
>> See Erik Forsberg's detailed answer (second) on
>>
>> http://stackoverflow.com/questions/1626786/generating-separate-output-files-in-hadoop-streaming
>> for more.
>>
>> On Mon, Feb 20, 2012 at 1:57 PM, Piyush Kansal <piyush.kansal@gmail.com>
>> wrote:
>> > Thanks for the immediate reply Harsh. I will try using it.
>> >
>> > By the way, can't we achieve the same goal with Hadoop Streaming (using
>> > Python)?
>> >
>> >
>> > On Mon, Feb 20, 2012 at 2:59 AM, Harsh J <harsh@cloudera.com> wrote:
>> >>
>> >> Piyush,
>> >>
>> >> Yes. Currently the partitioned data is always sorted by (and then
>> >> grouped by) keys before the reduce() calls begin.
>> >>
>> >> On Mon, Feb 20, 2012 at 12:51 PM, Piyush Kansal <piyush.kansal@gmail.com>
>> >> wrote:
>> >> > Thanks Harsh.
>> >> >
>> >> > But will it also sort the data as the Partitioner does?
>> >> >
>> >> >
>> >> > On Sun, Feb 19, 2012 at 10:54 PM, Harsh J <harsh@cloudera.com> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> You would find it easier to use the Java API's MultipleOutputs (and/or
>> >> >> MultipleOutputFormat, which directly works on a configured key field),
>> >> >> to write each key-partition out in its own file.
>> >> >>
>> >> >> On Mon, Feb 20, 2012 at 7:38 AM, Piyush Kansal
>> >> >> <piyush.kansal@gmail.com>
>> >> >> wrote:
>> >> >> > Hi Friends,
>> >> >> >
>> >> >> > I have to sort a huge amount of data in the minimum possible time, probably
>> >> >> > using partitioning. The key is composed of 3 fields (partition, text and
>> >> >> > number). This is how the partitions are defined:
>> >> >> >
>> >> >> > Partition "1" for range 1-10
>> >> >> > Partition "2" for range 11-20
>> >> >> > Partition "3" for range 21-30
>> >> >> >
>> >> >> > I/P file format: partition[tab]text[tab]range-start[tab]range-end
>> >> >> >
>> >> >> > [cloudera@localhost kMer2]$ cat input1
>> >> >> >
>> >> >> > 1 chr1 1 10
>> >> >> > 1 chr1 2 8
>> >> >> > 2 chr1 11 18
>> >> >> >
>> >> >> > [cloudera@localhost kMer2]$ cat input2
>> >> >> >
>> >> >> > 1 chr1 3 7
>> >> >> > 2 chr1 12 19
>> >> >> >
>> >> >> > [cloudera@localhost kMer2]$ cat input3
>> >> >> >
>> >> >> > 3 chr1 22 30
>> >> >> >
>> >> >> > [cloudera@localhost kMer2]$ cat input4
>> >> >> >
>> >> >> > 3 chr1 22 30
>> >> >> > 1 chr1 9 10
>> >> >> > 2 chr1 15 16
>> >> >> >
>> >> >> > Then I ran the following command:
>> >> >> >
>> >> >> > hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
>> >> >> > -D stream.map.output.field.separator='\t' \
>> >> >> > -D stream.num.map.output.key.fields=3 \
>> >> >> > -D map.output.key.field.separator='\t' \
>> >> >> > -D mapred.text.key.partitioner.options=-k1 \
>> >> >> > -D mapred.reduce.tasks=3 \
>> >> >> > -input /usr/pkansal/kMer2/ip \
>> >> >> > -output /usr/pkansal/kMer2/op \
>> >> >> > -mapper /home/cloudera/kMer2/kMer2Map.py \
>> >> >> > -file /home/cloudera/kMer2/kMer2Map.py \
>> >> >> > -reducer /home/cloudera/kMer2/kMer2Red.py \
>> >> >> > -file /home/cloudera/kMer2/kMer2Red.py
>> >> >> >
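>> >> >> > Both mapper and reducer scripts contain just the same few lines of code:
>> >> >> >
>> >> >> > import sys
>> >> >> >
>> >> >> > # Echo every input line unchanged (Python 2 print statement)
>> >> >> > for line in sys.stdin:
>> >> >> >     line = line.strip()
>> >> >> >     print "%s" % (line)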
>> >> >> >
>> >> >> > Following is the o/p:
>> >> >> >
>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat
>> >> >> > /usr/pkansal/kMer2/op/part-00000
>> >> >> >
>> >> >> > 2 chr1 12 19
>> >> >> > 2 chr1 15 16
>> >> >> > 3 chr1 22 30
>> >> >> > 3 chr1 22 30
>> >> >> >
>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat
>> >> >> > /usr/pkansal/kMer2/op/part-00001
>> >> >> >
>> >> >> > 1 chr1 2 8
>> >> >> > 1 chr1 3 7
>> >> >> > 1 chr1 9 10
>> >> >> > 2 chr1 11 18
>> >> >> >
>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat
>> >> >> > /usr/pkansal/kMer2/op/part-00002
>> >> >> >
>> >> >> > 1 chr1 1 10
>> >> >> > 3 chr1 22 29
>> >> >> >
>> >> >> > This is not the o/p which I expected. I expected all records with:
>> >> >> >
>> >> >> > partition 1 in one single file eg part-m-00000
>> >> >> > partition 2 in one single file eg part-m-00001
>> >> >> > partition 3 in one single file eg part-m-00002
>> >> >> >
>> >> >> > Can you please suggest if I am doing it the right way?
>> >> >> >
>> >> >> > --
>> >> >> > Regards,
>> >> >> > Piyush Kansal
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Harsh J
>> >> >> Customer Ops. Engineer
>> >> >> Cloudera | http://tiny.cloudera.com/about
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Piyush Kansal
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >> Customer Ops. Engineer
>> >> Cloudera | http://tiny.cloudera.com/about
>> >
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Piyush Kansal
>> >
>>
>>
>>
>> --
>> Harsh J
>> Customer Ops. Engineer
>> Cloudera | http://tiny.cloudera.com/about
>>
>
>
>
> --
> Regards,
> Piyush Kansal
>
>


-- 
Regards,
Piyush Kansal
