hadoop-mapreduce-user mailing list archives

From Piyush Kansal <piyush.kan...@gmail.com>
Subject Re: Query regarding Hadoop Partitioning
Date Mon, 20 Feb 2012 06:56:21 GMT
Thanks Utkarsh.

But I can't find such a function in Hadoop. Moreover, is there any reason why
the default partitioning won't work? I mean, if it does not work, then why is
it even there? Maybe I am missing something?
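To make the question concrete, here is a small stand-alone Python sketch (my own illustration, not Hadoop code) of the difference between hashing the whole composite key, which is roughly what the default partitioner does, and partitioning on the first key field only:

```python
import hashlib

NUM_REDUCERS = 3

def partition_whole_key(key, n=NUM_REDUCERS):
    # Stand-in for the default behaviour: hash the ENTIRE composite key.
    # (md5 is used here only for determinism; Hadoop uses the key's
    # own hashCode(), but the effect is the same.)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

def partition_first_field(key, n=NUM_REDUCERS):
    # Partition on the first tab-separated field only.
    return int(key.split("\t")[0]) % n

records = ["1\tchr1\t1\t10", "1\tchr1\t2\t8", "2\tchr1\t11\t18"]

for rec in records:
    print(rec.replace("\t", " "),
          "| whole-key ->", partition_whole_key(rec),
          "| first-field ->", partition_first_field(rec))
```

With whole-key hashing, the two records whose first field is 1 are free to land in different reducers, because the rest of the key differs; partitioning on the first field alone always keeps them together.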

On Sun, Feb 19, 2012 at 10:40 PM, Utkarsh Gupta
<Utkarsh_Gupta@infosys.com> wrote:

> Hi Piyush,
>
> I think you need to override the built-in partitioning function. You can
> use a function like (first field of key) % 3. This will send all keys
> with the same first field to the same reduce process.
>
> Please correct me if I am wrong.
>
> Thanks,
> Utkarsh
>
> *From:* Piyush Kansal [mailto:piyush.kansal@gmail.com]
> *Sent:* Monday, February 20, 2012 7:39 AM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Query regarding Hadoop Partitioning
>
>
> Hi Friends,
>
> I have to sort a huge amount of data in the minimum possible time, probably
> using partitioning. The key is composed of 3 fields (partition, text and
> number). This is how the partition is defined:
>
>    - Partition "1" for range 1-10
>    - Partition "2" for range 11-20
>    - Partition "3" for range 21-30
>
> *Input file format*: partition[tab]text[tab]range-start[tab]range-end
>
> [cloudera@localhost kMer2]$ cat input1
>
>    - 1 chr1 1 10
>    - 1 chr1 2 8
>    - 2 chr1 11 18
>
> [cloudera@localhost kMer2]$ cat input2
>
>    - 1 chr1 3 7
>    - 2 chr1 12 19
>
> [cloudera@localhost kMer2]$ cat input3
>
>    - 3 chr1 22 30
>
> [cloudera@localhost kMer2]$ cat input4
>
>    - 3 chr1 22 30
>    - 1 chr1 9 10
>    - 2 chr1 15 16
>
> Then I ran the following command:
>
> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
>     -D stream.map.output.field.separator='\t' \
>     -D stream.num.map.output.key.fields=3 \
>     -D map.output.key.field.separator='\t' \
>     -D mapred.text.key.partitioner.options=-k1 \
>     -D mapred.reduce.tasks=3 \
>     -input /usr/pkansal/kMer2/ip \
>     -output /usr/pkansal/kMer2/op \
>     -mapper /home/cloudera/kMer2/kMer2Map.py \
>     -file /home/cloudera/kMer2/kMer2Map.py \
>     -reducer /home/cloudera/kMer2/kMer2Red.py \
>     -file /home/cloudera/kMer2/kMer2Red.py
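A hedged aside on the command above: in the 0.20-era streaming documentation, partitioning on a subset of key fields also requires naming the partitioner class explicitly; without `-partitioner`, streaming falls back to hashing the whole key, and `mapred.text.key.partitioner.options` has no effect. A sketch of the relevant options (option names are version-dependent; verify against your CDH release):

```shell
# Sketch only -- generic -D options must precede the command options.
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -D stream.num.map.output.key.fields=3 \
    -D num.key.fields.for.partition=1 \
    -D mapred.reduce.tasks=3 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    ... # remaining -input/-output/-mapper/-reducer/-file options as above
```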
>
> Both the mapper and reducer scripts just echo each input line:
>
> import sys
>
> for line in sys.stdin:
>     line = line.strip()
>     print "%s" % line
>
> Following is the output:
>
> [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
>
>    - 2 chr1 12 19
>    - 2 chr1 15 16
>    - 3 chr1 22 30
>    - 3 chr1 22 30
>
> [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
>
>    - 1 chr1 2 8
>    - 1 chr1 3 7
>    - 1 chr1 9 10
>    - 2 chr1 11 18
>
> [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
>
>    - 1 chr1 1 10
>    - 3 chr1 22 29
>
> This is not the output I expected. I expected all records with:
>
>    - partition 1 in one single file, e.g. part-m-00000
>    - partition 2 in one single file, e.g. part-m-00001
>    - partition 3 in one single file, e.g. part-m-00002
>
> Can you please suggest whether I am doing it the right way?
>
> --
> Regards,
> Piyush Kansal


-- 
Regards,
Piyush Kansal
