From: unmesha sreeveni <unmeshabiju@gmail.com>
Date: Wed, 21 Jan 2015 11:37:00 +0530
Subject: Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
To: User Hadoop <user@hadoop.apache.org>

I have 4 nodes and the replication factor is set to 3.

On Wed, Jan 21, 2015 at 11:15 AM, Drake 민영근 <drake.min@nexr.com> wrote:

> Yes, almost the same. I assume the most time-consuming part was copying the
> model data from the datanode which holds it to the actual processing node
> (tasktracker or nodemanager).
>
> How about the model data's replication factor? How many nodes do you have?
> If you have 4 or more nodes, you can increase the replication with the
> following command. I suggest a number equal to your number of datanodes,
> but first you should confirm there is enough space in HDFS.
>
>    - hdfs dfs -setrep -w 6 /user/model/data
>
> Drake 민영근 Ph.D
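(For reference, the same replication change can also be made from Java through
the FileSystem API. This is only a rough sketch; the class name is made up, and
the path and target replication are just the example values from the command
above.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaiseModelReplication {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path model = new Path("/user/model/data");   // example path from the thread

        // Check the current replication before changing it
        FileStatus status = fs.getFileStatus(model);
        System.out.println("current replication: " + status.getReplication());

        // Raise replication to 6, like `hdfs dfs -setrep -w 6 /user/model/data`.
        // setReplication only requests the change; HDFS re-replicates in the background.
        fs.setReplication(model, (short) 6);
      }
    }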
> On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>
>> Yes, I tried the same, Drake.
>>
>> I don't know if I understood your answer.
>>
>> Instead of loading the model into setup() through the cache, I read it
>> directly from HDFS in the map section, and for each incoming record I found
>> the distance to all the records in HDFS.
>> i.e., if R and S are my datasets, R is the model data stored in HDFS,
>> and when S is taken for processing:
>> S1-R (finding distance with the whole R set)
>> S2-R
>>
>> But it is taking a long time as it needs to compute all the distances.
>>
>> On Wed, Jan 21, 2015 at 10:31 AM, Drake 민영근 <drake.min@nexr.com> wrote:
>>
>>> In my suggestion, map or reduce tasks do not use the distributed cache.
>>> They use the file directly from HDFS with short-circuit local reads. It is
>>> like a shared-storage method, but almost every node has the data because
>>> of the high replication factor.
>>>
>>> Drake 민영근 Ph.D
>>>
>>> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>
>>>> But still, if the model is very large, how can we load it into the
>>>> Distributed cache or something like that?
>>>> Here is one source:
>>>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>>>> But it is confusing me.
>>>>
>>>> On Wed, Jan 21, 2015 at 7:30 AM, Drake 민영근 <drake.min@nexr.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How about this? The large model data stays in HDFS, but with many
>>>>> replicas, and the MapReduce program reads the model from HDFS. In theory,
>>>>> if the replication factor of the model data equals the number of data
>>>>> nodes, then with the Short Circuit Local Reads feature of the HDFS
>>>>> datanode, the map or reduce tasks read the model data from their own
>>>>> local disks.
>>>>>
>>>>> This way may use a lot of HDFS space, but the annoying partition problem
>>>>> will be gone.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Drake 민영근 Ph.D
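(A minimal sketch of that approach, with a hypothetical class name and job
property "knn.model.path": the model is read straight from HDFS once per task
in setup(), assuming it fits in the task's memory, and each incoming record is
scored against it in map(). Short-circuit local reads are enabled on the
cluster side, typically via dfs.client.read.shortcircuit and
dfs.domain.socket.path in hdfs-site.xml, so the mapper code itself does not
change.)

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: loads the model (R) from HDFS once per task in setup(),
    // then scores each incoming test record (S) against it in map().
    public class KnnHdfsMapper extends Mapper<LongWritable, Text, Text, Text> {

      private final List<double[]> model = new ArrayList<>();

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // "knn.model.path" is an example job property, not a standard Hadoop key
        Path modelPath = new Path(conf.get("knn.model.path", "/user/model/data"));
        FileSystem fs = modelPath.getFileSystem(conf);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(modelPath)))) {
          String line;
          while ((line = reader.readLine()) != null) {
            model.add(parseFeatures(line));   // parse once, reuse for every map() call
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        double[] s = parseFeatures(value.toString());
        double best = Double.MAX_VALUE;
        int bestIndex = -1;
        for (int i = 0; i < model.size(); i++) {   // S_i against the whole R set
          double d = euclidean(s, model.get(i));
          if (d < best) { best = d; bestIndex = i; }
        }
        context.write(new Text(value.toString()), new Text(bestIndex + "\t" + best));
      }

      private static double[] parseFeatures(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
        return v;
      }

      private static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
          double diff = a[i] - b[i];
          sum += diff * diff;
        }
        return Math.sqrt(sum);
      }
    }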
>>>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>>>
>>>>>> Is there any way? Waiting for a reply. I have posted the question
>>>>>> everywhere, but no one is responding. I feel like this is the right place
>>>>>> to ask doubts, as some of you may have come across the same issue and got
>>>>>> stuck.
>>>>>>
>>>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>>>>> of data is not possible across Hadoop MapReduce, but I need to check
>>>>>>> whether it can somehow be done in Hadoop MapReduce as well, because I
>>>>>>> found some papers on KNN in Hadoop too. And I am trying to compare the
>>>>>>> performance as well.
>>>>>>>
>>>>>>> Hope some pointers can help me.
>>>>>>>
>>>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you considered implementing this using something like Spark? That
>>>>>>>> could be much easier than raw map-reduce.
>>>>>>>>
>>>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>>>>>>> for predicting the records.
>>>>>>>>>
>>>>>>>>> Here is the example for KNN.
>>>>>>>>>
>>>>>>>>> [image: Inline image 1]
>>>>>>>>>
>>>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able to
>>>>>>>>> load it into the Distributed cache.
>>>>>>>>>
>>>>>>>>> One way is to split/partition the model result into some files,
>>>>>>>>> perform the distance calculation for all records in each file, and
>>>>>>>>> then find the minimum distance and the maximum occurrence of the
>>>>>>>>> class label to predict the outcome.
>>>>>>>>>
>>>>>>>>> How can we partition the file and perform the operation on these
>>>>>>>>> partitions?
>>>>>>>>>
>>>>>>>>> i.e., 1st record <Distance> partition1, partition2, ....
>>>>>>>>>       2nd record <Distance> partition1, partition2, ...
>>>>>>>>>
>>>>>>>>> This is what came to my mind.
>>>>>>>>>
>>>>>>>>> Is there any other way?
>>>>>>>>>
>>>>>>>>> Any pointers would help me.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards
>>>>>>>>>
>>>>>>>>> Unmesha Sreeveni U.B
>>>>>>>>> Hadoop, Bigdata Developer
>>>>>>>>> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
>>>>>>>>> http://www.unmeshasreeveni.blogspot.in/

--
Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/
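(A rough sketch of the partitioned approach described in the original question
above, with hypothetical class names and an example k: assume each map task
scans one partition of the model and emits, for every test record, its local
candidates as "distance<TAB>label" keyed by the test record id; a reducer then
keeps the k globally nearest candidates and predicts by majority vote.)

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: merges per-partition candidates for one test record,
    // keeps the k globally nearest, and predicts by majority vote over their labels.
    public class KnnMergeReducer extends Reducer<Text, Text, Text, Text> {

      private static final int K = 5;   // example value; make it a job property in practice

      private static class Candidate {
        final double distance;
        final String label;
        Candidate(double distance, String label) { this.distance = distance; this.label = label; }
      }

      @Override
      protected void reduce(Text testRecordId, Iterable<Text> candidates, Context context)
          throws IOException, InterruptedException {
        // Max-heap on distance so the farthest of the current k can be evicted cheaply
        PriorityQueue<Candidate> nearest =
            new PriorityQueue<>(K, (a, b) -> Double.compare(b.distance, a.distance));
        for (Text candidate : candidates) {
          String[] parts = candidate.toString().split("\t");
          nearest.add(new Candidate(Double.parseDouble(parts[0]), parts[1]));
          if (nearest.size() > K) {
            nearest.poll();   // drop the farthest of the k+1
          }
        }
        // Majority vote over the labels of the k nearest neighbours
        Map<String, Integer> votes = new HashMap<>();
        for (Candidate c : nearest) {
          votes.merge(c.label, 1, Integer::sum);
        }
        String predicted = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
          if (e.getValue() > best) { best = e.getValue(); predicted = e.getKey(); }
        }
        if (predicted != null) {
          context.write(testRecordId, new Text(predicted));
        }
      }
    }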