Subject: Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
From: Drake 민영근 <drake.min@nexr.com>
To: user@hadoop.apache.org
Date: Wed, 21 Jan 2015 14:45:40 +0900

Yes, almost the same. I assume the most time-consuming part was copying the
model data from the datanode that holds it to the node that actually runs the
task (the tasktracker or nodemanager).

What is the model data's replication factor, and how many nodes do you have?
If you have 4 or more nodes, you can increase the replication with the
following command. I suggest a value equal to the number of datanodes, but
first you should confirm there is enough free space in HDFS.

- hdfs dfs -setrep -w 6 /user/model/data
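A minimal sketch of the same check-then-raise-replication step from Java, using
the HDFS FileSystem API. The path and the target replication of 6 are just the
values from the command above, and the class name is invented for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RaiseModelReplication {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());

          Path model = new Path("/user/model/data");  // same path as in the shell command
          short targetReplication = 6;                // suggestion: one copy per datanode

          // Extra space needed = file size * number of additional replicas.
          FileStatus st = fs.getFileStatus(model);
          long extraBytes = st.getLen() * (targetReplication - st.getReplication());
          long freeBytes = fs.getStatus().getRemaining();

          if (extraBytes > freeBytes) {
              System.err.println("Not enough free HDFS space for replication " + targetReplication);
          } else {
              // Programmatic equivalent of: hdfs dfs -setrep -w 6 /user/model/data
              // (setReplication only requests the change; the namenode copies the
              //  blocks in the background, which is what -w waits for on the shell).
              fs.setReplication(model, targetReplication);
          }
      }
  }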
Drake 민영근 Ph.D

On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:

> Yes, I tried the same, Drake.
>
> I don't know if I understood your answer correctly.
>
> Instead of loading the model into setup() through the cache, I read it
> directly from HDFS in the map section, and for each incoming record I found
> the distance to all the records in HDFS.
> i.e. if R and S are my datasets, R is the model data stored in HDFS, and
> when S is taken for processing:
> S1-R (finding the distance to the whole R set)
> S2-R
>
> But it is taking a long time because it needs to compute all the distances.
>
> On Wed, Jan 21, 2015 at 10:31 AM, Drake 민영근 <drake.min@nexr.com> wrote:
>
>> In my suggestion, the map or reduce tasks do not use the distributed cache.
>> They use the file directly from HDFS with short-circuit local reads. It is
>> like a shared-storage method, but almost every node holds the data because
>> of the high replication factor.
>>
>> Drake 민영근 Ph.D
>>
>> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>
>>> But still, if the model is very large, how can we load it into the
>>> Distributed Cache or something like that?
>>> Here is one source:
>>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>>> But it is confusing me.
>>>
>>> On Wed, Jan 21, 2015 at 7:30 AM, Drake 민영근 <drake.min@nexr.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> How about this? The large model data stays in HDFS, but with many
>>>> replications, and the MapReduce program reads the model from HDFS. In
>>>> theory, if the replication factor of the model data equals the number of
>>>> datanodes, then with the Short Circuit Local Reads feature of the HDFS
>>>> datanode, the map or reduce tasks read the model data from their own
>>>> disks.
>>>>
>>>> This way may use a lot of HDFS space, but the annoying partition problem
>>>> will be gone.
>>>>
>>>> Thanks
>>>>
>>>> Drake 민영근 Ph.D
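A minimal sketch of the approach described above: read the highly replicated
model straight from HDFS inside the task, instead of shipping it through the
distributed cache. The model path, the comma-separated numeric record format,
and all class and field names are assumptions for illustration. The model is
loaded once per task in setup(), which still requires it to fit in the task's
heap, and a real k-NN would keep the k smallest distances rather than only the
single nearest one:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

      // Model records R, loaded once per task rather than once per input record.
      private final List<double[]> model = new ArrayList<>();

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
          // Hypothetical location of the highly replicated model file.
          Path modelPath = new Path("/user/model/data");
          FileSystem fs = modelPath.getFileSystem(context.getConfiguration());
          try (BufferedReader reader =
                   new BufferedReader(new InputStreamReader(fs.open(modelPath)))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  model.add(parseFeatures(line));
              }
          }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          double[] s = parseFeatures(value.toString());
          double best = Double.MAX_VALUE;
          int bestIndex = -1;
          for (int i = 0; i < model.size(); i++) {
              double d = euclidean(s, model.get(i));
              if (d < best) { best = d; bestIndex = i; }
          }
          // Emit the test record together with its nearest model record (1-NN).
          context.write(value, new Text("nearest=" + bestIndex + " dist=" + best));
      }

      // Assumes comma-separated numeric features of equal length in R and S.
      private static double[] parseFeatures(String line) {
          String[] parts = line.split(",");
          double[] v = new double[parts.length];
          for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
          return v;
      }

      private static double euclidean(double[] a, double[] b) {
          double sum = 0;
          for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; sum += diff * diff; }
          return Math.sqrt(sum);
      }
  }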
>>>>
>>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>>
>>>>> Is there any way?
>>>>> Waiting for a reply. I have posted the question everywhere, but nobody
>>>>> is responding. I feel like this is the right place to ask doubts, as
>>>>> some of you may have come across the same issue and got stuck.
>>>>>
>>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>>>
>>>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>>>> of data is not possible across Hadoop MapReduce, but I need to check
>>>>>> whether it can be done somehow in Hadoop MapReduce as well, because I
>>>>>> found some papers on KNN in Hadoop too, and I am trying to compare the
>>>>>> performance.
>>>>>>
>>>>>> Hope some pointers can help me.
>>>>>>
>>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>>
>>>>>>> Have you considered implementing it using something like Spark? That
>>>>>>> could be much easier than raw map-reduce.
>>>>>>>
>>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>>>>>
>>>>>>>> In a KNN-like algorithm we need to load the model data into the cache
>>>>>>>> for predicting the records.
>>>>>>>>
>>>>>>>> Here is the example for KNN.
>>>>>>>>
>>>>>>>> [image: Inline image 1]
>>>>>>>>
>>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>>>> to load it into the Distributed Cache.
>>>>>>>>
>>>>>>>> One way is to split/partition the model result into some files,
>>>>>>>> perform the distance calculation for all records in each file, and
>>>>>>>> then find the minimum distance and the most frequent class label to
>>>>>>>> predict the outcome.
>>>>>>>>
>>>>>>>> How can we partition the file and perform the operation on these
>>>>>>>> partitions?
>>>>>>>>
>>>>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>>>>
>>>>>>>> This is what came to my thought.
>>>>>>>>
>>>>>>>> Is there any further way?
>>>>>>>>
>>>>>>>> Any pointers would help me.
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Thanks & Regards*
>>>>>>>>
>>>>>>>> *Unmesha Sreeveni U.B*
>>>>>>>> *Hadoop, Bigdata Developer*
>>>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>>>> http://www.unmeshasreeveni.blogspot.in/
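The partitioned approach sketched in the original question at the bottom of the
thread reduces to: scan each model partition independently, keep the smallest
distances found in each, then merge the per-partition candidates and predict by
the most frequent class label among the overall k nearest. A minimal plain-Java
sketch of that merge-and-vote logic is below; all class and method names are
invented for illustration, and in a real job kNearestInPartition would run
wherever a partition is processed (e.g. one map task per partition) while
predict would run on the merged candidates for each test record (e.g. in a
reducer):

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.PriorityQueue;

  public class PartitionedKnn {

      /** One labelled model record (hypothetical representation). */
      static class LabelledPoint {
          final double[] features;
          final String label;
          LabelledPoint(double[] features, String label) { this.features = features; this.label = label; }
      }

      /** Distance/label pair kept while scanning a partition. */
      static class Neighbour {
          final double distance;
          final String label;
          Neighbour(double distance, String label) { this.distance = distance; this.label = label; }
      }

      /** The k nearest neighbours of the query within a single model partition. */
      static List<Neighbour> kNearestInPartition(double[] query, List<LabelledPoint> partition, int k) {
          // Max-heap on distance, so the worst of the current k can be evicted.
          PriorityQueue<Neighbour> heap =
              new PriorityQueue<>(Comparator.comparingDouble((Neighbour n) -> n.distance).reversed());
          for (LabelledPoint p : partition) {
              heap.add(new Neighbour(euclidean(query, p.features), p.label));
              if (heap.size() > k) heap.poll();
          }
          return new ArrayList<>(heap);
      }

      /** Merge per-partition candidates, keep the global k nearest, and vote on the label. */
      static String predict(List<List<Neighbour>> perPartition, int k) {
          List<Neighbour> all = new ArrayList<>();
          perPartition.forEach(all::addAll);
          all.sort(Comparator.comparingDouble(n -> n.distance));
          Map<String, Integer> votes = new HashMap<>();
          for (Neighbour n : all.subList(0, Math.min(k, all.size()))) {
              votes.merge(n.label, 1, Integer::sum);
          }
          return votes.entrySet().stream()
                      .max(Map.Entry.comparingByValue())
                      .map(Map.Entry::getKey)
                      .orElse(null);
      }

      static double euclidean(double[] a, double[] b) {
          double sum = 0;
          for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
          return Math.sqrt(sum);
      }
  }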