From: unmesha sreeveni <unmeshabiju@gmail.com>
To: User Hadoop <user@hadoop.apache.org>
Date: Wed, 21 Jan 2015 10:42:24 +0530
Subject: Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Yes, I tried the same, Drake.

I am not sure I understood your answer correctly. Instead of loading the
model into setup() through the distributed cache, I read it directly from
HDFS in the map() section, and for each incoming record I compute the
distance to every record in HDFS. That is, if R and S are my datasets, R
is the model data stored in HDFS, and when S is taken for processing:

S1 - R (finding the distance to the whole R set)
S2 - R
...

But it is taking a long time, since every S record has to scan the whole
of R to compute the distances.
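Roughly, the map side looks like the sketch below. It is simplified: the
comma-separated numeric format, the knn.model.path job property, and
k = 1 are stand-ins for my real setup.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KnnStreamMapper extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        double[] s = parse(value.toString());

        // "knn.model.path" is a placeholder job property pointing at R on HDFS.
        Path model = new Path(context.getConfiguration().get("knn.model.path"));
        FileSystem fs = FileSystem.get(context.getConfiguration());

        double best = Double.MAX_VALUE; // k = 1 for brevity; a bounded
        String bestLabel = "";          // priority queue generalises to any k.

        // Full scan of R for every single S record -- this is the slow part.
        try (BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(model)))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] cols = line.split(",");
            double d = 0; // squared Euclidean distance over the features
            for (int i = 0; i < s.length; i++) {
              double diff = s[i] - Double.parseDouble(cols[i]);
              d += diff * diff;
            }
            if (d < best) {
              best = d;
              bestLabel = cols[cols.length - 1]; // class label is last column
            }
          }
        }
        context.write(value, new Text(bestLabel));
      }

      // Assumes the incoming S record is comma-separated numeric features.
      private static double[] parse(String line) {
        String[] cols = line.split(",");
        double[] v = new double[cols.length];
        for (int i = 0; i < cols.length; i++) {
          v[i] = Double.parseDouble(cols[i]);
        }
        return v;
      }
    }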
On Wed, Jan 21, 2015 at 10:31 AM, Drake 민영근 <drake.min@nexr.com> wrote:

> In my suggestion, the map or reduce tasks do not use the distributed
> cache. They read the file directly from HDFS with short-circuit local
> reads. It is like a shared-storage method, but almost every node holds
> the data because of the high replication factor.
>
> Drake 민영근 Ph.D
>
> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni
> <unmeshabiju@gmail.com> wrote:
>
>> But still, if the model is very large, how can we load it into the
>> distributed cache or anything like that?
>> Here is one source:
>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>> But it is confusing me.
>>
>> On Wed, Jan 21, 2015 at 7:30 AM, Drake 민영근 <drake.min@nexr.com> wrote:
>>
>>> Hi,
>>>
>>> How about this? The large model data stays in HDFS, but with many
>>> replications, and the MapReduce program reads the model from HDFS.
>>> In theory, the replication factor of the model data equals the
>>> number of data nodes, and with the Short Circuit Local Reads feature
>>> of the HDFS datanode, the map or reduce tasks read the model data
>>> from their own disks.
>>>
>>> This way may use a lot of HDFS capacity, but the annoying partition
>>> problem will be gone.
>>>
>>> Thanks
>>>
>>> Drake 민영근 Ph.D
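A minimal sketch of the replication side of this suggestion:
setReplication() is the standard FileSystem call, while the path and
factor below are placeholders. Short-circuit local reads additionally
need dfs.client.read.shortcircuit=true and a dfs.domain.socket.path
configured on the datanodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // One-off utility: bump the replication factor of the model file so
    // (almost) every datanode keeps a local copy that short-circuit
    // local reads can then serve straight from its own disk.
    public class RaiseModelReplication {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path; pick a factor close to the number of datanodes.
        fs.setReplication(new Path("/user/unmesha/knn/model.csv"), (short) 10);
      }
    }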
>>>
>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni
>>> <unmeshabiju@gmail.com> wrote:
>>>
>>>> Is there any way?
>>>> Waiting for a reply. I have posted the question everywhere, but no
>>>> one is responding. I feel this is the right place to ask doubts, as
>>>> some of you may have come across the same issue and got stuck.
>>>>
>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni
>>>> <unmeshabiju@gmail.com> wrote:
>>>>
>>>>> Yes, one of my friends is implementing the same. I know global
>>>>> sharing of data is not possible across Hadoop MapReduce, but I
>>>>> need to check whether it can be done somehow in Hadoop MapReduce
>>>>> as well, because I also found some papers on KNN in Hadoop.
>>>>> And I am trying to compare the performance too.
>>>>>
>>>>> Hope some pointers can help me.
>>>>>
>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning
>>>>> <ted.dunning@gmail.com> wrote:
>>>>>
>>>>>> Have you considered implementing this using something like Spark?
>>>>>> That could be much easier than raw map-reduce.
>>>>>>
>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni
>>>>>> <unmeshabiju@gmail.com> wrote:
>>>>>>
>>>>>>> In a KNN-like algorithm we need to load the model data into a
>>>>>>> cache to predict incoming records.
>>>>>>>
>>>>>>> Here is the example for KNN.
>>>>>>>
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be
>>>>>>> able to load it into the distributed cache.
>>>>>>>
>>>>>>> One way is to split/partition the model result into several
>>>>>>> files, perform the distance calculation for all records in each
>>>>>>> file, and then find the minimum distance and the most frequent
>>>>>>> class label to predict the outcome.
>>>>>>>
>>>>>>> How can we partition the file and perform the operation on these
>>>>>>> partitions?
>>>>>>>
>>>>>>> i.e., 1st record <Distance> partition1, partition2, ...
>>>>>>>       2nd record <Distance> partition1, partition2, ...
>>>>>>>
>>>>>>> This is what came to my mind.
>>>>>>>
>>>>>>> Is there any other way?
>>>>>>>
>>>>>>> Any pointers would help me.
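For what it is worth, one hypothetical way to combine the per-partition
results described above: have each partition emit its local nearest
candidates keyed by the S record id as "<distance>,<classLabel>" values,
then merge them in a reducer that keeps the k globally smallest
distances and takes the majority class label. All names and formats
below are made up for illustration.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Merges per-partition candidates: each value is a
    // "<distance>,<classLabel>" string for one candidate neighbour
    // found in some partition of the model.
    public class KnnMergeReducer extends Reducer<Text, Text, Text, Text> {

      private static final int K = 5; // illustrative choice of k

      @Override
      protected void reduce(Text recordId, Iterable<Text> candidates,
          Context context) throws IOException, InterruptedException {
        // Max-heap on distance: the head is the farthest of the current
        // k, so it gets evicted whenever a closer candidate arrives.
        PriorityQueue<String> nearest =
            new PriorityQueue<>((a, b) -> Double.compare(dist(b), dist(a)));
        for (Text c : candidates) {
          nearest.add(c.toString());
          if (nearest.size() > K) {
            nearest.poll(); // drop the farthest candidate
          }
        }

        // Majority vote over the class labels of the k global nearest.
        Map<String, Integer> votes = new HashMap<>();
        for (String c : nearest) {
          votes.merge(c.substring(c.indexOf(',') + 1), 1, Integer::sum);
        }
        String best = null;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
          if (best == null || e.getValue() > votes.get(best)) {
            best = e.getKey();
          }
        }
        context.write(recordId, new Text(best));
      }

      private static double dist(String candidate) {
        return Double.parseDouble(candidate.substring(0, candidate.indexOf(',')));
      }
    }

Since each partition contributes at most k candidates per record, the
reducer only ever merges k times the number of partitions values per
record, which stays small even when the model itself is 1 or 2 GB.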
--
Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/