Subject: Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
From: Drake 민영근
To: user@hadoop.apache.org
Cc: user@mahout.apache.org
Date: Wed, 21 Jan 2015 11:00:40 +0900

Hi,

How about this? The large model data stays in HDFS, but with many replicas, and the MapReduce program reads the model from HDFS. In theory, the replication factor of the model data can equal the number of data nodes, and with the Short-Circuit Local Reads feature of the HDFS datanode, the map or reduce tasks then read the model data from their own local disks.

This approach may use a lot of HDFS storage, but the annoying partitioning problem goes away.
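As a rough sketch of the idea (the "knn.model.path" configuration key and the plain-text, one-row-per-line model format are just assumptions here, not a definitive implementation), a mapper could read the model straight from HDFS in setup() instead of going through the distributed cache:

// Sketch only. Assumes the model file's replication was raised beforehand,
// e.g.  hdfs dfs -setrep -w <numDataNodes> /path/to/model
// and that short-circuit local reads are enabled in hdfs-site.xml
// (dfs.client.read.shortcircuit = true, dfs.domain.socket.path = ...).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsModelKnnMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String> model = new ArrayList<>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path modelPath = new Path(conf.get("knn.model.path"));   // hypothetical key
    FileSystem fs = modelPath.getFileSystem(conf);
    // With replication close to the number of datanodes and short-circuit
    // reads enabled, this open() is normally served from the local disk.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(modelPath)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        model.add(line);   // parse into your real model representation here
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // compute distances between 'value' and the in-memory model rows here
  }
}

If the model is too big to hold in memory at once, the same setup() could stream it and keep only what the distance computation needs; the point is only that the read stays local.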
Thanks

Drake 민영근 Ph.D

On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:

> Is there any way?
> Waiting for a reply. I have posted the question everywhere, but no one is
> responding. I feel this is the right place to ask, as some of you may have
> come across the same issue and gotten stuck.
>
> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>
>> Yes, one of my friends is implementing the same thing. I know global
>> sharing of data is not possible across Hadoop MapReduce, but I need to
>> check whether it can somehow be done in Hadoop MapReduce as well, because
>> I found some papers on KNN in Hadoop too. And I am trying to compare the
>> performance as well.
>>
>> Hope some pointers can help me.
>>
>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> Have you considered implementing this using something like Spark? That
>>> could be much easier than raw map-reduce.
>>>
>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>
>>>> In a KNN-like algorithm we need to load the model data into a cache for
>>>> predicting the records.
>>>>
>>>> Here is the example for KNN.
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> So if the model is a large file, say 1 or 2 GB, we will not be able to
>>>> load it all into the distributed cache.
>>>>
>>>> One way is to split/partition the model result into several files,
>>>> perform the distance calculation for all records against each file, and
>>>> then find the minimum distance and the most frequent class label to
>>>> predict the outcome.
>>>>
>>>> How can we partition the file and perform the operation on these
>>>> partitions?
>>>>
>>>> i.e. 1st record <Distance> partition1, partition2, ....
>>>>      2nd record <Distance> partition1, partition2, ...
>>>>
>>>> This is what came to my mind.
>>>>
>>>> Is there any other way?
>>>>
>>>> Any pointers would help me.
>>>>
>>>> --
>>>> Thanks & Regards
>>>>
>>>> Unmesha Sreeveni U.B
>>>> Hadoop, Bigdata Developer
>>>> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
>>>> http://www.unmeshasreeveni.blogspot.in/
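P.S. On the partition-per-file approach described in the quoted question above: one way to express it in plain MapReduce is to make the big model the job input, so each map task naturally gets one partition of it, and ship the much smaller set of records to classify via the distributed cache. The class names, the "label,f1,f2,..." line format, and k = 5 below are all made-up assumptions for the sketch, not a definitive implementation:

// Sketch only: mappers emit (recordId, "distance<TAB>label") for every model
// row in their split; the reducer keeps the k nearest per record and takes a
// majority vote. A combiner or in-mapper top-k would cut shuffle volume.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitionedKnn {

  public static class DistanceMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> testIds = new ArrayList<>();
    private final List<double[]> testFeatures = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
      // Load the small test set from the distributed cache into
      // testIds/testFeatures here (omitted in this sketch).
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");
      String label = parts[0];                       // assumed: label first,
      double[] modelRow = toDoubles(parts, 1);       // then the features
      for (int i = 0; i < testIds.size(); i++) {
        double d = euclidean(modelRow, testFeatures.get(i));
        context.write(new Text(testIds.get(i)), new Text(d + "\t" + label));
      }
    }
  }

  public static class VoteReducer extends Reducer<Text, Text, Text, Text> {
    private static final int K = 5;   // assumed k

    @Override
    protected void reduce(Text recordId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Max-heap on distance, bounded to K, so the farthest candidate is evicted.
      PriorityQueue<String[]> nearest = new PriorityQueue<>(K,
          (a, b) -> Double.compare(Double.parseDouble(b[0]), Double.parseDouble(a[0])));
      for (Text v : values) {
        nearest.add(v.toString().split("\t"));
        if (nearest.size() > K) nearest.poll();      // drop current farthest
      }
      Map<String, Integer> votes = new HashMap<>();
      for (String[] candidate : nearest) votes.merge(candidate[1], 1, Integer::sum);
      String predicted =
          Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
      context.write(recordId, new Text(predicted));
    }
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(sum);
  }

  static double[] toDoubles(String[] parts, int from) {
    double[] out = new double[parts.length - from];
    for (int i = from; i < parts.length; i++) out[i - from] = Double.parseDouble(parts[i]);
    return out;
  }
}

The reduce step is exactly the "minimum distance and most frequent class label" merge described in the question, and splitting the model falls out of the normal input splits, so no manual partition bookkeeping is needed.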