From: unmesha sreeveni <unmeshabiju@gmail.com>
Date: Thu, 15 Jan 2015 12:34:10 +0530
Subject: Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
To: User Hadoop <user@hadoop.apache.org>
Cc: user@mahout.apache.org

Yes, one of my friends is implementing the same. I know global sharing of
data is not possible across Hadoop MapReduce, but I need to check whether it
can somehow be done in Hadoop MapReduce as well, because I have also found
some papers on KNN in Hadoop, and I am trying to compare the performance.
Hope some pointers can help me.

On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Have you considered implementing this using something like Spark? That
> could be much easier than raw map-reduce.
>
> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <unmeshabiju@gmail.com>
> wrote:
>
>> In a KNN-like algorithm we need to load the model data into a cache for
>> predicting the records.
>>
>> Here is the example for KNN.
>>
>> [image: Inline image 1]
>>
>> So if the model is a large file, say 1 or 2 GB, will we be able to load
>> it into the distributed cache?
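
The distributed cache itself can usually ship a file of this size; the
tighter constraint is holding it in each task's heap. A minimal sketch of
reading a cached model in a mapper's setup(), assuming the file was
registered with Job.addCacheFile(...) and holds comma-separated features
with the class label last (the KnnMapper name and record format here are
hypothetical, not from the thread):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    // In-memory copy of the model; only works while it fits in the task heap.
    private final List<double[]> modelFeatures = new ArrayList<>();
    private final List<String> modelLabels = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles(); // registered via Job.addCacheFile(...)
        if (cacheFiles == null || cacheFiles.length == 0) {
            return;
        }
        // The cached file is localized into the task's working directory,
        // so it can be opened by its bare name.
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(","); // assumed format: f1,f2,...,fn,label
                double[] features = new double[parts.length - 1];
                for (int i = 0; i < features.length; i++) {
                    features[i] = Double.parseDouble(parts[i]);
                }
                modelFeatures.add(features);
                modelLabels.add(parts[parts.length - 1]);
            }
        }
    }
}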
>> One way is to split/partition the model result into several files,
>> perform the distance calculation for all the records in each file, then
>> find the minimum distance and the most frequently occurring class label,
>> and predict the outcome.
>>
>> How can we partition the file and perform the operation on these
>> partitions?
>>
>> i.e. 1st record <Distance> partition1, partition2, ...
>>      2nd record <Distance> partition1, partition2, ...
>>
>> This is what came to my mind.
>>
>> Is there any further way?
>>
>> Any pointers would help me.
>>
>> --
>> Thanks & Regards
>>
>> Unmesha Sreeveni U.B
>> Hadoop, Bigdata Developer
>> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
>> http://www.unmeshasreeveni.blogspot.in/
>

--
Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/
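
P.S. On the partition idea quoted above: if each map task scores a test
record against only its own partition of the model and emits its k best
candidates per record as "distance,label" values, one reduce call per
record can merge them, keep the global k nearest, and take a majority vote.
A hedged sketch of that merge step, assuming that value format; KnnReducer
and K are hypothetical names, and the mapper side is not shown:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KnnReducer extends Reducer<Text, Text, Text, Text> {

    private static final int K = 5; // assumed neighbourhood size

    @Override
    protected void reduce(Text recordId, Iterable<Text> candidates, Context context)
            throws IOException, InterruptedException {
        // Max-heap on distance: the root is the worst of the current k best,
        // so it is evicted whenever a closer candidate arrives.
        PriorityQueue<String[]> nearest = new PriorityQueue<>(K,
                (a, b) -> Double.compare(Double.parseDouble(b[0]),
                                         Double.parseDouble(a[0])));

        for (Text candidate : candidates) {
            String[] distLabel = candidate.toString().split(","); // "distance,label"
            if (nearest.size() < K) {
                nearest.add(distLabel);
            } else if (Double.parseDouble(distLabel[0])
                       < Double.parseDouble(nearest.peek()[0])) {
                nearest.poll();
                nearest.add(distLabel);
            }
        }

        // Majority vote over the labels of the k global nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (String[] neighbour : nearest) {
            votes.merge(neighbour[1], 1, Integer::sum);
        }
        String predicted = votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();

        context.write(recordId, new Text(predicted));
    }
}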