Subject: Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
From: Drake 민영근
To: user@hadoop.apache.org
Cc: user@mahout.apache.org
Date: Wed, 21 Jan 2015 11:00:40 +0900

Hi,

How about this? The large model data stays in HDFS, but with many replicas, and the MapReduce program reads the model from HDFS. In theory, the replication factor of the model data can equal the number of data nodes, and with the Short-Circuit Local Reads feature of the HDFS datanode, the map or reduce tasks then read the model data from their own local disks.

This approach may use a lot of HDFS storage, but the annoying partitioning problem goes away.
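As a rough sketch of the idea (the "knn.model.path" configuration key and the plain-text, one-row-per-line model format are just assumptions here, not a definitive implementation), a mapper could read the model straight from HDFS in setup() instead of going through the distributed cache:

// Sketch only. Assumes the model file's replication was raised beforehand,
// e.g.  hdfs dfs -setrep -w <numDataNodes> /path/to/model
// and that short-circuit local reads are enabled in hdfs-site.xml
// (dfs.client.read.shortcircuit = true, dfs.domain.socket.path = ...).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsModelKnnMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String> model = new ArrayList<>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path modelPath = new Path(conf.get("knn.model.path"));   // hypothetical key
    FileSystem fs = modelPath.getFileSystem(conf);
    // With replication close to the number of datanodes and short-circuit
    // reads enabled, this open() is normally served from the local disk.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(modelPath)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        model.add(line);   // parse into your real model representation here
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // compute distances between 'value' and the in-memory model rows here
  }
}

If the model is too big to hold in memory at once, the same setup() could stream it and keep only what the distance computation needs; the point is only that the read stays local.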
Thanks

Drake 민영근 Ph.D

On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:

> Is there any way?
> Waiting for a reply. I have posted the question everywhere, but no one is
> responding. I feel this is the right place to ask, as some of you may have
> come across the same issue and gotten stuck.
>
> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>
>> Yes, one of my friends is implementing the same thing. I know global
>> sharing of data is not possible across Hadoop MapReduce, but I need to
>> check whether it can somehow be done in Hadoop MapReduce as well, because
>> I found some papers on KNN in Hadoop too. And I am trying to compare the
>> performance as well.
>>
>> Hope some pointers can help me.
>>
>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> Have you considered implementing this using something like Spark? That
>>> could be much easier than raw map-reduce.
>>>
>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <unmeshabiju@gmail.com> wrote:
>>>
>>>> In a KNN-like algorithm we need to load the model data into a cache for
>>>> predicting the records.
>>>>
>>>> Here is the example for KNN.
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> So if the model is a large file, say 1 or 2 GB, we will not be able to
>>>> load it all into the distributed cache.
>>>>
>>>> One way is to split/partition the model result into several files,
>>>> perform the distance calculation for all records against each file, and
>>>> then find the minimum distance and the most frequent class label to
>>>> predict the outcome.
>>>>
>>>> How can we partition the file and perform the operation on these
>>>> partitions?
>>>>
>>>> i.e. 1st record <Distance> partition1, partition2, ....
>>>>      2nd record <Distance> partition1, partition2, ...
>>>>
>>>> This is what came to my mind.
>>>>
>>>> Is there any other way?
>>>>
>>>> Any pointers would help me.
>>>>
>>>> --
>>>> Thanks & Regards
>>>>
>>>> Unmesha Sreeveni U.B
>>>> Hadoop, Bigdata Developer
>>>> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
>>>> http://www.unmeshasreeveni.blogspot.in/
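P.S. On the partition-per-file approach described in the quoted question above: one way to express it in plain MapReduce is to make the big model the job input, so each map task naturally gets one partition of it, and ship the much smaller set of records to classify via the distributed cache. The class names, the "label,f1,f2,..." line format, and k = 5 below are all made-up assumptions for the sketch, not a definitive implementation:

// Sketch only: mappers emit (recordId, "distance<TAB>label") for every model
// row in their split; the reducer keeps the k nearest per record and takes a
// majority vote. A combiner or in-mapper top-k would cut shuffle volume.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitionedKnn {

  public static class DistanceMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> testIds = new ArrayList<>();
    private final List<double[]> testFeatures = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
      // Load the small test set from the distributed cache into
      // testIds/testFeatures here (omitted in this sketch).
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");
      String label = parts[0];                       // assumed: label first,
      double[] modelRow = toDoubles(parts, 1);       // then the features
      for (int i = 0; i < testIds.size(); i++) {
        double d = euclidean(modelRow, testFeatures.get(i));
        context.write(new Text(testIds.get(i)), new Text(d + "\t" + label));
      }
    }
  }

  public static class VoteReducer extends Reducer<Text, Text, Text, Text> {
    private static final int K = 5;   // assumed k

    @Override
    protected void reduce(Text recordId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Max-heap on distance, bounded to K, so the farthest candidate is evicted.
      PriorityQueue<String[]> nearest = new PriorityQueue<>(K,
          (a, b) -> Double.compare(Double.parseDouble(b[0]), Double.parseDouble(a[0])));
      for (Text v : values) {
        nearest.add(v.toString().split("\t"));
        if (nearest.size() > K) nearest.poll();      // drop current farthest
      }
      Map<String, Integer> votes = new HashMap<>();
      for (String[] candidate : nearest) votes.merge(candidate[1], 1, Integer::sum);
      String predicted =
          Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
      context.write(recordId, new Text(predicted));
    }
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(sum);
  }

  static double[] toDoubles(String[] parts, int from) {
    double[] out = new double[parts.length - from];
    for (int i = from; i < parts.length; i++) out[i - from] = Double.parseDouble(parts[i]);
    return out;
  }
}

The reduce step is exactly the "minimum distance and most frequent class label" merge described in the question, and splitting the model falls out of the normal input splits, so no manual partition bookkeeping is needed.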