Subject: Re: best way to join?
From: dexter morgan <dextermorgan4u@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 28 Aug 2012 16:48:05 +0300

Dear Ted,

I understand your solution (I think); I just hadn't thought of it in that particular way. But say I have 1M data points: running kNN that way, with k = 1M and n = 10 (every point becomes a cluster that pulls in up to 10 points), feels like overkill.

How can I achieve the same result WITHOUT using Mahout, just running over the dataset itself? I think it would even have the same complexity, O(n^2), calculating the distance between each pair of distinct points, and maybe the reducer would just sort them in DESC order for each point.

Thank you!
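P.S. To make that concrete, here is a rough, untested sketch of the map-only job I have in mind (new "mapreduce" API, which should be fine on CDH3 / 0.20). Since the distributed cache is out, the mapper simply side-reads points.txt straight from HDFS in setup() and holds it in memory (1M lat/lng points should only be a few tens of MB), then keeps the N closest points for each input record in a small max-heap. The class name, the knn.points.path / knn.n properties and the plain Euclidean distance (instead of a proper geodesic one) are all just placeholders for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KnnSelfJoin {

  // Reads "id \t lat,lng", writes "id[lat,lng] \t [[lat,lng],...]" (N nearest, farthest first).
  public static class KnnMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final ArrayList<double[]> points = new ArrayList<double[]>(); // {id, lat, lng}
    private int n;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      n = conf.getInt("knn.n", 10);
      // Side-read the whole points file straight from HDFS (no distributed cache).
      FileSystem fs = FileSystem.get(conf);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path(conf.get("knn.points.path")))));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        String[] coords = parts[1].split(",");
        points.add(new double[] { Double.parseDouble(parts[0]),
            Double.parseDouble(coords[0]), Double.parseDouble(coords[1]) });
      }
      in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      long id = Long.parseLong(parts[0]);
      String[] coords = parts[1].split(",");
      double lat = Double.parseDouble(coords[0]), lng = Double.parseDouble(coords[1]);

      // Bounded max-heap: the farthest of the current N candidates sits on top.
      PriorityQueue<double[]> best = new PriorityQueue<double[]>(n, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) { return Double.compare(b[0], a[0]); }
      });
      for (double[] p : points) {
        if ((long) p[0] == id) continue;               // skip the point itself
        double d = Math.hypot(lat - p[1], lng - p[2]); // plain Euclidean, not geodesic
        if (best.size() < n) {
          best.add(new double[] { d, p[1], p[2] });
        } else if (d < best.peek()[0]) {
          best.poll();
          best.add(new double[] { d, p[1], p[2] });
        }
      }

      // Draining the max-heap yields the neighbours in descending distance order.
      StringBuilder out = new StringBuilder();
      out.append(id).append('[').append(lat).append(',').append(lng).append("]\t[");
      boolean first = true;
      while (!best.isEmpty()) {
        double[] nb = best.poll();
        if (!first) out.append(',');
        out.append('[').append(nb[1]).append(',').append(nb[2]).append(']');
        first = false;
      }
      out.append(']');
      context.write(new Text(out.toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("knn.points.path", args[0]);       // the job runs over the same file it side-reads
    conf.setInt("knn.n", Integer.parseInt(args[2]));
    Job job = new Job(conf, "knn-self-join");
    job.setJarByClass(KnnSelfJoin.class);
    job.setMapperClass(KnnMapper.class);
    job.setNumReduceTasks(0);                   // map-only: each point's list is already complete
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Something like "hadoop jar knn.jar KnnSelfJoin /user/me/points.txt /user/me/knn-out 10" would run it (jar name and paths are made up). The O(n^2) distance work gets spread across however many map tasks the input splits produce, and no reducer is needed because each point's neighbour list is already complete, and sorted, when its mapper emits it.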
On Tue, Aug 28, 2012 at 12:52 AM, Ted Dunning <tdunning@maprtech.com> wrote:

> Mahout is getting some very fast knn code in version 0.8.
>
> The basic workflow is that you would first do a large-scale clustering of
> the data. Then you would make a second pass using the clustering to
> facilitate fast search for nearby points.
>
> The clustering will require two map-reduce jobs: one to find the cluster
> definitions and a second, map-only, to assign points to clusters in a form
> to be used by the second pass. The second pass is a map-only process as
> well.
>
> In order to make this as efficient as possible, it is desirable to use a
> distribution of Hadoop that allows you to directly map the cluster data
> structures into shared memory. If you have NFS access to your cluster,
> this is easy. Otherwise, it is considerably trickier.
>
>
> On Mon, Aug 27, 2012 at 4:15 PM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>
>> Dear list,
>>
>> Let's say I have a file like this:
>> id \t lat,lng   <-- structure
>>
>> 1\t40.123,-50.432
>> 2\t41.431,-43.32
>> ...
>> ...
>> Let's call it 'points.txt'.
>> I'm trying to build a map-reduce job that runs over this BIG points file
>> and it should output a file that will look like:
>> id[lat,lng] \t [list of points in standard JSON]   <--- structure
>>
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>> ...
>>
>> Basically it should run on ITSELF, and grab for each point the N (it will
>> be an argument for the job) CLOSEST points (the mappers should calculate
>> the distance).
>>
>> Distributed cache is not an option; what else? Not sure whether to classify
>> it as a map-join, a reduce-join, or both.
>> Would you do this in HIVE somehow?
>> Is it feasible in a single job?
>>
>> Would LOVE to hear your suggestions, code (if you think it's not that
>> hard), or whatnot.
>> BTW using CDH3 - rev 3 (20.23)
>>
>> Thanks!