Subject: Re: best way to join?
From: dexter morgan <dextermorgan4u@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 28 Aug 2012 16:48:05 +0300

Dear Ted,

I understand your solution (I think); I just hadn't thought of it in that particular way. But say I have 1M data points: running kNN that way, with k = 1M and n = 10 (every point becomes a cluster that pulls in up to 10 points), feels like overkill.

How can I achieve the same result WITHOUT using Mahout, just running over the dataset itself? I think it would even have the same complexity, O(n^2), calculating the distance between each pair of distinct points, and maybe the reducer would just sort them in DESC order for each point.

Thank you!
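P.S. To make that concrete, here is a rough, untested sketch of the map-only job I have in mind (new "mapreduce" API, which should be fine on CDH3 / 0.20). Since the distributed cache is out, the mapper simply side-reads points.txt straight from HDFS in setup() and holds it in memory (1M lat/lng points should only be a few tens of MB), then keeps the N closest points for each input record in a small max-heap. The class name, the knn.points.path / knn.n properties and the plain Euclidean distance (instead of a proper geodesic one) are all just placeholders for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KnnSelfJoin {

  // Reads "id \t lat,lng", writes "id[lat,lng] \t [[lat,lng],...]" (N nearest, farthest first).
  public static class KnnMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final ArrayList<double[]> points = new ArrayList<double[]>(); // {id, lat, lng}
    private int n;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      n = conf.getInt("knn.n", 10);
      // Side-read the whole points file straight from HDFS (no distributed cache).
      FileSystem fs = FileSystem.get(conf);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path(conf.get("knn.points.path")))));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        String[] coords = parts[1].split(",");
        points.add(new double[] { Double.parseDouble(parts[0]),
            Double.parseDouble(coords[0]), Double.parseDouble(coords[1]) });
      }
      in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      long id = Long.parseLong(parts[0]);
      String[] coords = parts[1].split(",");
      double lat = Double.parseDouble(coords[0]), lng = Double.parseDouble(coords[1]);

      // Bounded max-heap: the farthest of the current N candidates sits on top.
      PriorityQueue<double[]> best = new PriorityQueue<double[]>(n, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) { return Double.compare(b[0], a[0]); }
      });
      for (double[] p : points) {
        if ((long) p[0] == id) continue;               // skip the point itself
        double d = Math.hypot(lat - p[1], lng - p[2]); // plain Euclidean, not geodesic
        if (best.size() < n) {
          best.add(new double[] { d, p[1], p[2] });
        } else if (d < best.peek()[0]) {
          best.poll();
          best.add(new double[] { d, p[1], p[2] });
        }
      }

      // Draining the max-heap yields the neighbours in descending distance order.
      StringBuilder out = new StringBuilder();
      out.append(id).append('[').append(lat).append(',').append(lng).append("]\t[");
      boolean first = true;
      while (!best.isEmpty()) {
        double[] nb = best.poll();
        if (!first) out.append(',');
        out.append('[').append(nb[1]).append(',').append(nb[2]).append(']');
        first = false;
      }
      out.append(']');
      context.write(new Text(out.toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("knn.points.path", args[0]);       // the job runs over the same file it side-reads
    conf.setInt("knn.n", Integer.parseInt(args[2]));
    Job job = new Job(conf, "knn-self-join");
    job.setJarByClass(KnnSelfJoin.class);
    job.setMapperClass(KnnMapper.class);
    job.setNumReduceTasks(0);                   // map-only: each point's list is already complete
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Something like "hadoop jar knn.jar KnnSelfJoin /user/me/points.txt /user/me/knn-out 10" would run it (jar name and paths are made up). The O(n^2) distance work gets spread across however many map tasks the input splits produce, and no reducer is needed because each point's neighbour list is already complete, and sorted, when its mapper emits it.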
On Tue, Aug 28, 2012 at 12:52 AM, Ted Dunning <tdunning@maprtech.com> wrote:

> Mahout is getting some very fast knn code in version 0.8.
>
> The basic workflow is that you would first do a large-scale clustering of
> the data. Then you would make a second pass using the clustering to
> facilitate fast search for nearby points.
>
> The clustering will require two map-reduce jobs: one to find the cluster
> definitions and a second, map-only, to assign points to clusters in a form
> to be used by the second pass. The second pass is a map-only process as
> well.
>
> In order to make this as efficient as possible, it is desirable to use a
> distribution of Hadoop that allows you to directly map the cluster data
> structures into shared memory. If you have NFS access to your cluster,
> this is easy. Otherwise, it is considerably trickier.
>
>
> On Mon, Aug 27, 2012 at 4:15 PM, dexter morgan <dextermorgan4u@gmail.com> wrote:
>
>> Dear list,
>>
>> Let's say I have a file like this:
>> id \t lat,lng   <-- structure
>>
>> 1\t40.123,-50.432
>> 2\t41.431,-43.32
>> ...
>> ...
>> Let's call it 'points.txt'.
>> I'm trying to build a map-reduce job that runs over this BIG points file
>> and it should output a file that will look like:
>> id[lat,lng] \t [list of points in standard JSON]   <--- structure
>>
>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
>> ...
>>
>> Basically it should run on ITSELF, and grab for each point the N (it will
>> be an argument for the job) CLOSEST points (the mappers should calculate
>> the distance).
>>
>> Distributed cache is not an option; what else? Not sure whether to classify
>> it as a map-join, a reduce-join, or both.
>> Would you do this in HIVE somehow?
>> Is it feasible in a single job?
>>
>> Would LOVE to hear your suggestions, code (if you think it's not that
>> hard), or whatnot.
>> BTW using CDH3 - rev 3 (20.23)
>>
>> Thanks!