From: Björn-Elmar Macek
Date: Tue, 04 Sep 2012 10:17:31 +0200
To: user@hadoop.apache.org
Subject: Re: best way to join?

Hi Dexter,

I think what you want is a clustering of the points based on Euclidean distance, or density-based clustering (http://en.wikipedia.org/wiki/Cluster_analysis). I bet some of these are already implemented quite well in Mahout: AFAIK that is the data-mining framework built on Hadoop.

Best of luck!
Elmar

On 27.08.2012 22:15, dexter morgan wrote:
> Dear list,
>
> Let's say I have a file like this:
> id \t lat,lng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> Let's call it 'points.txt'.
> I'm trying to build a map-reduce job that runs over this BIG points
> file, and it should output a file that will look like:
> id[lat,lng] \t [list of points in JSON standard] <--- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should run on ITSELF, and grab for each point the N (it
> will be an argument for the job) CLOSEST points (the mappers should
> calculate the distance)..
>
> Distributed cache is not an option; what else? I'm not sure whether to
> classify it as a map-side join, a reduce-side join, or both.
> Would you do this in Hive somehow?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think it's not that
> hard), or whatnot.
> BTW using CDH3 - rev 3 (20.23)
>
> Thanks!
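Not from the original thread — a minimal local Python sketch of the naive approach the question describes, so the map/reduce split is concrete: the "map" side joins the file against itself and emits a distance for every (point, candidate) pair, and the "reduce" side keeps only the N closest candidates per point id. The sample lines come from the question; everything else (function names, N=1) is illustrative, and a real Hadoop job would of course need a smarter partitioning than this O(n²) cross join.

```python
import heapq
import math

def parse(line):
    # "id\tlat,lng" per the structure described in the question
    pid, coords = line.split("\t")
    lat, lng = map(float, coords.split(","))
    return pid, (lat, lng)

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def map_phase(points):
    # self-join: emit (id, (distance, candidate)) for every other point,
    # as the mappers would when the file is run against itself
    for pid, p in points.items():
        for qid, q in points.items():
            if pid != qid:
                yield pid, (euclidean(p, q), q)

def reduce_phase(pairs, n):
    # group by point id, then keep only the N closest candidates
    grouped = {}
    for pid, (dist, q) in pairs:
        grouped.setdefault(pid, []).append((dist, q))
    return {pid: [q for _, q in heapq.nsmallest(n, cands)]
            for pid, cands in grouped.items()}

lines = ["1\t40.123,-50.432", "2\t41.431,-43.32", "3\t40.5,-50.0"]
points = dict(parse(l) for l in lines)
nearest = reduce_phase(map_phase(points), n=1)
print(nearest["1"])  # the single closest point to point 1
```

The same grouping falls out naturally in Hadoop if the mapper emits the point id as the key: the shuffle delivers all candidates for one point to one reducer, which only has to keep a size-N heap instead of materializing the whole list.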