Date: Mon, 27 Aug 2012 23:15:32 +0300
Subject: best way to join?
From: dexter morgan
To: user@hadoop.apache.org

Dear list,

Let's say I have a file like this:

id \t lat,lng <-- structure

1\t40.123,-50.432

2\t41.431,-43.32

...

Let's call it 'points.txt'.

I'm trying to build a MapReduce job that runs over this BIG points file and outputs a file that looks like this:

id[lat,lng] \t [list of points in JSON standard] <--- structure

1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]

2[41.431,-43.32]\t[[40.123,-50.432],...[,]]

...
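
The input and output layouts above can be sketched as a pair of small helpers. This is only an illustration in plain Python, not Hadoop code: the names `parse_point` and `format_record` are hypothetical, and it assumes the neighbor list should be written as compact JSON with no spaces.

```python
import json


def parse_point(line):
    # Input layout: id<TAB>lat,lng
    point_id, coords = line.rstrip("\n").split("\t")
    lat, lng = coords.split(",")
    return point_id, float(lat), float(lng)


def format_record(point_id, lat, lng, neighbors):
    # Output layout: id[lat,lng]<TAB>[[lat,lng],[lat,lng],...]
    # neighbors is a list of (lat, lng) pairs.
    key = "%s[%s,%s]" % (point_id, json.dumps(lat), json.dumps(lng))
    value = json.dumps([[n_lat, n_lng] for n_lat, n_lng in neighbors],
                       separators=(",", ":"))
    return key + "\t" + value
```

For example, `format_record(*parse_point("1\t40.123,-50.432"), neighbors=[(41.431, -43.32)])` builds one output line for point 1 with a single neighbor.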

Basically, it should run against ITSELF and, for each point, grab the N CLOSEST points (N will be an argument to the job); the mappers should calculate the distance.
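
To pin down the computation being asked for, here is a minimal single-machine sketch that compares every point with every other point. It is not a MapReduce job and would not scale to a BIG file as written. The distance metric is an assumption (great-circle distance via the haversine formula, since the message does not say which distance to use), and the names `haversine_km` and `n_closest` are hypothetical.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed unit is kilometres


def haversine_km(a, b):
    # Great-circle distance between two (lat, lng) pairs given in degrees.
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def n_closest(points, n):
    # points: dict mapping id -> (lat, lng).
    # Returns a dict mapping id -> list of the n nearest other points,
    # closest first. Brute force: O(len(points) ** 2) distance computations.
    result = {}
    for pid, coord in points.items():
        others = (c for oid, c in points.items() if oid != pid)
        result[pid] = heapq.nsmallest(
            n, others, key=lambda c: haversine_km(coord, c))
    return result
```

With `points = {"1": (40.123, -50.432), "2": (41.431, -43.32)}`, calling `n_closest(points, 1)` pairs each point with the other one. A distributed version has to avoid this all-pairs comparison, which is the part the question is about.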

Distributed cache is not an option, so what else? I'm not sure whether to classify this as a map-side join, a reduce-side join, or both.

Would you do this in Hive somehow?

Is it feasible in a single job?

Would LOVE to hear your suggestions, code (if you think it's not that hard), or whatnot.

BTW using CDH3 - rev 3 (20.23)

Thanks!
