Date: Mon, 27 Aug 2012 23:24:21 +0300
Subject: Re: best way to join?
From: dexter morgan
To: user@hadoop.apache.org



Dear list,

Let's say I have a file like this:
id \t lat,lng <-- structure

1\t40.123,-50.432
2\t41.431,-43.32
...
...
Let's call it 'points.txt'.
I'm trying to build a map-reduce job that runs over this BIG points file and outputs
a file that looks like this:
id[lat,lng] \t [list of points in JSON standard] <--- structure

1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
...

Basically, it should run on ITSELF and, for each point, grab the N CLOSEST points (N will be an argument to the job). The mappers should calculate the distance.
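
To make the expected output concrete, here is a minimal single-machine sketch in Python of the computation described above. It is not a MapReduce job and not a proposed solution: it compares every point with every other point in memory, which is exactly what won't scale to the BIG file. The function names (`parse_point`, `distance`, `n_closest`) are made up for illustration, and plain Euclidean distance on the raw lat/lng values is an assumption, since the distance measure isn't specified.

```python
import heapq
import json
import math


def parse_point(line):
    """Parse one 'id<TAB>lat,lng' line into (id, (lat, lng))."""
    point_id, coords = line.rstrip("\n").split("\t")
    lat, lng = (float(x) for x in coords.split(","))
    return point_id, (lat, lng)


def distance(a, b):
    """Euclidean distance on raw lat/lng values (an assumption, not a given)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def n_closest(lines, n):
    """Yield one output line per point, listing its n closest other points."""
    points = [parse_point(line) for line in lines if line.strip()]
    for point_id, coords in points:
        # Every other point is a candidate; this is the quadratic part.
        others = (c for other_id, c in points if other_id != point_id)
        nearest = heapq.nsmallest(n, others, key=lambda c: distance(coords, c))
        yield "%s%s\t%s" % (
            point_id,
            json.dumps(list(coords), separators=(",", ":")),
            json.dumps([list(c) for c in nearest], separators=(",", ":")),
        )


if __name__ == "__main__":
    sample = ["1\t40.123,-50.432\n", "2\t41.431,-43.32\n"]
    for out_line in n_closest(sample, 1):
        print(out_line)
```

The all-pairs comparison in `n_closest` is the part that would have to be distributed across mappers and reducers somehow.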

Distributed cache is not an option, so what else? I'm not sure whether to classify it as a map-join, a reduce-join, or both.
Would you do this in HIVE somehow?
Is it feasible in a single job?

I would LOVE to hear your suggestions, code (if you think it's not that hard), or whatnot.
BTW, I'm using CDH3 - rev 3 (20.23).

Thanks!
