From: Todd Lipcon
Date: Wed, 16 Dec 2009 08:51:29 -0800
Subject: Re: map reduce to achieve cartessian product
To: common-user@hadoop.apache.org
In-Reply-To: <4B28E21D.5060106@tid.es>
Message-ID: <45f85f70912160851h247ab55fhf51273c01d786c0a@mail.gmail.com>

Hi Eguzki,

Is one of the tables
vastly smaller than the other? If one is small enough to fit in RAM, you can
do this like so:

1. Add the small file to the DistributedCache.
2. In the configure() method of the mapper, read the entire file into an
   ArrayList or somesuch in RAM.
3. Set the input path of the MR job to be the large file. Use no reduce
   tasks.
4. In the map function, simply iterate over the ArrayList and output each
   pair.

If the small file doesn't fit in RAM, you could split it into chunks first,
and then run one MR job per chunk. Assumedly, though, one of the two files is
small - if they're both big you're going to have a very, very big output!

-Todd

On Wed, Dec 16, 2009 at 5:35 AM, Eguzki Astiz Lezaun wrote:

> Hi,
>
> First, I would like to apologise if this question has been asked before (I
> am quite sure it has been) and I would appreciate it very much if someone
> replies with a link to the answer.
>
> My question is quite simple.
>
> I have two files or datasets, each holding a list of integers.
>
> example:
> dataset A: (a,b,c)
> dataset B: (d,e,f)
>
> I would like to design a map-reduce job to have at the output:
>
> (a,d)
> (a,e)
> (a,f)
> (b,d)
> (b,e)
> (b,f)
> (c,d)
> (c,e)
> (c,f)
>
> I guess this is a typical Cartesian product of two datasets.
>
> I found ways to do joins using map-reduce, but a common key is required on
> both datasets. This is not the case.
>
> Any clue how to do this?
>
> Thanks in advance.
> --
> Eguzki Astiz Lezaun
> Technology and Architecture Strategy
> C\ VIA AUGUSTA, 177 Tel: +34 93 36 53179
> 08021 BARCELONA www.tid.es
>
> Telefónica Investigación y Desarrollo
> EKO Do you need to print it? We protect the environment.
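
The map-side approach from the reply above (hold the small dataset in memory, pair every record of the large input with every cached record, and skip the reduce phase) can be sketched in plain Python. This is only a local illustration under stated assumptions: it does not use the Hadoop API or the DistributedCache, both datasets are passed in as ordinary lists of lines, and every function name here (load_small_dataset, map_record, cross_product, chunks, cross_product_chunked) is hypothetical. The last function models the "one job per chunk" fallback for a small file that does not fit in RAM.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def load_small_dataset(lines: Iterable[str]) -> List[str]:
    """Stand-in for step 2: read the whole small dataset into memory once."""
    return [line.strip() for line in lines if line.strip()]


def map_record(record: str, cached: List[str]) -> Iterator[Pair]:
    """Stand-in for step 4: pair one large-input record with every cached record."""
    for other in cached:
        yield (record, other)


def cross_product(large: Iterable[str], small: Iterable[str]) -> Iterator[Pair]:
    """Map-only pass: the output is the mapper output, there is no reduce step."""
    cached = load_small_dataset(small)
    for line in large:
        record = line.strip()
        if record:
            yield from map_record(record, cached)


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Split the small dataset into pieces of at most `size` records."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def cross_product_chunked(large: List[str], small: List[str],
                          chunk_size: int) -> Iterator[Pair]:
    """Fallback: one full pass over the large input per chunk of the small one."""
    for chunk in chunks(small, chunk_size):
        yield from cross_product(large, chunk)


if __name__ == "__main__":
    # Datasets A and B from the original question.
    for pair in cross_product(["a", "b", "c"], ["d", "e", "f"]):
        print(pair)
```

With the datasets from the question, the single-pass version yields the nine pairs in the order listed in the original message. The chunked version yields the same pairs grouped by chunk, and it rereads the large input once per chunk, which is the cost of not keeping the whole small dataset in memory.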