Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <4B28E21D.5060106@tid.es>
References: <4B28E21D.5060106@tid.es>
From: Todd Lipcon <todd@cloudera.com>
Date: Wed, 16 Dec 2009 08:51:29 -0800
Message-ID: <45f85f70912160851h247ab55fhf51273c01d786c0a@mail.gmail.com>
Subject: Re: map reduce to achieve cartessian product
To: common-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=000e0cd28d849b6fc4047adb526d

--000e0cd28d849b6fc4047adb526d
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Eguzki,

Is one of the tables vastly smaller than the other? If one is small enough
to fit in RAM, you can do this like so:

1. Add the small file to the DistributedCache
2. In the configure() method of the mapper, read the entire file into an
ArrayList or somesuch in RAM
3. Set the input path of the MR job to be the large file. Use no reducers
(set the number of reduce tasks to zero), so the job is map-only
4. In the map function, simply iterate over the ArrayList and output each
pair.
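
The four steps above can be sketched in plain Java. This is only an
illustration of the logic, not real Hadoop code: the class name
CrossProductSketch and the run() driver are made up for this example, the
List passed to configure() stands in for the file read from the
DistributedCache, and the output List stands in for the OutputCollector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java sketch of the map-side cross product.
 * No Hadoop classes are used; configure() and map() only mimic the shape
 * of the mapper methods so the logic can run standalone.
 */
class CrossProductSketch {
    // Assumption: stands in for the small file loaded from the DistributedCache.
    private final List<String> smallSide = new ArrayList<>();

    // Step 2: read the whole small dataset into memory once per map task.
    void configure(List<String> smallFileLines) {
        smallSide.clear();
        smallSide.addAll(smallFileLines);
    }

    // Step 4: for one record of the large file, emit a pair with every
    // record held in memory.
    void map(String largeRecord, List<String> output) {
        for (String small : smallSide) {
            output.add("(" + largeRecord + "," + small + ")");
        }
    }

    // Hypothetical driver: calls map() once per large-file record, as a
    // map-only job (step 3) would, and collects the output directly.
    static List<String> run(List<String> largeFile, List<String> smallFile) {
        CrossProductSketch mapper = new CrossProductSketch();
        mapper.configure(smallFile);
        List<String> output = new ArrayList<>();
        for (String record : largeFile) {
            mapper.map(record, output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> pairs = run(Arrays.asList("a", "b", "c"),
                                 Arrays.asList("d", "e", "f"));
        for (String pair : pairs) {
            System.out.println(pair);
        }
    }
}
```

With dataset A as the large input and dataset B as the in-memory side, this
prints the nine pairs from your example, (a,d) through (c,f).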

If the small file doesn't fit in RAM, you could split it into chunks first,
and then run one MR job per chunk.
Presumably, though, one of the two files is small - if they're both big
you're going to have a very very big output!
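
A rough plain-Java sketch of the chunked variant, plus the output-size
arithmetic. Everything named here (ChunkedCrossProductSketch, chunk(),
runJob(), outputPairs()) is hypothetical, and runJob() only simulates one
map-only job per chunk in memory rather than submitting anything to a
cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkedCrossProductSketch {
    // Split the smaller dataset into pieces of at most chunkSize records,
    // each assumed to be small enough to hold in RAM.
    static List<List<String>> chunk(List<String> records, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, records.size());
            chunks.add(new ArrayList<>(records.subList(start, end)));
        }
        return chunks;
    }

    // Simulates one job: pair every large-file record with every record
    // of a single chunk.
    static List<String> runJob(List<String> largeFile, List<String> chunk) {
        List<String> output = new ArrayList<>();
        for (String large : largeFile) {
            for (String small : chunk) {
                output.add("(" + large + "," + small + ")");
            }
        }
        return output;
    }

    // The full product has |A| * |B| pairs; multiplyExact throws on overflow.
    static long outputPairs(long recordsInA, long recordsInB) {
        return Math.multiplyExact(recordsInA, recordsInB);
    }

    public static void main(String[] args) {
        List<String> large = Arrays.asList("a", "b", "c");
        List<String> small = Arrays.asList("d", "e", "f");
        // One simulated job per chunk of the smaller dataset.
        for (List<String> piece : chunk(small, 2)) {
            System.out.println(runJob(large, piece));
        }
        System.out.println(outputPairs(1_000_000L, 1_000_000L) + " pairs");
    }
}
```

The union of the per-chunk outputs is the same set of pairs as the
single-job version, only in a different order. The last line shows why the
size matters: two inputs of a million records each already give 10^12
output pairs.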

-Todd

On Wed, Dec 16, 2009 at 5:35 AM, Eguzki Astiz Lezaun <eguzki@tid.es> wrote:

> Hi,
>
> First, I would like to apologise if this question has been asked before (I
> am quite sure it has been) and I would appreciate very much if someone
> replies with a link to the answer.
>
> My question is quite simple.
>
> I have two files or datasets, each containing a list of integers.
>
> example:
> dataset A: (a,b,c)
> dataset B: (d,e,f)
>
> I would like to design a map-reduce job to have at the output:
>
> (a,d)
> (a,e)
> (a,f)
> (b,d)
> (b,e)
> (b,f)
> (c,d)
> (c,e)
> (c,f)
>
> I guess this is a typical Cartesian product of two datasets.
>
> I found ways to do joins using map-reduce, but a common key is required on
> both datasets. This is not the case.
>
> Any clue how to do this?
>
> Thanks in advance.
> --
> Eguzki Astiz Lezaun
> Technology and Architecture Strategy
> C\ VIA AUGUSTA, 177     Tel: +34 93 36 53179
> 08021 BARCELONA         www.tid.es
>
> Telefónica Investigación y Desarrollo
> EKO     Do you need to print it? We protect the environment.
>
>

--000e0cd28d849b6fc4047adb526d--