hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: map reduce to achieve cartesian product
Date Wed, 16 Dec 2009 16:51:29 GMT
Hi Eguzki,

Is one of the tables vastly smaller than the other? If one is small enough
to fit in RAM, you can do this like so:

1. Add the small file to the DistributedCache
2. In the configure() method of the mapper, read the entire file into an
ArrayList or some such structure in RAM
3. Set the input path of the MR job to be the large file. Use no reducers
4. In the map function, simply iterate over the ArrayList and output each pair

If the small file doesn't fit in RAM, you could split it into chunks first,
and then run one MR job per chunk.
Presumably, though, one of the two files is small; if they're both big
you're going to have a very, very big output!
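The map-side logic in those steps can be sketched in plain Java, with the Hadoop plumbing stripped out (the class and method names below are illustrative only, not part of any Hadoop API): the small dataset plays the role of the list cached in configure(), and each element of the large dataset stands in for one record passed to map().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrossProduct {

    // In a real job, "small" would be loaded from the DistributedCache in
    // the mapper's configure() method, and each element of "large" would
    // arrive as one input record via map(). Here both are plain lists.
    static List<String> cross(List<String> small, List<String> large) {
        List<String> out = new ArrayList<>();
        for (String record : large) {        // one map() call per record
            for (String cached : small) {    // iterate the in-memory list
                out.add("(" + record + "," + cached + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Datasets from the original question: A is the "large" MR input,
        // B is the "small" cached file.
        List<String> a = Arrays.asList("a", "b", "c");
        List<String> b = Arrays.asList("d", "e", "f");
        for (String pair : cross(b, a)) {
            System.out.println(pair);   // (a,d) (a,e) (a,f) (b,d) ... (c,f)
        }
    }
}
```

With no reducers configured, each mapper writes its pairs straight to the output, so the job's output is exactly the cartesian product.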


On Wed, Dec 16, 2009 at 5:35 AM, Eguzki Astiz Lezaun <eguzki@tid.es> wrote:

> Hi,
> First, I would like to apologise if this question has been asked before (I
> am quite sure it has been) and I would appreciate very much if someone
> replies with a link to the answer.
> My question is quite simple.
> I have to files or datasets having a list of integers.
> example:
> dataset A: (a,b,c)
> dataset B: (d,e,f)
> I would like to design a map-reduce job to produce as output:
> (a,d)
> (a,e)
> (a,f)
> (b,d)
> (b,e)
> (b,f)
> (c,d)
> (c,e)
> (c,f)
> I guess this is a typical cartesian product of two datasets.
> I found ways to do joins using map-reduce, but a common key is required on
> both datasets. That is not the case here.
> Any clue how to do this?
> Thanks in advance.
> --
> Eguzki Astiz Lezaun
> Technology and Architecture Strategy
> C\ VIA AUGUSTA, 177     Tel: +34 93 36 53179
> 08021 BARCELONA         www.tid.es
> Telefónica Investigación y Desarrollo
> EKO     Do you need to print it? We protect the environment.
