hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: map reduce to achieve cartesian product
Date Wed, 16 Dec 2009 17:46:30 GMT
On Wed, Dec 16, 2009 at 12:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
> Hi Eguzki,
>
> I wouldn't say the size of the list fitting into RAM would be the
> scalability bottleneck. If you're doing a full Cartesian join of your users
> against a larger table, the fact that you're doing the full Cartesian join
> is going to be the bottleneck first :)
>
> -Todd
>
> On Wed, Dec 16, 2009 at 9:24 AM, Eguzki Astiz Lezaun <eguzki@tid.es> wrote:
>
>> Thanks Todd,
>>
>> That was my plan B or workaround. Anyway, I am glad to see there is no
>> straightforward way to do it that I might have missed.
>>
>> The "small" list is a list of userId (dim table), so I can assume it as
>> "small" but that can be a limitation in the scalability of our system. I
>> will test the upper limits.
>>
>> Thanks a lot.
>>
>> Eguzki
>>
>> Todd Lipcon wrote:
>>
>>> Hi Eguzki,
>>>
>>> Is one of the tables vastly smaller than the other? If one is small enough
>>> to fit in RAM, you can do this like so:
>>>
>>> 1. Add the small file to the DistributedCache
>>> 2. In the configure() method of the mapper, read the entire file into an
>>> ArrayList or somesuch in RAM
>>> 3. Set the input path of the MR job to be the large file. Use no reducers.
>>> 4. In the map function, simply iterate over the ArrayList and output each
>>> pair. (A sketch of this recipe follows below.)
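
A minimal sketch of that recipe against the old mapred API (untested, and
the class name is a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CrossJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<String> smallSide = new ArrayList<String>();

  // Step 2: read the file shipped via DistributedCache into RAM,
  // once per task.
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in =
          new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        smallSide.add(line);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load cached file", e);
    }
  }

  // Step 4: pair every record of the big input with every cached record.
  public void map(LongWritable offset, Text record,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    for (String s : smallSide) {
      out.collect(record, new Text(s));
    }
  }
}

Steps 1 and 3 happen in the driver: DistributedCache.addCacheFile() for the
small file and setNumReduceTasks(0) for the map-only part.
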
>>>
>>> If the small file doesn't fit in RAM, you could split it into chunks
>>> first, and then run one MR job per chunk (a driver loop for this is
>>> sketched below).
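
A rough driver loop for that chunked variant, reusing the mapper sketched
above (the paths and chunk count are made up):

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChunkedCrossJoin {
  public static void main(String[] args) throws Exception {
    int numChunks = 4;  // however many pieces the small file was cut into
    for (int i = 0; i < numChunks; i++) {
      JobConf conf = new JobConf(ChunkedCrossJoin.class);
      conf.setJobName("cross-join-chunk-" + i);
      conf.setMapperClass(CrossJoinMapper.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      conf.setNumReduceTasks(0);                    // step 3: map-only
      DistributedCache.addCacheFile(
          new URI("/small/chunk-" + i), conf);      // step 1, chunk i
      FileInputFormat.setInputPaths(conf, new Path("/big/input"));
      FileOutputFormat.setOutputPath(conf, new Path("/out/chunk-" + i));
      JobClient.runJob(conf);                       // one job per chunk
    }
  }
}
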
>>> Presumably, though, one of the two files is small - if they're both big
>>> you're going to have a very, very big output!
>>>
>>> -Todd
>>>
>>> On Wed, Dec 16, 2009 at 5:35 AM, Eguzki Astiz Lezaun <eguzki@tid.es>
>>> wrote:
>>>
>>>
>>>
>>>> Hi,
>>>>
>>>> First, I would like to apologise if this question has been asked before
>>>> (I am quite sure it has been); I would very much appreciate it if someone
>>>> replied with a link to the answer.
>>>>
>>>> My question is quite simple.
>>>>
>>>> I have two files or datasets, each holding a list of integers.
>>>>
>>>> example:
>>>> dataset A: (a,b,c)
>>>> dataset B: (d,e,f)
>>>>
>>>> I would like to design a map-reduce job to have at the output:
>>>>
>>>> (a,d)
>>>> (a,e)
>>>> (a,f)
>>>> (b,d)
>>>> (b,e)
>>>> (b,f)
>>>> (c,d)
>>>> (c,e)
>>>> (c,f)
>>>>
>>>> I guess this is a typical Cartesian product of two datasets.
>>>>
>>>> I found ways to do joins using map-reduce, but a common key is required
>>>> on both datasets. That is not the case here.
>>>>
>>>> Any clue how to do this?
>>>>
>>>> Thanks in advance.
>>>> --
>>>> Eguzki Astiz Lezaun
>>>> Technology and Architecture Strategy
>>>> C\ VIA AUGUSTA, 177     Tel: +34 93 36 53179
>>>> 08021 BARCELONA         www.tid.es
>>>>
>>>> Telefónica Investigación y Desarrollo
>>>> EKO     Do you need to print it? We protect the environment.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Eguzki Astiz Lezaun
>> Technology and Architecture Strategy
>> C\ VIA AUGUSTA, 177     Tel: +34 93 36 53179
>> 08021 BARCELONA         www.tid.es
>>
>> Telefónica Investigación y Desarrollo
>> EKO     Do you need to print it? We protect the environment.
>>
>>
>

Cartesian products and normalized data do not normally go together. One
exception is when you want something like this:

        index.htm   links.htm   hadoop.htm
ed          0           1           2
bob         2           0           1
jon         0           0           0

In this case you want a CP of users and pages, with a count for each cell.
Great report. But do you really want a report that is 1,000,000 rows wide or
long? You have to restrict on something. Also, the only thing the CP buys you
is the 0 counts, and you can logically assume a count is 0 whenever a pair
does not show up in the result of the inner join.
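
A toy sketch of that zero-filling, with the counts from the table above
(plain Java, class name made up):

import java.util.HashMap;
import java.util.Map;

public class ZeroFilledReport {
  public static void main(String[] args) {
    String[] users = {"ed", "bob", "jon"};
    String[] pages = {"index.htm", "links.htm", "hadoop.htm"};

    // Pretend these pairs are the result of the inner join:
    // only the non-zero cells exist.
    Map<String, Integer> hits = new HashMap<String, Integer>();
    hits.put("ed\tlinks.htm", 1);
    hits.put("ed\thadoop.htm", 2);
    hits.put("bob\tindex.htm", 2);
    hits.put("bob\thadoop.htm", 1);

    // The CP supplies every (user, page) cell, so any cell missing
    // from the join result can safely be reported as 0.
    for (String u : users) {
      for (String p : pages) {
        Integer c = hits.get(u + "\t" + p);
        System.out.println(u + "\t" + p + "\t" + (c == null ? 0 : c));
      }
    }
  }
}
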

Though I have never tried it, a map-only job could easily generate in
parallel the CP of x={1..90000}, y={1..3434343}; there is just no optimal
way to join that against an already existing large data set. (Unless I am
missing something, which happens quite often.)
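
For what it's worth, here is how such a generator could look against the old
mapred API (untested sketch; the conf key and class name are invented, and
the input would be a text file with one x value per line, split across
mappers):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each map call takes one x value and emits (x, y) for the whole y
// range, so the product is generated in parallel across the splits.
public class RangeProductMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private long yMax;

  public void configure(JobConf job) {
    // "range.product.ymax" is an invented key; set it in the driver.
    yMax = job.getLong("range.product.ymax", 3434343L);
  }

  public void map(LongWritable offset, Text x,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    for (long y = 1; y <= yMax; y++) {
      out.collect(x, new Text(Long.toString(y)));
    }
  }
}

Run it with zero reduce tasks and the output is the full CP; the join against
an already existing large data set is the part with no obvious shortcut.
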
