From: Edward Capriolo
To: common-user@hadoop.apache.org
Subject: Re: map reduce to achieve cartesian product
Date: Wed, 16 Dec 2009 12:46:30 -0500

On Wed, Dec 16, 2009 at 12:29 PM, Todd Lipcon wrote:
> Hi Eguzki,
>
> I wouldn't say the size of the list fitting into RAM would be the
> scalability bottleneck. If you're doing a full cartesian join of your users
> against a larger table, the fact that you're doing the full cartesian join
> is going to be the bottleneck first :)
>
> -Todd
>
> On Wed, Dec 16, 2009 at 9:24 AM, Eguzki Astiz Lezaun wrote:
>
>> Thanks Todd,
>>
>> That was my plan B or workaround. Anyway, I am happy to see there is no
>> straightforward way to do this that I could have missed.
>>
>> The "small" list is a list of userIds (dim table), so I can assume it is
>> "small", but that could be a limit on the scalability of our system. I
>> will test the upper limits.
>>
>> Thanks a lot.
>>
>> Eguzki
>>
>> Todd Lipcon escribió:
>>
>>> Hi Eguzki,
>>>
>>> Is one of the tables vastly smaller than the other? If one is small enough
>>> to fit in RAM, you can do this like so:
>>>
>>> 1. Add the small file to the DistributedCache
>>> 2. In the configure() method of the mapper, read the entire file into an
>>> ArrayList or some such in RAM
>>> 3. Set the input path of the MR job to be the large file. Use no reduces
>>> 4. In the map function, simply iterate over the ArrayList and output each
>>> pair.
>>>
>>> If the small file doesn't fit in RAM, you could split it into chunks
>>> first, and then run one MR job per chunk.
>>> Presumably, though, one of the two files is small - if they're both big
>>> you're going to have a very, very big output!
>>>
>>> -Todd
>>>
>>> On Wed, Dec 16, 2009 at 5:35 AM, Eguzki Astiz Lezaun wrote:
>>>
>>>> Hi,
>>>>
>>>> First, I would like to apologise if this question has been asked before
>>>> (I am quite sure it has been), and I would appreciate it very much if
>>>> someone replies with a link to the answer.
>>>>
>>>> My question is quite simple.
>>>>
>>>> I have two files or datasets, each holding a list of integers.
>>>>
>>>> Example:
>>>> dataset A: (a,b,c)
>>>> dataset B: (d,e,f)
>>>>
>>>> I would like to design a map-reduce job that produces the output:
>>>>
>>>> (a,d)
>>>> (a,e)
>>>> (a,f)
>>>> (b,d)
>>>> (b,e)
>>>> (b,f)
>>>> (c,d)
>>>> (c,e)
>>>> (c,f)
>>>>
>>>> I guess this is a typical cartesian product of two datasets.
>>>>
>>>> I found ways to do joins using map-reduce, but a common key is required
>>>> on both datasets. This is not the case here.
>>>>
>>>> Any clue how to do this?
>>>>
>>>> Thanks in advance.
>>>> --
>>>> Eguzki Astiz Lezaun
>>>> Technology and Architecture Strategy
>>>> C\ VIA AUGUSTA, 177     Tel: +34 93 36 53179
>>>> 08021 BARCELONA         www.tid.es
>>>>
>>>> Telefónica Investigación y Desarrollo
>>>> EKO     Do you need to print it? We protect the environment.
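
[Editor's sketch, not from the thread: Todd's four-step recipe boils down to a nested loop in the mapper, with dataset B held in memory. The Hadoop plumbing (DistributedCache setup, InputFormat, OutputCollector) is omitted and all class and method names here are illustrative.]

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the map-side cross join. In a real job, smallSide
// would be loaded from the DistributedCache in the mapper's
// configure() method (steps 1-2), and map() would be called by the
// framework once per record of the large input file (step 3).
public class CrossJoinSketch {

    // Stand-in for the map() body (step 4): emits one pair per entry
    // of the in-memory small-side list.
    static List<String> map(String bigRecord, List<String> smallSide) {
        List<String> out = new ArrayList<>();
        for (String s : smallSide) {
            out.add("(" + bigRecord + "," + s + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> smallSide = List.of("d", "e", "f"); // dataset B
        // The MR input is dataset A; no reduce phase is needed.
        for (String a : List.of("a", "b", "c")) {
            for (String pair : map(a, smallSide)) {
                System.out.println(pair);
            }
        }
    }
}
```

Run standalone, this prints the nine pairs from the original question, (a,d) through (c,f).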
Cartesian product and normalized data do not normally go together. One exception is when you want something like this:

        index.htm  links.htm  hadoop.htm
ed      0          1          2
bob     2          0          1
jon     0          0          0

In this case you want a CP of users and pages, with a count for each pair. Great report. But do you want a report 1,000,000 rows wide or long? You have to restrict on something. Also, the CP only buys you the zero-count rows, and you can logically assume a count is 0 whenever the pair does not appear in the result of the inner join.

Though I have never tried it, a map-only job could easily generate in parallel a CP data set like x={1,90000}, y={1,3434343}, but there is no optimal way to join that against an already existing large data set. (Unless I am missing something (which happens quite often).)
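
[Editor's sketch, not from the thread: Edward's map-only generation idea can be pictured as each map task owning a slice of the x range and emitting every (x, y) pair in that slice, so the full product is produced in parallel with no shuffle or reduce. The class name, method name, and the tiny ranges below are illustrative stand-ins for x={1,90000}, y={1,3434343}.]

```java
import java.util.ArrayList;
import java.util.List;

// One map task's share of the cartesian product {1..xMax} x {1..yMax}:
// the task owns the x sub-range [xStart, xEnd] and emits every pair in it.
public class CpSliceSketch {

    static List<long[]> emitSlice(long xStart, long xEnd, long yMax) {
        List<long[]> out = new ArrayList<>();
        for (long x = xStart; x <= xEnd; x++) {
            for (long y = 1; y <= yMax; y++) {
                out.add(new long[] { x, y });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny example: this "task" owns x in [1, 2] with yMax = 3,
        // giving 2 * 3 = 6 pairs.
        for (long[] p : emitSlice(1, 2, 3)) {
            System.out.println("(" + p[0] + "," + p[1] + ")");
        }
    }
}
```

Joining such a generated product against an already existing large data set is a different problem, as Edward notes; the sketch only covers the generation side.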