From: Edward Capriolo
To: common-user@hadoop.apache.org
Subject: Re: map reduce to achieve cartesian product
Date: Wed, 16 Dec 2009 12:46:30 -0500

On Wed, Dec 16, 2009 at 12:29 PM, Todd Lipcon wrote:
> Hi Eguzki,
>
> I wouldn't say the size of the list fitting into RAM would be the
> scalability bottleneck. If you're doing a full cartesian join of your users
> against a larger table, the fact that you're doing the full cartesian join
> is going to be the bottleneck first :)
>
> -Todd
>
> On Wed, Dec 16, 2009 at 9:24 AM, Eguzki Astiz Lezaun wrote:
>
>> Thanks Todd,
>>
>> That was my plan B or workaround. Anyway, I am happy to see there is no
>> straightforward way to do this that I could have missed.
>>
>> The "small" list is a list of userIds (dim table), so I can assume it is
>> "small", but that could be a limit on the scalability of our system. I
>> will test the upper limits.
>>
>> Thanks a lot.
>>
>> Eguzki
>>
>> Todd Lipcon escribió:
>>
>>> Hi Eguzki,
>>>
>>> Is one of the tables vastly smaller than the other? If one is small enough
>>> to fit in RAM, you can do this like so:
>>>
>>> 1. Add the small file to the DistributedCache
>>> 2. In the configure() method of the mapper, read the entire file into an
>>> ArrayList or some such in RAM
>>> 3. Set the input path of the MR job to be the large file. Use no reduces
>>> 4. In the map function, simply iterate over the ArrayList and output each
>>> pair.
>>>
>>> If the small file doesn't fit in RAM, you could split it into chunks
>>> first, and then run one MR job per chunk.
>>> Presumably, though, one of the two files is small - if they're both big
>>> you're going to have a very, very big output!
>>>
>>> -Todd
>>>
>>> On Wed, Dec 16, 2009 at 5:35 AM, Eguzki Astiz Lezaun wrote:
>>>
>>>> Hi,
>>>>
>>>> First, I would like to apologise if this question has been asked before
>>>> (I am quite sure it has been), and I would appreciate it very much if
>>>> someone replies with a link to the answer.
>>>>
>>>> My question is quite simple.
>>>>
>>>> I have two files or datasets, each holding a list of integers.
>>>>
>>>> Example:
>>>> dataset A: (a,b,c)
>>>> dataset B: (d,e,f)
>>>>
>>>> I would like to design a map-reduce job that produces the output:
>>>>
>>>> (a,d)
>>>> (a,e)
>>>> (a,f)
>>>> (b,d)
>>>> (b,e)
>>>> (b,f)
>>>> (c,d)
>>>> (c,e)
>>>> (c,f)
>>>>
>>>> I guess this is a typical cartesian product of two datasets.
>>>>
>>>> I found ways to do joins using map-reduce, but a common key is required
>>>> on both datasets. This is not the case here.
>>>>
>>>> Any clue how to do this?
>>>>
>>>> Thanks in advance.
>>>> --
>>>> Eguzki Astiz Lezaun
>>>> Technology and Architecture Strategy
>>>> C\ VIA AUGUSTA, 177     Tel: +34 93 36 53179
>>>> 08021 BARCELONA         www.tid.es
>>>>
>>>> Telefónica Investigación y Desarrollo
>>>> EKO     Do you need to print it? We protect the environment.
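
[Editor's sketch, not from the thread: Todd's four-step recipe boils down to a nested loop in the mapper, with dataset B held in memory. The Hadoop plumbing (DistributedCache setup, InputFormat, OutputCollector) is omitted and all class and method names here are illustrative.]

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the map-side cross join. In a real job, smallSide
// would be loaded from the DistributedCache in the mapper's
// configure() method (steps 1-2), and map() would be called by the
// framework once per record of the large input file (step 3).
public class CrossJoinSketch {

    // Stand-in for the map() body (step 4): emits one pair per entry
    // of the in-memory small-side list.
    static List<String> map(String bigRecord, List<String> smallSide) {
        List<String> out = new ArrayList<>();
        for (String s : smallSide) {
            out.add("(" + bigRecord + "," + s + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> smallSide = List.of("d", "e", "f"); // dataset B
        // The MR input is dataset A; no reduce phase is needed.
        for (String a : List.of("a", "b", "c")) {
            for (String pair : map(a, smallSide)) {
                System.out.println(pair);
            }
        }
    }
}
```

Run standalone, this prints the nine pairs from the original question, (a,d) through (c,f).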
Cartesian product and normalized data do not normally go together. One exception is when you want something like this:

        index.htm  links.htm  hadoop.htm
ed      0          1          2
bob     2          0          1
jon     0          0          0

In this case you want a CP of users and pages, with a count for each pair. Great report. But do you want a report 1,000,000 rows wide or long? You have to restrict on something. Also, the CP only buys you the zero-count rows, and you can logically assume a count is 0 whenever the pair does not appear in the result of the inner join.

Though I have never tried it, a map-only job could easily generate in parallel a CP data set like x={1,90000}, y={1,3434343}, but there is no optimal way to join that against an already existing large data set. (Unless I am missing something (which happens quite often).)
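
[Editor's sketch, not from the thread: Edward's map-only generation idea can be pictured as each map task owning a slice of the x range and emitting every (x, y) pair in that slice, so the full product is produced in parallel with no shuffle or reduce. The class name, method name, and the tiny ranges below are illustrative stand-ins for x={1,90000}, y={1,3434343}.]

```java
import java.util.ArrayList;
import java.util.List;

// One map task's share of the cartesian product {1..xMax} x {1..yMax}:
// the task owns the x sub-range [xStart, xEnd] and emits every pair in it.
public class CpSliceSketch {

    static List<long[]> emitSlice(long xStart, long xEnd, long yMax) {
        List<long[]> out = new ArrayList<>();
        for (long x = xStart; x <= xEnd; x++) {
            for (long y = 1; y <= yMax; y++) {
                out.add(new long[] { x, y });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny example: this "task" owns x in [1, 2] with yMax = 3,
        // giving 2 * 3 = 6 pairs.
        for (long[] p : emitSlice(1, 2, 3)) {
            System.out.println("(" + p[0] + "," + p[1] + ")");
        }
    }
}
```

Joining such a generated product against an already existing large data set is a different problem, as Edward notes; the sketch only covers the generation side.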