From: Ajay Srivastava <Ajay.Srivastava@guavus.com>
To: user@hadoop.apache.org
Subject: Re: Cartesian product in hadoop
Date: Thu, 18 Apr 2013 11:45:44 +0000
Yes, that's a crucial part.

Write a class that extends WritableComparator and overrides the compare method.
You need to set this class on the job:
job.setGroupingComparatorClass(YourGroupingComparator.class);

This will make sure that records having the same Ki are grouped together and go to the same reduce call.
I forgot to mention in my previous post that you also need a partitioner, which partitions data on the first part of the key.
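Hadoop's real API here is Java (Partitioner and WritableComparator), but what the two classes accomplish during the shuffle can be sketched in a few lines of Python. This is a single-process simulation, not Hadoop code; the record contents and function names are illustrative only:

```python
# Simulation of what the partitioner and grouping comparator achieve.
# Keys are composite: (Ki, dataset_tag).

def partition(key, num_reducers):
    """Partition on the first part of the key (Ki) only, so every record
    with the same Ki lands on the same reducer regardless of its tag."""
    ki, _dataset_tag = key
    return hash(ki) % num_reducers

def group(sorted_records):
    """Group records whose keys compare equal on Ki alone, mimicking the
    grouping comparator: one reduce call per Ki, both datasets together."""
    groups = {}
    for (ki, _tag), value in sorted_records:
        groups.setdefault(ki, []).append((_tag, value))
    return groups

records = [(("K1", "DATASET1"), "R1"), (("K1", "DATASET2"), "S1"),
           (("K2", "DATASET1"), "R2"), (("K2", "DATASET2"), "S1")]
grouped = group(sorted(records))
# grouped["K1"] now holds the dataset1 record and all dataset2 records for K1.
```

Sorting on the full composite key while grouping only on Ki is the point: within each reduce call the dataset1 record arrives before the dataset2 records, because "DATASET1" sorts before "DATASET2".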


Regards,
Ajay Srivastava


On 18-Apr-2013, at 4:42 PM, zheyi rong wrote:

Hi Ajay Srivastava,

Thank you for your reply.

Could you please explain a little bit more on "Write a grouping comparator which groups records on the first part of the key, i.e. Ki"?
I guess it is a crucial part, which could filter some pairs before passing them to the reducer.


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <Ajay.Srivastava@guavus.com> wrote:
Hi Rong,
You can use the following simple method.

Let's say dataset1 has m records; when you emit these records from the mapper, the keys are K1, K2, …, Km for the respective records. Also add an identifier that tells which dataset each record is emitted from.
So if R1 is a record in dataset1, the mapper will emit key (K1, DATASET1) and value R1.

For dataset2, which has n records, emit m copies of each record, with keys K1, K2, …, Km and identifier DATASET2.
So if R1' is a record from dataset2, emit m records with key (Ki, DATASET2) and value R1', where i runs from 1 to m.


Write a grouping comparator which groups records on the first part of the key, i.e. Ki.

In the reducer, each reduce call will see one record from dataset1 and n records from dataset2. Compute the Cartesian product, apply the filter, and then output.


Note -- you may not know the keys (K1, K2, …, Km) beforehand. If so, you need one more pass over dataset1 to collect the keys and store them for use with dataset2.
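The whole scheme above can be sketched as a single-process Python simulation of the map, shuffle, and reduce phases. The real implementation would be Java MapReduce; the toy data and the `keep` filter here are assumptions for illustration only:

```python
def map_phase(dataset1, dataset2):
    """Emit (key, value) pairs: each dataset1 record under its own key Ki;
    each dataset2 record replicated once per Ki (m copies)."""
    keys = ["K%d" % (i + 1) for i in range(len(dataset1))]
    emitted = []
    for ki, r in zip(keys, dataset1):
        emitted.append(((ki, "DATASET1"), r))
    for s in dataset2:                 # n records, each emitted m times
        for ki in keys:
            emitted.append(((ki, "DATASET2"), s))
    return emitted

def reduce_phase(emitted, keep):
    """Group by Ki (the grouping comparator's job), then pair the single
    dataset1 record with every dataset2 record and apply the filter."""
    groups = {}
    for (ki, tag), v in sorted(emitted):
        groups.setdefault(ki, []).append((tag, v))
    out = []
    for ki, recs in groups.items():
        d1 = [v for tag, v in recs if tag == "DATASET1"]
        d2 = [v for tag, v in recs if tag == "DATASET2"]
        for r in d1:                   # exactly one dataset1 record per group
            for s in d2:
                if keep(r, s):
                    out.append((r, s))
    return out

# Toy datasets (m=3, n=2) and a toy filter, purely for illustration:
pairs = reduce_phase(map_phase([1, 2, 3], [10, 20]),
                     keep=lambda r, s: r + s > 21)
```

Note the cost this scheme carries: dataset2 is replicated m times across the shuffle, which is exactly the volume concern Azuryy Yu raises below for 20-million-line inputs.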


Regards,
Ajay Srivastava


On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:

This is not suitable for his large dataset.

--Sent from my Sony mobile.

On Apr 18, 2013 5:58 PM, "Jagat Singh" <jagatsingh@gmail.com> wrote:
Hi,

Can you have a look at http://pig.apache.org/docs/r0.11.1/basic.html#cross

Thanks


On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zheyi.rong@gmail.com> wrote:
Dear all, 

I am writing to kindly ask for ideas of doing Cartesian product in Hadoop.
Specifically, I have two datasets, each of which contains 20 million lines.
I want to do the Cartesian product of these two datasets, comparing lines pairwise.

The output of each comparison can be mostly filtered by a function (we do not output the whole result of this Cartesian product, but only a small part).

I guess one good way is to pass one block from dataset1 and another block from dataset2 to a mapper, then let the mapper do the product in memory to avoid IO.
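The block-wise idea can be sketched with itertools.product: each mapper task would hold one block of each dataset in memory and emit only the pairs that survive the filter. The block contents and the `keep` filter below are made-up placeholders, not anything from this thread:

```python
from itertools import product

def block_cross(block1, block2, keep):
    """In-memory Cartesian product of two blocks, filtered on the fly so
    only the surviving pairs are ever materialized."""
    return [(a, b) for a, b in product(block1, block2) if keep(a, b)]

# Toy blocks and a toy filter for illustration:
survivors = block_cross(range(3), range(3), keep=lambda a, b: a == b)
```

The trade-off to keep in mind: with m and n lines split into blocks of B lines each, the job needs (m/B) x (n/B) block pairs, so a larger B means fewer tasks but more memory per mapper.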

Any suggestions? 
Thank you very much.

Regards,
Zheyi Rong



