hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ajay Srivastava <Ajay.Srivast...@guavus.com>
Subject Re: Cartesian product in hadoop
Date Thu, 18 Apr 2013 10:50:21 GMT
Hi Rong,
You can use following simple method.

Lets say dataset1 has m records and when you emit these records from mapper, keys are K1,K2
….., Km for each respective record. Also add an identifier to identify dataset from where
records is being emitted.
So if R1 is a record in dataset1, the mapper will emit key as (K1, DATASET1) and value as
R1.

For dataset2 having n records, emit m records for each record with keys K1, K2, …., Km and
identifier as DATASET2.
So if R1' is a record from dataset2, emit m records with key as  (Ki, DATASET2) and value
R1' where i is from 1 to m.


Write a grouping comparator which group records on first part of key i.e. Ki.

In reducer, for each iteration of reduce there will be one record from dataset1 and n records
from dataset2. Get the cartesian product, apply filter and then output.


Note -- You may not know keys (K1, K2, … , Km) before hand. If yes, then you need one more
pass of dataset1 to identify the keys and store it to use for dataset2.


Regards,
Ajay Srivastava


On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:


This is not suitable for his large dataset.

--Send from my Sony mobile.

On Apr 18, 2013 5:58 PM, "Jagat Singh" <jagatsingh@gmail.com<mailto:jagatsingh@gmail.com>>
wrote:
Hi,

Can you have a look at

http://pig.apache.org/docs/r0.11.1/basic.html#cross

Thanks


On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zheyi.rong@gmail.com<mailto:zheyi.rong@gmail.com>>
wrote:
Dear all,

I am writing to kindly ask for ideas of doing cartesian product in hadoop.
Specifically, now I have two datasets, each of which contains 20million lines.
I want to do cartesian product on these two datasets, comparing lines pairwisely.

The output of each comparison can be mostly filtered by a function ( we do not output the
whole result of this cartesian product, but only a small part).

I guess one good way is to pass one block from dataset1 and another block from dataset2
to a mapper, then let the mappers do the product in memory to avoid IO.

Any suggestions?
Thank you very much.

Regards,
Zheyi Rong



Mime
View raw message