From: Ajay Srivastava <Ajay.Srivastava@guavus.com>
To: user@hadoop.apache.org
Subject: Re: Cartesian product in hadoop
Date: Thu, 18 Apr 2013 11:45:44 +0000
Yes, that's a crucial part.

Write a class that extends WritableComparator and overrides the compare method.
You need to set this class on the job:
job.setGroupingComparatorClass(YourGroupingComparator.class);

This will make sure that records having the same Ki are grouped together and go to the same reduce call.
I forgot to mention in my previous post that you also need a partitioner, which partitions data on the first part of the key.
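Hadoop's real API here is Java (Partitioner and WritableComparator), but what the two classes accomplish during the shuffle can be sketched in a few lines of Python. This is a single-process simulation, not Hadoop code; the record contents and function names are illustrative only:

```python
# Simulation of what the partitioner and grouping comparator achieve.
# Keys are composite: (Ki, dataset_tag).

def partition(key, num_reducers):
    """Partition on the first part of the key (Ki) only, so every record
    with the same Ki lands on the same reducer regardless of its tag."""
    ki, _dataset_tag = key
    return hash(ki) % num_reducers

def group(sorted_records):
    """Group records whose keys compare equal on Ki alone, mimicking the
    grouping comparator: one reduce call per Ki, both datasets together."""
    groups = {}
    for (ki, _tag), value in sorted_records:
        groups.setdefault(ki, []).append((_tag, value))
    return groups

records = [(("K1", "DATASET1"), "R1"), (("K1", "DATASET2"), "S1"),
           (("K2", "DATASET1"), "R2"), (("K2", "DATASET2"), "S1")]
grouped = group(sorted(records))
# grouped["K1"] now holds the dataset1 record and all dataset2 records for K1.
```

Sorting on the full composite key while grouping only on Ki is the point: within each reduce call the dataset1 record arrives before the dataset2 records, because "DATASET1" sorts before "DATASET2".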


Regards,
Ajay Srivastava


On 18-Apr-2013, at 4:42 PM, zheyi rong wrote:

Hi Ajay Srivastava,

Thank you for your reply.

Could you please explain a little bit more on "Write a grouping comparator which groups records on the first part of the key, i.e. Ki"?
I guess it is a crucial part, which could filter some pairs before passing them to the reducer.


Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <Ajay.Srivastava@guavus.com> wrote:
Hi Rong,
You can use the following simple method.

Let's say dataset1 has m records; when you emit these records from the mapper, the keys are K1, K2, …, Km for the respective records. Also add an identifier that tells which dataset each record is emitted from.
So if R1 is a record in dataset1, the mapper will emit key (K1, DATASET1) and value R1.

For dataset2, which has n records, emit m copies of each record, with keys K1, K2, …, Km and identifier DATASET2.
So if R1' is a record from dataset2, emit m records with key (Ki, DATASET2) and value R1', where i runs from 1 to m.


Write a grouping comparator which groups records on the first part of the key, i.e. Ki.

In the reducer, each reduce call will see one record from dataset1 and n records from dataset2. Compute the Cartesian product, apply the filter, and then output.


Note -- you may not know the keys (K1, K2, …, Km) beforehand. If so, you need one more pass over dataset1 to collect the keys and store them for use with dataset2.
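The whole scheme above can be sketched as a single-process Python simulation of the map, shuffle, and reduce phases. The real implementation would be Java MapReduce; the toy data and the `keep` filter here are assumptions for illustration only:

```python
def map_phase(dataset1, dataset2):
    """Emit (key, value) pairs: each dataset1 record under its own key Ki;
    each dataset2 record replicated once per Ki (m copies)."""
    keys = ["K%d" % (i + 1) for i in range(len(dataset1))]
    emitted = []
    for ki, r in zip(keys, dataset1):
        emitted.append(((ki, "DATASET1"), r))
    for s in dataset2:                 # n records, each emitted m times
        for ki in keys:
            emitted.append(((ki, "DATASET2"), s))
    return emitted

def reduce_phase(emitted, keep):
    """Group by Ki (the grouping comparator's job), then pair the single
    dataset1 record with every dataset2 record and apply the filter."""
    groups = {}
    for (ki, tag), v in sorted(emitted):
        groups.setdefault(ki, []).append((tag, v))
    out = []
    for ki, recs in groups.items():
        d1 = [v for tag, v in recs if tag == "DATASET1"]
        d2 = [v for tag, v in recs if tag == "DATASET2"]
        for r in d1:                   # exactly one dataset1 record per group
            for s in d2:
                if keep(r, s):
                    out.append((r, s))
    return out

# Toy datasets (m=3, n=2) and a toy filter, purely for illustration:
pairs = reduce_phase(map_phase([1, 2, 3], [10, 20]),
                     keep=lambda r, s: r + s > 21)
```

Note the cost this scheme carries: dataset2 is replicated m times across the shuffle, which is exactly the volume concern Azuryy Yu raises below for 20-million-line inputs.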


Regards,
Ajay Srivastava


On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:

This is not suitable for his large dataset.

--Sent from my Sony mobile.

On Apr 18, 2013 5:58 PM, "Jagat Singh" <jagatsingh@gmail.com> wrote:
Hi,

Can you have a look at http://pig.apache.org/docs/r0.11.1/basic.html#cross

Thanks


On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zheyi.rong@gmail.com> wrote:
Dear all, 

I am writing to kindly ask for ideas of doing Cartesian product in Hadoop.
Specifically, I have two datasets, each of which contains 20 million lines.
I want to do the Cartesian product of these two datasets, comparing lines pairwise.

The output of each comparison can be mostly filtered by a function (we do not output the whole result of this Cartesian product, but only a small part).

I guess one good way is to pass one block from dataset1 and another block from dataset2 to a mapper, then let the mapper do the product in memory to avoid IO.
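The block-wise idea can be sketched with itertools.product: each mapper task would hold one block of each dataset in memory and emit only the pairs that survive the filter. The block contents and the `keep` filter below are made-up placeholders, not anything from this thread:

```python
from itertools import product

def block_cross(block1, block2, keep):
    """In-memory Cartesian product of two blocks, filtered on the fly so
    only the surviving pairs are ever materialized."""
    return [(a, b) for a, b in product(block1, block2) if keep(a, b)]

# Toy blocks and a toy filter for illustration:
survivors = block_cross(range(3), range(3), keep=lambda a, b: a == b)
```

The trade-off to keep in mind: with m and n lines split into blocks of B lines each, the job needs (m/B) x (n/B) block pairs, so a larger B means fewer tasks but more memory per mapper.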

Any suggestions? 
Thank you very much.

Regards,
Zheyi Rong



