hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From zheyi rong <zheyi.r...@gmail.com>
Subject Re: Cartesian product in hadoop
Date Fri, 19 Apr 2013 11:02:52 GMT
Hi Ajay Srivastava,

thank you for your explanation.



Regards,
Zheyi Rong


On Thu, Apr 18, 2013 at 5:18 PM, Ajay Srivastava <Ajay.Srivastava@guavus.com
> wrote:

>
>  The approach which I proposed will have m+n i/o for reading datasets not
> the (m + n + m*n) and but further i/o due to spills and reading mapper
> output by reducer will be more as number of tuples coming out of mapper are
> ( m + m * n).
>
>
>  Regards,
> Ajay Srivastava
>
>
>  On 18-Apr-2013, at 5:40 PM, zheyi rong wrote:
>
>  Thank you, now I get your point.
>
>  But I wonder that this approach would be slower than
> implementing a custom InputFormat which, each time, provides a pair of
> lines to mappers; then doing the product in mappers?  (in
>
>  Since your approach would need (m + n + m*n) I/O in mapper side, and
> (2*m*n) IO in reducer side;
> while with implementing a custom InputFormat, the I/O is (m*n).
>
>  I am asking this because I have implemented the custom InputFormat, but
> the running time is still intolerable in our small cluster.
>
>
>  Regards,
> Zheyi Rong
>
>
> On Thu, Apr 18, 2013 at 1:45 PM, Ajay Srivastava <
> Ajay.Srivastava@guavus.com> wrote:
>
>>  Yes, that's a crucial part.
>>
>>  Write a class which extends WritableComparator and override compare
>> method.
>> You need to set this class in job client as -
>> job.setGroupingComparatorClass (Grouping comparator class).
>>
>>  This will make sure that records having same Ki will be grouped
>> together and will go to same iteration of reduce.
>> I forgot to mention in my previous post to write a partitioner too which
>> partitions data based on first part of key.
>>
>>
>>  Regards,
>> Ajay Srivastava
>>
>>
>>  On 18-Apr-2013, at 4:42 PM, zheyi rong wrote:
>>
>>  Hi Ajay Srivastava,
>>
>>  Thank your for your reply.
>>
>>  Could you please explain a little bit more on "Write a grouping
>> comparator which group records on first part of key i.e. Ki."  ?
>> I guess it is a crucial part, which could filter some pairs before
>> passing them to the reducer.
>>
>>
>>  Regards,
>> Zheyi Rong
>>
>>
>> On Thu, Apr 18, 2013 at 12:50 PM, Ajay Srivastava <
>> Ajay.Srivastava@guavus.com> wrote:
>>
>>>  Hi Rong,
>>> You can use following simple method.
>>>
>>>  Lets say dataset1 has m records and when you emit these records from
>>> mapper, keys are K1,K2 ….., Km for each respective record. Also add an
>>> identifier to identify dataset from where records is being emitted.
>>> So if R1 is a record in dataset1, the mapper will emit key as (K1,
>>> DATASET1) and value as R1.
>>>
>>>  For dataset2 having n records, emit m records for each record with
>>> keys K1, K2, …., Km and identifier as DATASET2.
>>> So if R1' is a record from dataset2, emit m records with key as  (Ki,
>>> DATASET2) and value R1' where i is from 1 to m.
>>>
>>>
>>>  Write a grouping comparator which group records on first part of key
>>> i.e. Ki.
>>>
>>>  In reducer, for each iteration of reduce there will be one record from
>>> dataset1 and n records from dataset2. Get the cartesian product, apply
>>> filter and then output.
>>>
>>>
>>>  Note -- You may not know keys (K1, K2, … , Km) before hand. If yes,
>>> then you need one more pass of dataset1 to identify the keys and store it
>>> to use for dataset2.
>>>
>>>
>>>  Regards,
>>> Ajay Srivastava
>>>
>>>
>>>  On 18-Apr-2013, at 3:51 PM, Azuryy Yu wrote:
>>>
>>>  This is not suitable for his large dataset.
>>>
>>> --Send from my Sony mobile.
>>> On Apr 18, 2013 5:58 PM, "Jagat Singh" <jagatsingh@gmail.com> wrote:
>>>
>>>>  Hi,
>>>>
>>>> Can you have a look at
>>>>
>>>> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>>>>
>>>>  Thanks
>>>>
>>>>
>>>> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong <zheyi.rong@gmail.com>wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>>  I am writing to kindly ask for ideas of doing cartesian product in
>>>>> hadoop.
>>>>> Specifically, now I have two datasets, each of which contains
>>>>> 20million lines.
>>>>> I want to do cartesian product on these two datasets, comparing lines
>>>>> pairwisely.
>>>>>
>>>>>  The output of each comparison can be mostly filtered by a function (
>>>>> we do not output the
>>>>> whole result of this cartesian product, but only a small part).
>>>>>
>>>>>  I guess one good way is to pass one block from dataset1 and another
>>>>> block from dataset2
>>>>> to a mapper, then let the mappers do the product in memory to avoid IO.
>>>>>
>>>>>  Any suggestions?
>>>>> Thank you very much.
>>>>>
>>>>>  Regards,
>>>>> Zheyi Rong
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

Mime
View raw message