hadoop-common-user mailing list archives

From: zheyi rong <zheyi.r...@gmail.com>
Subject: Re: Cartesian product in hadoop
Date: Fri, 19 Apr 2013 11:04:49 GMT

Hi Ted Dunning,

Could you please suggest some keywords so that I can search for this myself?

Zheyi Rong

On Thu, Apr 18, 2013 at 8:52 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> It is rarely practical to do exhaustive comparisons on datasets of this
> size.
>
> The method generally used is to heuristically prune the Cartesian product
> set and only examine pairs that have a high likelihood of being near each
> other.
>
> This can be done in many ways. Your suggestion of doing a map-side join
> is a reasonable one, but it will be much slower than methods where you can
> prune the comparisons.
>
> On Thu, Apr 18, 2013 at 9:47 AM, zheyi rong <zheyi.rong@gmail.com> wrote:
>> Dear all,
>>
>> I am writing to ask for ideas on computing a Cartesian product in Hadoop.
>>
>> Specifically, I have two datasets, each of which contains 20 million
>> lines. I want to compute the Cartesian product of these two datasets,
>> comparing lines pairwise.
>>
>> Most of the comparison results can be filtered out by a function (we do
>> not output the whole result of this Cartesian product, only a small part).
>>
>> I guess one good way is to pass one block from dataset1 and another block
>> from dataset2 to a mapper, then let the mappers compute the product in
>> memory to avoid I/O.
>>
>> Any suggestions?
>>
>> Thank you very much.
>>
>> Regards,
>> Zheyi Rong
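
For scale, two datasets of 20 million lines each give 4 × 10^14 pairs if every line is compared with every other. The pruning idea quoted above can be illustrated with a small sketch. It assumes a technique commonly called "blocking": each line is assigned a cheap key, and only lines that share a key are compared. The names `blocking_key` and `is_match`, the three-character key, and the sample data are all hypothetical and chosen purely for illustration. Nothing in the thread says which pruning method or comparison function was actually meant.

```python
from collections import defaultdict
from itertools import product


def blocking_key(line):
    # Hypothetical key: the first three characters, lowercased.
    return line[:3].lower()


def is_match(a, b):
    # Hypothetical comparison: equal length, at most one differing position.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def brute_force_pairs(left, right):
    """Exhaustive comparison: len(left) * len(right) calls to is_match."""
    return [(a, b) for a, b in product(left, right) if is_match(a, b)]


def pruned_pairs(left, right):
    """Compare only lines that fall into the same bucket (same blocking key)."""
    buckets = defaultdict(lambda: ([], []))
    for a in left:
        buckets[blocking_key(a)][0].append(a)
    for b in right:
        buckets[blocking_key(b)][1].append(b)

    matches = []
    compared = 0  # number of pairs actually examined
    for lefts, rights in buckets.values():
        for a, b in product(lefts, rights):
            compared += 1
            if is_match(a, b):
                matches.append((a, b))
    return matches, compared


left = ["hadoop", "mapper", "reduce", "hdfs01"]
right = ["hadooq", "mapped", "reduce", "spark1"]
matches, compared = pruned_pairs(left, right)
print(matches, compared)
```

On this toy input the pruned version examines 3 pairs instead of 16 and finds the same matches as the exhaustive version. That is not guaranteed in general: with this particular key, two lines that differ within their first three characters land in different buckets and are never compared, which is the usual trade-off of a heuristic. In a MapReduce job the same structure would plausibly map onto emitting the key from the mapper and doing the in-bucket comparison in the reducer.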
