hadoop-mapreduce-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Cartesian product in hadoop
Date Thu, 18 Apr 2013 18:52:45 GMT
It is rarely practical to do exhaustive comparisons on datasets of this
size.
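
For a sense of scale, here is a rough back-of-the-envelope calculation. Two datasets of 20 million lines each, as in the question, give about 4 × 10^14 pairs. The comparison rate below is purely an assumption for illustration, not a measured figure.

```python
# Rough cost of comparing every line of one dataset with every line of the other.
lines_per_dataset = 20_000_000            # figure taken from the question
pairs = lines_per_dataset ** 2            # size of the full cartesian product

# ASSUMPTION: a single core evaluates one million comparisons per second.
comparisons_per_core_second = 1_000_000
core_seconds = pairs / comparisons_per_core_second
core_years = core_seconds / (365 * 24 * 3600)

print(f"{pairs:.1e} pairs, roughly {core_years:.1f} core-years at the assumed rate")
```

Even if the assumed rate is off by an order of magnitude in either direction, the total stays large enough that avoiding most comparisons matters more than speeding up each one.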

The usual method is to heuristically prune the cartesian product and
examine only pairs that have a high likelihood of being near each other.

This can be done in many ways.  Your suggestion of doing a map-side join is
a reasonable one, but it will be much slower than methods where you can
prune the comparisons.
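
One common way to prune is "blocking": group lines by a cheap key and compare only lines that share a key. The sketch below is a minimal in-memory illustration, not a Hadoop job. The tokenising `blocking_keys` function, the `jaccard` comparison, and the 0.5 threshold are all hypothetical stand-ins, since the thread does not say what the real comparison is. In MapReduce terms, the map phase would emit each line under its keys and the reduce phase would compare within each key group.

```python
from collections import defaultdict
from itertools import product


def blocking_keys(line):
    # ASSUMPTION: lines that are "near" share at least one token.
    return set(line.lower().split())


def candidate_pairs(left, right):
    """Yield each (i, j) index pair whose lines share a blocking key, once."""
    buckets = defaultdict(lambda: ([], []))
    for i, line in enumerate(left):
        for key in blocking_keys(line):
            buckets[key][0].append(i)
    for j, line in enumerate(right):
        for key in blocking_keys(line):
            buckets[key][1].append(j)
    seen = set()  # toy de-duplication; fine in memory, not at cluster scale
    for left_ids, right_ids in buckets.values():
        for pair in product(left_ids, right_ids):
            if pair not in seen:
                seen.add(pair)
                yield pair


def jaccard(a, b):
    # Hypothetical comparison function: token-set overlap.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


left = ["red apple pie", "blue sky", "green tea"]
right = ["apple pie recipe", "grey sky", "black coffee"]

pairs = sorted(candidate_pairs(left, right))
matches = [(i, j) for i, j in pairs if jaccard(left[i], right[j]) >= 0.5]
print(len(pairs), "candidate pairs instead of", len(left) * len(right))
print(matches)
```

The trade-off is that pairs sharing no key are never examined, so the choice of key decides both how much work is saved and how many true matches are missed.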



On Thu, Apr 18, 2013 at 9:47 AM, zheyi rong <zheyi.rong@gmail.com> wrote:

> Dear all,
>
> I am writing to ask for ideas on how to do a cartesian product in Hadoop.
> Specifically, I have two datasets, each of which contains 20 million
> lines.
> I want to do a cartesian product on these two datasets, comparing lines
> pairwise.
>
> Most of the output of each comparison can be filtered out by a function (we
> do not output the
> whole result of this cartesian product, but only a small part).
>
> I guess one good way is to pass one block from dataset1 and another block
> from dataset2
> to a mapper, then let the mappers do the product in memory to avoid IO.
>
> Any suggestions?
> Thank you very much.
>
> Regards,
> Zheyi Rong
>
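
The block-pairing idea in the quoted message can be sketched as follows. This is plain Python rather than a Hadoop job. The names `split_blocks`, `block_tasks`, `run_task` and the `keep` filter are hypothetical, introduced only for illustration. Each task receives one block from each dataset and does the product in memory.

```python
from itertools import product


def split_blocks(lines, block_size):
    """Cut a dataset into consecutive blocks of at most block_size lines."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def block_tasks(dataset1, dataset2, block_size):
    """One task per (block of dataset1, block of dataset2) combination."""
    return list(product(split_blocks(dataset1, block_size),
                        split_blocks(dataset2, block_size)))


def run_task(block1, block2, keep):
    # In-memory product for a single task; only pairs passing `keep` are output.
    return [(a, b) for a in block1 for b in block2 if keep(a, b)]


dataset1 = ["a1", "b2", "c3", "d4", "e5"]
dataset2 = ["a9", "x2", "c7"]


def keep(a, b):
    # Toy filter standing in for the real comparison: same first character.
    return a[0] == b[0]


tasks = block_tasks(dataset1, dataset2, block_size=2)
output = [hit for b1, b2 in tasks for hit in run_task(b1, b2, keep)]
print(len(tasks), "tasks ->", output)
```

Note that every block is read once for each block of the other dataset, so this bounds the memory needed per task but leaves the total number of comparisons unchanged.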
