hadoop-common-user mailing list archives

From Vasco Visser <vasco.vis...@gmail.com>
Subject Re: Pairwise Comparison of Large Datasets
Date Thu, 03 Jan 2013 00:47:38 GMT
Hi Rob,

Thanks for sharing. The approach you take is similar to how Pig
implements the cross product (see the cross section in:

You'll probably find this article interesting:
Processing Theta-Joins using MapReduce
It features a similar grid-like approach, but with some smart tricks.
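
To make the grid idea concrete, here is a minimal single-process sketch of a
grid-partitioned cross product. It is an illustration under assumptions, not
Pig's actual code and not the article's algorithm verbatim: the name
grid_cross and the rows/cols/seed parameters are hypothetical, and the "map"
and "reduce" phases are simulated with plain Python loops.

```python
import random
from collections import defaultdict
from itertools import product


def grid_cross(left, right, rows=2, cols=2, seed=0):
    """Yield every (a, b) pair exactly once via a rows x cols grid of cells.

    Hypothetical helper for illustration only. Each cell stands in for one
    reducer, so no single cell ever needs to see the whole of both inputs.
    """
    rng = random.Random(seed)
    # cell key (row, col) -> (left records, right records)
    cells = defaultdict(lambda: ([], []))

    # "Map" phase: a left record picks one random row and is copied to
    # every column of that row.
    for a in left:
        r = rng.randrange(rows)
        for c in range(cols):
            cells[(r, c)][0].append(a)

    # A right record picks one random column and is copied to every row
    # of that column.
    for b in right:
        c = rng.randrange(cols)
        for r in range(rows):
            cells[(r, c)][1].append(b)

    # "Reduce" phase: each cell pairs up only what it received. A given
    # (a, b) meets in exactly one cell: (row of a, column of b).
    for key in sorted(cells):
        lefts, rights = cells[key]
        yield from product(lefts, rights)


if __name__ == "__main__":
    for pair in grid_cross(range(3), "xy", rows=2, cols=2):
        print(pair)
```

The trade-off is replication: each left record is copied cols times and each
right record rows times, in exchange for every cell handling only a fraction
of the pairs.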

You'll probably also like Jimmy Lin's articles on pairwise similarity in
MR (http://www.umiacs.umd.edu/~jimmylin/publications/index.html).

best, Vasco

On Mon, Dec 31, 2012 at 7:42 PM, Rob Styles <rob@dynamicorange.com> wrote:
> Happy New Year :)
> Thought some of you might find this useful.
> We've developed an approach to doing pairwise comparisons on large datasets
> that doesn't require visibility of the whole dataset at any time. The
> approach brings together pairs for comparison using incrementing coordinates
> to target a value at a cell.
> http://dynamicorange.com/2012/12/31/pairwise-comparisons-of-large-datasets/
> There is still work to do on making the approach more efficient and trying
> to eliminate a pre-processing step. Help gratefully received.
> If there's a toolset already out there for doing this I'd be happy to hear
> about that too!
> thanks
> rob
