hadoop-common-user mailing list archives

From Stuart White <stuart.whi...@gmail.com>
Subject Efficient algorithm for many-to-many reduce-side join?
Date Thu, 28 May 2009 12:02:59 GMT
I need to do a reduce-side join of two datasets.  It's a many-to-many
join; that is, each dataset can contain multiple records with any
given key.

Every description of a reduce-side join I've seen involves
constructing the keys emitted by your mapper such that records from
one dataset will be presented to the reducer before records from the
second dataset.  In the reducer, I should "hold on" to the value from
the first dataset and remember it as I iterate across the values from
the second dataset.
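
To make the pattern concrete, here is a minimal sketch in plain Python
(not Hadoop API code).  The in-memory sort on (key, tag) stands in for
the shuffle and secondary sort, and the function name, the record
layout and the 0/1 tags are assumptions made up for illustration.

```python
from itertools import groupby

def reduce_one_to_many(tagged_records):
    """Join records shaped (key, tag, value); tag 0 marks the 'one' side,
    tag 1 the 'many' side. The names and tags are hypothetical."""
    # Sorting on (key, tag) imitates the shuffle with a secondary sort:
    # within each key, the tag-0 record arrives before the tag-1 records.
    ordered = sorted(tagged_records, key=lambda r: (r[0], r[1]))
    for key, group in groupby(ordered, key=lambda r: r[0]):
        held = None  # the single value we "hold on" to for this key
        for _, tag, value in group:
            if tag == 0:
                held = value
            elif held is not None:
                # Stream each 'many'-side value against the held value.
                yield (key, held, value)

records = [
    ("k1", 1, "order-a"),
    ("k1", 0, "alice"),
    ("k2", 0, "bob"),
    ("k1", 1, "order-b"),
]
for row in reduce_one_to_many(records):
    print(row)
```

Only one value per key is held at a time, regardless of how many
records the second dataset has for that key.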

This seems like it only works well for one-to-many joins (when one of
your datasets will only have a single record with any given key).
In that case it scales well because you're only remembering one value
per key.

In a many-to-many join, if you apply this same algorithm, you'll need
to remember all values from one dataset, which of course will be
problematic (and won't scale) when dealing with large datasets with
large numbers of records with the same keys.
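
The concern can be illustrated with a minimal sketch in plain Python
(again not Hadoop API code; the function name, record layout and 0/1
tags are hypothetical, and the in-memory sort stands in for the
shuffle).  Applying the same "hold on" idea to a many-to-many join
means buffering every record of one dataset for the current key:

```python
from itertools import groupby

def reduce_many_to_many(tagged_records):
    """Join records shaped (key, tag, value); tag 0 is the buffered dataset,
    tag 1 the streamed dataset. The names and tags are hypothetical."""
    # Sorting on (key, tag) imitates the shuffle with a secondary sort.
    ordered = sorted(tagged_records, key=lambda r: (r[0], r[1]))
    for key, group in groupby(ordered, key=lambda r: r[0]):
        buffered = []  # grows with the number of tag-0 records for this key
        for _, tag, value in group:
            if tag == 0:
                buffered.append(value)
            else:
                # Each streamed value is paired with every buffered value.
                for left in buffered:
                    yield (key, left, value)

records = [
    ("k", 0, "a1"),
    ("k", 0, "a2"),
    ("k", 1, "b1"),
    ("k", 1, "b2"),
]
for row in reduce_many_to_many(records):
    print(row)
```

The `buffered` list is the part that does not scale: its size is the
number of records the first dataset has for a single key.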

Does an efficient algorithm exist for a many-to-many reduce-side join?
