hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Edmund Kohlwey <ekohl...@gmail.com>
Subject Cross Join
Date Wed, 04 Nov 2009 13:52:06 GMT
I'm looking for an efficient way to do a cross join. I've gone through a 
few implementations, and I wanted to seek some advice before attempting 
another. The join is a "large collection to large collection" - so 
there's no trick optimizations like downloading one side of the join on 
each node (ie. map side join). The output of the join will be sparse, 
(its basically matching a large collection of regexes to a large 
collection of strings), but because of the nature of the data there's 
not really any way to pre-process either side of the join.

1. Naive approach - on a single node, iterate over both collections, 
resulting in reading the "left" file 1 times and the right file n times 
- I know this is bad.
2. Indexed approach - index data item with a row/col - requires 
replicating, sorting, and shuffling all the records 2 times - also not 
good. This actually seemed to perform worse than 1, and resulted in 
running out of disk space on the mappers when output was spilled to disk.

I'm now considering what to try next. One idea is to improve on 1 by 
"blocking" the reads, so that the right side of the join is read b 
times, where b is the number of blocks the left side is split into.

The other (imho, best) idea is to write a "reduce-side" join, which 
would actually be fully parallelized, which basically relies on 
map/reduce to split the left side into blocks, and then allows each 
reducer to stream through the right side once. In this version, the 
right side is still downloaded b times, but the operation is done in 
parallel. The only issue with this is that I would need to iterate over 
the reduce iterators multiple times, which is something that M/R doesn't 
allow (I think). I know I could save the contents of the iterator 
locally, but this seems like a bad design choice too. Does anybody know 
if there's a smart way to iterate twice in a reducer?

There's probably some methods I haven't really thought of. Does anyone 
have any suggestions?

View raw message