hadoop-common-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Reading from HDFS from inside the mapper
Date Mon, 10 Sep 2012 09:57:30 GMT

I would like to perform a map-side join of two large datasets where dataset
A consists of m*n elements and dataset B consists of n elements. For the
join, every element in dataset B needs to be accessed m times. Each mapper
would join one element from A with the corresponding element from B.
Elements here are actually data blocks. Would there be a performance problem
(or any difference compared to a slightly modified map-side join using the
join package) if I set dataset A as the MapReduce input and load the
relevant element of dataset B directly from HDFS inside the mapper? I
could store the elements of B in a MapFile for faster random access. In
this second case, without the join package, I would not have to partition
the datasets manually, which would allow a bit more flexibility, but I'm
wondering whether HDFS access from inside a mapper is strictly bad. Also,
does Hadoop have a cache for such situations by any chance?
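
To make the second option concrete, here is a minimal sketch of a small
per-mapper LRU cache for blocks of B, so that a block needed by several
elements of A is fetched only once per mapper. It uses plain Java only. The
names BlockCache, JoinSketch and the loader function are hypothetical, and
the loader merely stands in for whatever would actually read a block of B
(for example a MapFile lookup on HDFS); no Hadoop API is used or implied.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical per-mapper LRU cache for blocks of dataset B.
 * The loader is an assumption: it stands in for a real read of one
 * block of B, such as a MapFile lookup on HDFS.
 */
class BlockCache<K, V> {
    private final Function<K, V> loader;
    private final LinkedHashMap<K, V> cache;
    private int loads = 0; // how often the loader was actually invoked

    BlockCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps entries ordered least recently used first
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict once capacity is exceeded
            }
        };
    }

    /** Returns the cached block, loading it first on a cache miss. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            cache.put(key, value);
        }
        return value;
    }

    int loads() {
        return loads;
    }
}

class JoinSketch {
    public static void main(String[] args) {
        int m = 3; // each block of B is needed m times
        int n = 2; // number of blocks in B
        // Stand-in loader: fabricates a block instead of reading HDFS.
        BlockCache<Integer, String> b = new BlockCache<>(n, j -> "B" + j);
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                // "Join" element (i, j) of A with block j of B.
                System.out.println("A[" + i + "," + j + "] with " + b.get(j));
            }
        }
        System.out.println("loads = " + b.loads());
    }
}
```

With the cache sized to hold all n blocks, the loader runs n times instead
of m*n times within one mapper. Whether that helps in a real job would
depend on how the elements of A are distributed across mappers and on how
large the blocks of B are relative to the mapper's heap.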

I appreciate any comments!

