hadoop-mapreduce-user mailing list archives

From Hemanth Yamijala <yhema...@thoughtworks.com>
Subject Re: Reading from HDFS from inside the mapper
Date Mon, 10 Sep 2012 10:06:43 GMT

You could check DistributedCache. It would allow you to distribute data to the nodes where your tasks are run.
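As a rough sketch of the suggestion (not from the original thread; the paths, class names, and Mapper signature are illustrative, and this uses the pre-YARN `DistributedCache` API that was current in 2012), dataset B could be registered at job-setup time and then read from each task node's local disk:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheJoinExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Register dataset B; the framework copies it once to each node
    // that runs tasks for this job. "/data/datasetB" is a made-up path.
    DistributedCache.addCacheFile(new URI("/data/datasetB/part-00000"), conf);
    Job job = new Job(conf, "map-side join via DistributedCache");
    // ... set mapper class, input/output formats and paths, then submit ...
  }

  public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws java.io.IOException {
      // Local on-disk copies of the cached files, readable with plain java.io;
      // a typical pattern is to load (part of) B into memory here, once per task.
      Path[] localFiles =
          DistributedCache.getLocalCacheFiles(context.getConfiguration());
      // ... parse localFiles[0] into an in-memory map ...
    }
  }
}
```

Note the trade-off: this works well when the B elements needed by a task fit on local disk (or in memory), since each node pays the copy cost once rather than on every HDFS read.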


On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <
sigurd.spieckermann@gmail.com> wrote:

> Hi,
> I would like to perform a map-side join of two large datasets where
> dataset A consists of m*n elements and dataset B consists of n elements.
> For the join, every element in dataset B needs to be accessed m times. Each
> mapper would join one element from A with the corresponding element from B.
> Elements here are actually data blocks. Is there a performance problem (and
> difference compared to a slightly modified map-side join using the
> join-package) if I set dataset A as the map-reduce input and load the
> relevant element from dataset B directly from the HDFS inside the mapper? I
> could store the elements of B in a MapFile for faster random access. In the
> second case without the join-package I would not have to partition the
> datasets manually which would allow a bit more flexibility, but I'm
> wondering if HDFS access from inside a mapper is strictly bad. Also, does
> Hadoop have a cache for such situations by any chance?
> I appreciate any comments!
> Sigurd
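For the second approach described in the question, a mapper can open a `MapFile.Reader` once in `setup()` and reuse it across all `map()` calls, so each lookup is a binary search over the sorted keys rather than a scan. This is only a hedged sketch under assumed types and paths (Text keys/values, a hypothetical `/data/datasetB` MapFile directory), not code from the thread:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapFileJoinMapper extends Mapper<Text, Text, Text, Text> {
  private MapFile.Reader reader;
  private final Text bValue = new Text();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Open the MapFile directory on HDFS once per task, not per record.
    reader = new MapFile.Reader(fs, "/data/datasetB", conf);
  }

  @Override
  protected void map(Text key, Text aValue, Context context)
      throws IOException, InterruptedException {
    // get() does a random-access lookup; returns null if the key is absent.
    if (reader.get(key, bValue) != null) {
      context.write(key, new Text(aValue + "\t" + bValue));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    reader.close();
  }
}
```

Keeping the reader open for the task's lifetime amortizes the HDFS open cost; the per-lookup network reads remain, which is the performance difference versus a DistributedCache-based join.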
