hadoop-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Reading from HDFS from inside the mapper
Date Mon, 10 Sep 2012 10:45:38 GMT
I checked DistributedCache, but in general I have to assume that none of
the datasets fits in memory... That's why I was considering map-side join,
but by default it doesn't fit my problem. I could probably get it to
work though, but I would have to enforce the requirements of the map-side

2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>

> Hi,
> You could check DistributedCache (
> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> It would allow you to distribute data to the nodes where your tasks are run.
> Thanks
> Hemanth
> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <
> sigurd.spieckermann@gmail.com> wrote:
>> Hi,
>> I would like to perform a map-side join of two large datasets where
>> dataset A consists of m*n elements and dataset B consists of n elements.
>> For the join, every element in dataset B needs to be accessed m times. Each
>> mapper would join one element from A with the corresponding element from B.
>> Elements here are actually data blocks. Is there a performance problem (and
>> difference compared to a slightly modified map-side join using the
>> join-package) if I set dataset A as the map-reduce input and load the
>> relevant element from dataset B directly from HDFS inside the mapper? I
>> could store the elements of B in a MapFile for faster random access. In the
>> second case without the join-package I would not have to partition the
>> datasets manually which would allow a bit more flexibility, but I'm
>> wondering if HDFS access from inside a mapper is strictly bad. Also, does
>> Hadoop have a cache for such situations by any chance?
>> I appreciate any comments!
>> Sigurd
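
One way to picture the question about caching: since every element of B is looked up m times, a mapper task could keep the most recently used blocks of B in a small in-memory LRU cache, so that repeated lookups of the same key do not go back to HDFS each time. The sketch below is plain Java and does not use any Hadoop API. The class name BlockCache and the loader function are hypothetical; the loader stands in for whatever actually reads a block of B (for example a lookup in a MapFile), and the capacity would have to be chosen so the cached blocks fit in the task's heap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Small LRU cache for blocks of dataset B, intended to live inside one mapper task.
 * Hypothetical helper, not part of Hadoop.
 */
final class BlockCache<K, V> {
    // Assumption: wraps the real (expensive) read of one block of B from HDFS.
    private final Function<K, V> loader;
    private final Map<K, V> cache;
    // Counts how often the loader was actually invoked (cache misses).
    private int loads = 0;

    BlockCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used block once the capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    /** Returns the block for the key, loading it only on a cache miss. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            cache.put(key, value);
        }
        return value;
    }

    int loads() {
        return loads;
    }
}
```

In a mapper, such a cache would typically be created once per task and consulted for every input record of A. How much it helps depends on the order in which A's records arrive: if records that need the same element of B are adjacent in the input, most lookups hit the cache; if they are scattered, the cache mostly misses and each record still costs one read from HDFS.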
