hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Reading from HDFS from inside the mapper
Date Mon, 10 Sep 2012 11:41:42 GMT
Sigurd,

Hemanth's recommendation of DistributedCache does fit your requirement
- it is a generic way of distributing files and archives to the tasks
of a job. It does not automatically push anything into memory; it
places the files on the local disk of the TaskTracker your task runs
on. You can then use a LocalFileSystem impl. to read them from there,
which would end up being (slightly) faster than the same approach
applied to MapFiles on HDFS.
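
For example, roughly (a sketch against the pre-YARN APIs; the class
name and the side-file path below are placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void setup(Context context) throws IOException {
        // Files added via addCacheFile() in the driver have already
        // been copied to the local disk of the TaskTracker running
        // this task; getLocalCacheFiles() returns their local paths.
        Path[] local = DistributedCache.getLocalCacheFiles(
            context.getConfiguration());
        // Plain local-filesystem read; no HDFS round trip per lookup.
        BufferedReader in = new BufferedReader(
            new FileReader(local[0].toString()));
        // ... load the side data into whatever structure the join needs ...
        in.close();
      }
    }

    // Driver side, before submitting the job:
    //   DistributedCache.addCacheFile(
    //       new java.net.URI("/user/sigurd/datasetB/part-00000"),  // placeholder
    //       job.getConfiguration());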

On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
<sigurd.spieckermann@gmail.com> wrote:
> I checked DistributedCache, but in general I have to assume that none of
> the datasets fits in memory... That's why I was considering a map-side
> join, but by default it doesn't fit my problem. I could probably get it
> to work, but I would have to enforce the requirements of the map-side
> join.
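>
> For reference, those requirements boil down to both inputs being sorted
> by the same key and partitioned identically. A rough sketch of the
> driver setup with the old-API join package (the paths are placeholders):
>
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.mapred.JobConf;
>     import org.apache.hadoop.mapred.SequenceFileInputFormat;
>     import org.apache.hadoop.mapred.join.CompositeInputFormat;
>
>     public class JoinConfSketch {
>       public static void main(String[] args) {
>         JobConf job = new JobConf();
>         job.setInputFormat(CompositeInputFormat.class);
>         // "inner" joins records sharing a key across all sources;
>         // both inputs must be sorted by the same key and split into
>         // the same number of identically partitioned files.
>         job.set("mapred.join.expr", CompositeInputFormat.compose(
>             "inner", SequenceFileInputFormat.class,
>             new Path("/data/A"), new Path("/data/B")));
>       }
>     }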
>
>
> 2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>
>>
>> Hi,
>>
>> You could check DistributedCache
>> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>> It would allow you to distribute data to the nodes where your tasks are run.
>>
>> Thanks
>> Hemanth
>>
>>
>> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>> <sigurd.spieckermann@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I would like to perform a map-side join of two large datasets where
>>> dataset A consists of m*n elements and dataset B consists of n elements.
>>> For the join, every element in dataset B needs to be accessed m times.
>>> Each mapper would join one element from A with the corresponding element
>>> from B. Elements here are actually data blocks. Is there a performance
>>> problem (and a difference compared to a slightly modified map-side join
>>> using the join-package) if I set dataset A as the map-reduce input and
>>> load the relevant element of dataset B directly from HDFS inside the
>>> mapper? I could store the elements of B in a MapFile for faster random
>>> access. In the second case, without the join-package, I would not have
>>> to partition the datasets manually, which would allow a bit more
>>> flexibility, but I'm wondering whether HDFS access from inside a mapper
>>> is strictly bad. Also, does Hadoop have a cache for such situations by
>>> any chance?
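>>>
>>> Concretely, the per-mapper lookup I have in mind is something like
>>> this (a sketch; the MapFile path and key are placeholders):
>>>
>>>     import org.apache.hadoop.conf.Configuration;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.io.MapFile;
>>>     import org.apache.hadoop.io.Text;
>>>
>>>     // Opened once per task (e.g. in setup()) and reused across
>>>     // map() calls.
>>>     Configuration conf = new Configuration();
>>>     FileSystem fs = FileSystem.get(conf);
>>>     MapFile.Reader reader =
>>>         new MapFile.Reader(fs, "/data/B-mapfile", conf);
>>>     // get() binary-searches the in-memory index, then seeks into
>>>     // the data file; it returns null if the key is absent.
>>>     Text val = new Text();
>>>     reader.get(new Text("key-of-B-element"), val);
>>>     reader.close();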
>>>
>>> I appreciate any comments!
>>>
>>> Sigurd
>>
>>
>



-- 
Harsh J
