hadoop-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Reading from HDFS from inside the mapper
Date Mon, 17 Sep 2012 12:47:12 GMT
I'm experiencing a strange problem right now. I'm writing part-files to
HDFS to provide initial data and (although it should not make a difference
anyway) I write them in ascending order, i.e. part-00000, part-00001,
etc. -- in that order. But when I run "hadoop dfs -ls xyz", they are listed
in the order part-00001, part-00000, part-00002, part-00003, etc. How is
that possible? Why aren't they shown in natural order? The map-side join
package also considers them in this order, which causes problems.
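
A minimal sketch of sorting the listing explicitly on the client side (the
"xyz" directory is the one from the example above; FileSystem.listStatus()
makes no ordering guarantee, so natural order has to be enforced by hand):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListPartFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // listStatus() does not guarantee any particular order,
        // so sort by file name explicitly before using the files
        FileStatus[] parts = fs.listStatus(new Path("xyz"));
        Arrays.sort(parts, new Comparator<FileStatus>() {
          public int compare(FileStatus a, FileStatus b) {
            return a.getPath().getName().compareTo(b.getPath().getName());
          }
        });
        for (FileStatus part : parts) {
          System.out.println(part.getPath().getName()); // part-00000, part-00001, ...
        }
      }
    }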

2012/9/10 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>

> OK, interesting. Just to confirm: is it okay to distribute quite large
> files through the DistributedCache? Dataset B could be on the order of
> gigabytes. Also, if I have far fewer nodes than elements/blocks in A, then
> the probability that every node will have to read (almost) every block of B
> is quite high, so assuming the DistributedCache is okay here in general, it
> would be more efficient to use it than to read from HDFS. What about the
> case where I have m*n nodes, though? Then every node would receive all of B
> while only needing a small fraction, right? Could you maybe elaborate on
> this in a few sentences just to be sure I understand Hadoop correctly?
>
> Thanks,
> Sigurd
>
> 2012/9/10 Harsh J <harsh@cloudera.com>
>
>> Sigurd,
>>
>> Hemanth's recommendation of DistributedCache does fit your requirement
>> - it is a generic way of distributing files and archives to the tasks of
>> a job. It does not push things into memory automatically, but onto the
>> local disk of the TaskTracker your task runs on. You can then choose to
>> use a LocalFileSystem impl. to read it out from there, which would end
>> up being (slightly) faster than the same approach applied to MapFiles
>> on HDFS.
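
A minimal sketch of that pattern with the Hadoop 1.x era
org.apache.hadoop.filecache API (the cached HDFS path, key/value types and
mapper logic are placeholders, not anything from this thread); on newer
releases the same thing is exposed directly on Job via job.addCacheFile(),
but the idea is identical:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheExample {

      public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
          // local paths on the TaskTracker's disk, not HDFS paths
          Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
          if (local != null && local.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(local[0].toString()));
            // ... read the side data, e.g. into a lookup structure ...
            reader.close();
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // driver: register an HDFS file (illustrative path) for distribution to all tasks
        DistributedCache.addCacheFile(new URI("/data/datasetB/part-00000"), conf);
        Job job = new Job(conf, "cache-example");
        job.setMapperClass(CacheMapper.class);
        // ... input/output formats and paths as usual ...
      }
    }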
>>
>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>> <sigurd.spieckermann@gmail.com> wrote:
>> > I checked DistributedCache, but in general I have to assume that none of
>> > the datasets fits in memory... That's why I was considering a map-side
>> > join, but by default it doesn't fit my problem. I could probably get it
>> > to work, though I would have to enforce the requirements of the
>> > map-side join.
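
For reference, a minimal sketch of what configuring the join-package could
look like with the old mapred API (the paths, the input format, and -- if I
remember the property name right -- the join expression key are illustrative;
both inputs have to be partitioned identically and sorted by the join key):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class CompositeJoinSetup {
      public static void main(String[] args) {
        JobConf job = new JobConf();
        // the composite input format requires the same number of partitions,
        // the same partitioner, and key-sorted inputs on both sides
        job.setInputFormat(CompositeInputFormat.class);
        job.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class,
            new Path("/data/datasetA"), new Path("/data/datasetB")));
        // the mapper then receives the join key plus a TupleWritable holding
        // one value from each joined source
      }
    }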
>> >
>> >
>> > 2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>
>> >>
>> >> Hi,
>> >>
>> >> You could check DistributedCache
>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>> >> It would allow you to distribute data to the nodes where your tasks are run.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >>
>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>> >> <sigurd.spieckermann@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I would like to perform a map-side join of two large datasets where
>> >>> dataset A consists of m*n elements and dataset B consists of n elements.
>> >>> For the join, every element in dataset B needs to be accessed m times.
>> >>> Each mapper would join one element from A with the corresponding element
>> >>> from B. Elements here are actually data blocks. Is there a performance
>> >>> problem (and a difference compared to a slightly modified map-side join
>> >>> using the join-package) if I set dataset A as the map-reduce input and
>> >>> load the relevant element of dataset B directly from HDFS inside the
>> >>> mapper? I could store the elements of B in a MapFile for faster random
>> >>> access. In the second case, without the join-package, I would not have
>> >>> to partition the datasets manually, which would allow a bit more
>> >>> flexibility, but I'm wondering whether HDFS access from inside a mapper
>> >>> is strictly bad. Also, does Hadoop have a cache for such situations by
>> >>> any chance?
>> >>>
>> >>> I appreciate any comments!
>> >>>
>> >>> Sigurd
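
A minimal sketch of the MapFile approach described in this message, assuming
Hadoop 1.x era APIs; the HDFS path, the key/value types, and the way the join
key is parsed from each record of A are purely illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
      private MapFile.Reader readerB;
      private final Text valueB = new Text();

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // dataset B stored as a MapFile on HDFS (illustrative path)
        readerB = new MapFile.Reader(fs, "/data/datasetB", conf);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // derive the join key for this element of A (illustrative: first token of the line)
        IntWritable joinKey = new IntWritable(Integer.parseInt(value.toString().split("\t")[0]));
        // random access into B via the MapFile index
        if (readerB.get(joinKey, valueB) != null) {
          // ... join the A record with valueB and emit ...
          context.write(new Text(joinKey.toString()), valueB);
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        readerB.close();
      }
    }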
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>
