hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Reading from HDFS from inside the mapper
Date Mon, 17 Sep 2012 13:46:41 GMT
Sigurd,

The implementation of fs -ls on the LocalFileSystem relies on Java's
File#list (http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()),
which states: "There is no guarantee that the name strings in the
resulting array will appear in any specific order; they are not, in
particular, guaranteed to appear in alphabetical order." That may
just be what is biting you, since standalone mode uses LFS.
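A minimal sketch (not part of the original reply) of one way around this:
list the directory through the FileSystem API and sort the statuses by path
before using them, so nothing downstream depends on the undefined ordering of
File#list. The directory name "xyz" is taken from the question quoted below;
the class name is illustrative.

import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SortedPartListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // listStatus makes no ordering promise on the local FS, so sort explicitly.
    FileStatus[] parts = fs.listStatus(new Path("xyz"));
    Arrays.sort(parts, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        return a.getPath().getName().compareTo(b.getPath().getName());
      }
    });

    for (FileStatus part : parts) {
      System.out.println(part.getPath());  // part-00000, part-00001, ...
    }
  }
}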

On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann
<sigurd.spieckermann@gmail.com> wrote:
> I've tracked the problem down: it only occurs in standalone mode. In
> pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
> 12.04 64-bit. When I access the directory in Linux directly, everything looks
> normal; it's only when I access it through Hadoop. Has anyone seen this
> problem before and knows of a solution?
>
> Thanks,
> Sigurd
>
>
> 2012/9/17 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>
>> I'm experiencing a strange problem right now. I'm writing part-files with
>> initial data to HDFS and (although it should not make a difference anyway)
>> I write them in ascending order, i.e. part-00000, part-00001, etc. -- in
>> that order. But when I do "hadoop dfs -ls xyz", they are listed in the
>> order part-00001, part-00000, part-00002, part-00003, etc. How is that
>> possible? Why aren't they shown in natural order? The map-side join
>> package also considers them in this order, which causes problems.
>>
>>
>> 2012/9/10 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
>>>
>>> OK, interesting. Just to confirm: is it okay to distribute quite large
>>> files through the DistributedCache? Dataset B could be on the order of
>>> gigabytes. Also, if I have far fewer nodes than elements/blocks in A, the
>>> probability that every node will have to read (almost) every block of B
>>> is quite high, so assuming the DistributedCache is okay here in general,
>>> it would be more efficient to use it than to read from HDFS. But what
>>> about the case where I have m*n nodes? Then every node would receive all
>>> of B while only needing a small fraction of it, right? Could you maybe
>>> elaborate on this in a few sentences just to be sure I understand Hadoop
>>> correctly?
>>>
>>> Thanks,
>>> Sigurd
>>>
>>> 2012/9/10 Harsh J <harsh@cloudera.com>
>>>>
>>>> Sigurd,
>>>>
>>>> Hemanth's recommendation of DistributedCache does fit your requirement
>>>> - it is a generic way of distributing files and archives to the tasks
>>>> of a job. It does not push anything into memory automatically; the
>>>> files land on the local disk of the TaskTracker your task runs on. You
>>>> can then use a LocalFileSystem impl. to read them from there, which
>>>> would end up being (slightly) faster than the same approach applied to
>>>> MapFiles on HDFS.
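A minimal sketch of what that looks like in code, under a couple of
assumptions not stated in the thread: the new-API Mapper, the Hadoop 1.x-era
org.apache.hadoop.filecache.DistributedCache, and an illustrative HDFS path
/data/B/part-00000 for the side file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheSideDataExample {

  // Driver side: register the file; the framework copies it to each
  // TaskTracker's local disk before the tasks start.
  public static void addSideData(Job job) throws Exception {
    DistributedCache.addCacheFile(new URI("/data/B/part-00000"),  // illustrative path
        job.getConfiguration());
  }

  public static class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      // The cached copy lives on the task node's local disk, not in memory.
      Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
      BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
      // ... read/index the side data here, then close ...
      reader.close();
    }
  }
}

The read happens off the local disk, which is what makes it (slightly) faster
than fetching the same bytes from HDFS in every task.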
>>>>
>>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>>> <sigurd.spieckermann@gmail.com> wrote:
>>>> > I checked DistributedCache, but in general I have to assume that none
>>>> > of the datasets fits in memory... That's why I was considering a
>>>> > map-side join, but by default it doesn't fit my problem. I could
>>>> > probably get it to work, but I would have to enforce the requirements
>>>> > of the map-side join.
>>>> >
>>>> >
>>>> > 2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> You could check DistributedCache
>>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>>>> >> It would allow you to distribute data to the nodes where your tasks
>>>> >> are run.
>>>> >>
>>>> >> Thanks
>>>> >> Hemanth
>>>> >>
>>>> >>
>>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>>>> >> <sigurd.spieckermann@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I would like to perform a map-side join of two large datasets
>>>> >>> where dataset A consists of m*n elements and dataset B consists of
>>>> >>> n elements. For the join, every element in dataset B needs to be
>>>> >>> accessed m times. Each mapper would join one element from A with
>>>> >>> the corresponding element from B. Elements here are actually data
>>>> >>> blocks. Is there a performance problem (and difference compared to
>>>> >>> a slightly modified map-side join using the join-package) if I set
>>>> >>> dataset A as the map-reduce input and load the relevant element
>>>> >>> from dataset B directly from the HDFS inside the mapper? I could
>>>> >>> store the elements of B in a MapFile for faster random access. In
>>>> >>> the second case without the join-package I would not have to
>>>> >>> partition the datasets manually which would allow a bit more
>>>> >>> flexibility, but I'm wondering if HDFS access from inside a mapper
>>>> >>> is strictly bad. Also, does Hadoop have a cache for such situations
>>>> >>> by any chance?
>>>> >>>
>>>> >>> I appreciate any comments!
>>>> >>>
>>>> >>> Sigurd
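A minimal sketch of the MapFile-from-HDFS approach described in the question
above, under assumptions not in the original mail: the job's input delivers
Text keys that match B's keys, B is stored as a MapFile at the illustrative
HDFS path /data/B, and the Hadoop 1.x MapFile.Reader constructor is used.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<Text, Text, Text, Text> {
  private MapFile.Reader bReader;
  private final Text bValue = new Text();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Open the MapFile on HDFS once per task, not once per record.
    bReader = new MapFile.Reader(fs, "/data/B", conf);  // illustrative path
  }

  @Override
  protected void map(Text key, Text aValue, Context context)
      throws IOException, InterruptedException {
    // Random access into B: look up the element that corresponds to this A record.
    if (bReader.get(key, bValue) != null) {
      context.write(key, new Text(aValue.toString() + "\t" + bValue.toString()));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    bReader.close();
  }
}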
>>>> >>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>
>>>
>>
>



-- 
Harsh J
