From: Harsh J
Date: Mon, 17 Sep 2012 19:16:41 +0530
Subject: Re: Reading from HDFS from inside the mapper
To: user@hadoop.apache.org

Sigurd,

The implementation of fs -ls in the LocalFileSystem relies on Java's
File#list (http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()),
which states: "There is no guarantee that the name strings in the resulting
array will appear in any specific order; they are not, in particular,
guaranteed to appear in alphabetical order." That may just be what is
biting you, since standalone mode uses the LocalFileSystem.
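If your own code needs a stable ordering, don't depend on the listing
order at all; sort the statuses yourself. A rough, untested sketch using
the plain FileSystem API (the class name is made up):

    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists a directory in lexicographic order, regardless of what the
    // underlying FileSystem implementation happens to return.
    public class SortedLs {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] statuses = fs.listStatus(new Path(args[0]));
        // LocalFileSystem inherits File#list()'s unspecified ordering,
        // so impose an explicit order before relying on it.
        Arrays.sort(statuses, new Comparator<FileStatus>() {
          public int compare(FileStatus a, FileStatus b) {
            return a.getPath().getName().compareTo(b.getPath().getName());
          }
        });
        for (FileStatus status : statuses) {
          System.out.println(status.getPath());
        }
      }
    }

Since the part-NNNNN names are zero-padded, lexicographic order is also
their numeric order.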
On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann wrote:
> I've tracked down the problem: it occurs only in standalone mode. In
> pseudo-distributed mode, everything works fine. My underlying OS is
> Ubuntu 12.04 64-bit. When I access the directory in Linux directly,
> everything looks normal; it is only when I access it through Hadoop.
> Has anyone seen this problem before and found a solution?
>
> Thanks,
> Sigurd
>
>
> 2012/9/17 Sigurd Spieckermann
>>
>> I'm experiencing a strange problem right now. I'm writing part-files
>> to HDFS to provide initial data and (though it should not make a
>> difference anyway) write them in ascending order, i.e. part-00000,
>> part-00001, etc. But when I run "hadoop dfs -ls xyz", they appear in
>> the order part-00001, part-00000, part-00002, part-00003, etc. How is
>> that possible? Why aren't they shown in natural order? The map-side
>> join package also considers them in this order, which causes problems.
>>
>>
>> 2012/9/10 Sigurd Spieckermann
>>>
>>> OK, interesting. Just to confirm: is it okay to distribute quite
>>> large files through the DistributedCache? Dataset B could be on the
>>> order of gigabytes. Also, if I have far fewer nodes than
>>> elements/blocks in A, the probability that every node will have to
>>> read (almost) every block of B is quite high, so assuming the
>>> DistributedCache is acceptable here in general, it would be more
>>> efficient to use it than to read from HDFS. But what about the case
>>> where I have m*n nodes? Then every node would receive all of B while
>>> only needing a small fraction of it, right? Could you elaborate on
>>> this in a few sentences, just so I can be sure I understand Hadoop
>>> correctly?
>>>
>>> Thanks,
>>> Sigurd
>>>
>>> 2012/9/10 Harsh J
>>>>
>>>> Sigurd,
>>>>
>>>> Hemanth's recommendation of DistributedCache does fit your
>>>> requirement - it is a generic way of distributing files and archives
>>>> to the tasks of a job. It does not push things into memory
>>>> automatically; it places them on the local disk of the TaskTracker
>>>> your task runs on. You can then choose to use a LocalFileSystem
>>>> impl. to read them out from there, which would end up being
>>>> (slightly) faster than your same approach applied to MapFiles on
>>>> HDFS.
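To make my earlier point above concrete: a rough, untested sketch with
the old mapred API. The class name, record types and the path in the
driver comment are all made up; the point is that the cached file lands
on the task's local disk, so plain java.io can read it:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CacheSideJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private BufferedReader sideData;

      @Override
      public void configure(JobConf conf) {
        try {
          // Paths returned here point at the *local* filesystem copies
          // that the TaskTracker materialized for this task.
          Path[] cached = DistributedCache.getLocalCacheFiles(conf);
          // Plain java.io works because the file sits on local disk.
          sideData = new BufferedReader(new FileReader(cached[0].toString()));
        } catch (IOException e) {
          throw new RuntimeException("Could not open cached file", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // ... consult sideData and emit the joined record ...
      }

      @Override
      public void close() throws IOException {
        sideData.close();
      }
    }

    // Driver side, before submitting the job:
    //   JobConf conf = new JobConf(CacheSideJoinMapper.class);
    //   DistributedCache.addCacheFile(new java.net.URI("/data/B/part-00000"), conf);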
>>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann wrote:
>>>> > I checked DistributedCache, but in general I have to assume that
>>>> > none of the datasets fits in memory... That's why I was considering
>>>> > a map-side join, but by default it doesn't fit my problem. I could
>>>> > probably get it to work, though, but I would have to enforce the
>>>> > requirements of the map-side join.
>>>> >
>>>> >
>>>> > 2012/9/10 Hemanth Yamijala
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> You could check DistributedCache
>>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>>>> >> It would allow you to distribute data to the nodes where your
>>>> >> tasks are run.
>>>> >>
>>>> >> Thanks
>>>> >> Hemanth
>>>> >>
>>>> >>
>>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I would like to perform a map-side join of two large datasets
>>>> >>> where dataset A consists of m*n elements and dataset B consists
>>>> >>> of n elements. For the join, every element of dataset B needs to
>>>> >>> be accessed m times. Each mapper would join one element from A
>>>> >>> with the corresponding element from B. Elements here are actually
>>>> >>> data blocks. Is there a performance problem (and a difference
>>>> >>> compared to a slightly modified map-side join using the join
>>>> >>> package) if I set dataset A as the map-reduce input and load the
>>>> >>> relevant element of dataset B directly from HDFS inside the
>>>> >>> mapper? I could store the elements of B in a MapFile for faster
>>>> >>> random access. In the second case, without the join package, I
>>>> >>> would not have to partition the datasets manually, which would
>>>> >>> allow a bit more flexibility, but I'm wondering whether HDFS
>>>> >>> access from inside a mapper is a strictly bad idea. Also, does
>>>> >>> Hadoop have a cache for such situations, by any chance?
>>>> >>>
>>>> >>> I appreciate any comments!
>>>> >>>
>>>> >>> Sigurd
>>>>
>>>> --
>>>> Harsh J

--
Harsh J
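P.S. On the MapFile question earlier in this thread: a MapFile's index
keeps a lookup from scanning the whole file, so random access from
inside a mapper is workable. A minimal sketch of a single lookup
(untested; the class name and the Text key/value types are assumptions):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    // Looks up a single key in a MapFile directory on HDFS.
    public class MapFileLookup {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: MapFile directory, args[1]: key to fetch
        MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
        try {
          Text value = new Text();
          // get() seeks via the in-memory index; null means the key
          // is absent.
          if (reader.get(new Text(args[1]), value) != null) {
            System.out.println(value);
          }
        } finally {
          reader.close();
        }
      }
    }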