hadoop-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Reading from HDFS from inside the mapper
Date Mon, 17 Sep 2012 13:50:32 GMT
OK, I see... Is there any way to change this? I need a guaranteed order for
the map-side join to work correctly, and I need standalone mode for
debugging code that is executed on the mapper/reducer nodes.
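(A minimal, self-contained Java sketch of the workaround for the behavior discussed below: since File#list makes no ordering guarantee, sorting the listing restores a deterministic, natural order. The temp-directory setup and class name are hypothetical, for illustration only.)

```java
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;

public class SortedListing {
    // File#list makes no ordering guarantee, so sort explicitly
    // before handing the names to anything order-sensitive.
    static String[] sortedList(File dir) {
        String[] names = dir.list();
        if (names == null) {
            return new String[0];
        }
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical setup: create part files in non-natural order.
        File dir = Files.createTempDirectory("sortdemo").toFile();
        new File(dir, "part-00001").createNewFile();
        new File(dir, "part-00000").createNewFile();
        new File(dir, "part-00002").createNewFile();
        for (String name : sortedList(dir)) {
            System.out.println(name);
        }
    }
}
```

Sorting client-side sidesteps the filesystem's listing order entirely, so the same code behaves identically in standalone and distributed modes.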

2012/9/17 Harsh J <harsh@cloudera.com>

> Sigurd,
>
> The implementation of fs -ls in the LocalFileSystem relies on Java's
> File#list
> http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()
> which states "There is no guarantee that the name strings in the
> resulting array will appear in any specific order; they are not, in
> particular, guaranteed to appear in alphabetical order.". That may
> just be what is biting you, since standalone mode uses LFS.
>
> On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann
> <sigurd.spieckermann@gmail.com> wrote:
> > I've tracked down the problem to only occur in standalone mode. In
> > pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
> > 12.04 64bit. When I access the directory in Linux directly, everything looks
> > normal. It's only when I access it through Hadoop. Has anyone seen this
> > problem before and found a solution?
> >
> > Thanks,
> > Sigurd
> >
> >
> > 2012/9/17 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
> >>
> >> I'm experiencing a strange problem right now. I'm writing part-files
> >> providing initial data to HDFS and (although it should not actually make a
> >> difference) writing them in ascending order, i.e. part-00000,
> >> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz", they
> >> are listed in the order part-00001, part-00000, part-00002, part-00003 etc.
> >> How is that possible? Why aren't they shown in natural order? The map-side
> >> join package also considers them in this order, which causes problems.
> >>
> >>
> >> 2012/9/10 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
> >>>
> >>> OK, interesting. Just to confirm: is it okay to distribute quite large
> >>> files through the DistributedCache? Dataset B could be on the order of
> >>> gigabytes. Also, if I have far fewer nodes than elements/blocks in A, then
> >>> the probability that every node will have to read (almost) every block of B
> >>> is quite high, so assuming the DistributedCache is okay here in general, it
> >>> would be more efficient to use it than to read from HDFS. But what about
> >>> the case where I have m*n nodes? Then every node would receive all of B
> >>> while only needing a small fraction, right? Could you maybe elaborate on
> >>> this in a few sentences just to be sure I understand Hadoop correctly?
> >>>
> >>> Thanks,
> >>> Sigurd
> >>>
> >>> 2012/9/10 Harsh J <harsh@cloudera.com>
> >>>>
> >>>> Sigurd,
> >>>>
> >>>> Hemanth's recommendation of DistributedCache does fit your requirement
> >>>> - it is a generic way of distributing files and archives to the tasks of
> >>>> a job. It does not push things into memory automatically, but onto the
> >>>> local disk of the TaskTracker your task runs on. You can then choose to
> >>>> use a LocalFileSystem impl. to read it out from there, which would end
> >>>> up being (slightly) faster than the same approach applied to MapFiles
> >>>> on HDFS.
> >>>>
> >>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
> >>>>
> >>>> <sigurd.spieckermann@gmail.com> wrote:
> >>>> > I checked DistributedCache, but in general I have to assume that none
> >>>> > of the datasets fits in memory... That's why I was considering a
> >>>> > map-side join, but by default it doesn't fit my problem. I could
> >>>> > probably get it to work, though I would have to enforce the
> >>>> > requirements of the map-side join.
> >>>> >
> >>>> >
> >>>> > 2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>
> >>>> >>
> >>>> >> Hi,
> >>>> >>
> >>>> >> You could check DistributedCache
> >>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> >>>> >> It would allow you to distribute data to the nodes where your tasks
> >>>> >> are run.
> >>>> >>
> >>>> >> Thanks
> >>>> >> Hemanth
> >>>> >>
> >>>> >>
> >>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
> >>>> >> <sigurd.spieckermann@gmail.com> wrote:
> >>>> >>>
> >>>> >>> Hi,
> >>>> >>>
> >>>> >>> I would like to perform a map-side join of two large datasets where
> >>>> >>> dataset A consists of m*n elements and dataset B consists of n
> >>>> >>> elements. For the join, every element in dataset B needs to be
> >>>> >>> accessed m times. Each mapper would join one element from A with the
> >>>> >>> corresponding element from B. Elements here are actually data blocks.
> >>>> >>> Is there a performance problem (and a difference compared to a
> >>>> >>> slightly modified map-side join using the join-package) if I set
> >>>> >>> dataset A as the map-reduce input and load the relevant element from
> >>>> >>> dataset B directly from HDFS inside the mapper? I could store the
> >>>> >>> elements of B in a MapFile for faster random access. In the second
> >>>> >>> case, without the join-package, I would not have to partition the
> >>>> >>> datasets manually, which would allow a bit more flexibility, but I'm
> >>>> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
> >>>> >>> does Hadoop have a cache for such situations by any chance?
> >>>> >>>
> >>>> >>> I appreciate any comments!
> >>>> >>>
> >>>> >>> Sigurd
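(The access pattern in the question above -- dataset A of m*n elements, dataset B of n elements, each element of B read m times -- can be sketched in plain Java. The pairing rule `i % n` between A and B is an assumption made for illustration, not part of the original setup.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapSideJoinSketch {
    // Join each element of A (size m*n) with its corresponding element of
    // B (size n); with the hypothetical pairing A[i] <-> B[i % n], each
    // element of B is accessed exactly m times.
    static List<String> join(List<String> a, List<String> b) {
        int n = b.size();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            out.add(a.get(i) + "|" + b.get(i % n));
        }
        return out;
    }

    public static void main(String[] args) {
        // m = 2, n = 2, so A has 4 elements and B has 2.
        List<String> a = Arrays.asList("a0", "a1", "a2", "a3");
        List<String> b = Arrays.asList("b0", "b1");
        System.out.println(join(a, b));
    }
}
```

In the actual job, each mapper would perform one iteration of this loop, with the B lookup backed by a MapFile or a DistributedCache-local copy rather than an in-memory list.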
> >>>> >>
> >>>> >>
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Harsh J
> >>>
> >>>
> >>
> >
>
>
>
> --
> Harsh J
>
