hadoop-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
Subject Re: Reading from HDFS from inside the mapper
Date Mon, 10 Sep 2012 11:54:58 GMT
OK, interesting. Just to confirm: is it okay to distribute quite large
files through the DistributedCache? Dataset B could be on the order of
gigabytes. Also, if I have far fewer nodes than elements/blocks in A,
the probability that every node will have to read (almost) every block
of B is quite high, so assuming the DistributedCache is generally fine
here, using it would be more efficient than reading from HDFS. But what
about the case where I have m*n nodes? Then every node would receive
all of B while only needing a small fraction of it, right? Could you
maybe elaborate on this in a few sentences, just to be sure I
understand Hadoop correctly?

Thanks,
Sigurd

2012/9/10 Harsh J <harsh@cloudera.com>

> Sigurd,
>
> Hemanth's recommendation of DistributedCache does fit your requirement
> - it is a generic way of distributing files and archives to tasks of a
> job. It does not push anything into memory automatically; the files
> land on the local disk of the TaskTracker your task runs on. You can
> then use a LocalFileSystem implementation to read them from there,
> which would end up being (slightly) faster than the same approach
> applied to MapFiles on HDFS.
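>
> A rough sketch of that path (untested, against the old mapred API;
> the cached file name below is just a placeholder):
>
>   // Driver: register a file that already sits on HDFS.
>   JobConf conf = new JobConf(MyJob.class);
>   DistributedCache.addCacheFile(new URI("/data/B/part-00000"), conf);
>
>   // In Mapper.configure(JobConf conf): the framework has copied the
>   // file onto the tasktracker's local disk, so open it through the
>   // local filesystem rather than HDFS.
>   Path[] cached = DistributedCache.getLocalCacheFiles(conf);
>   FileSystem localFs = FileSystem.getLocal(conf);
>   FSDataInputStream in = localFs.open(cached[0]);
>   // ... read B and build whatever lookup structure you need ...
>
> (Imports: java.net.URI, org.apache.hadoop.filecache.DistributedCache,
> org.apache.hadoop.fs.FileSystem/Path/FSDataInputStream,
> org.apache.hadoop.mapred.JobConf.)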
>
> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
> <sigurd.spieckermann@gmail.com> wrote:
> > I checked DistributedCache, but in general I have to assume that
> > none of the datasets fits in memory... That's why I was considering
> > a map-side join, but by default it doesn't fit my problem. I could
> > probably get it to work, but I would have to enforce the
> > requirements of the map-side join.
> >
> >
> > 2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>
> >>
> >> Hi,
> >>
> >> You could check DistributedCache
> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> >> It would allow you to distribute data to the nodes where your tasks
> >> are run.
> >>
> >> Thanks
> >> Hemanth
> >>
> >>
> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
> >> <sigurd.spieckermann@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I would like to perform a map-side join of two large datasets
> >>> where dataset A consists of m*n elements and dataset B consists of
> >>> n elements. For the join, every element in dataset B needs to be
> >>> accessed m times. Each mapper would join one element from A with
> >>> the corresponding element from B. Elements here are actually data
> >>> blocks. Is there a performance problem (and a difference compared
> >>> to a slightly modified map-side join using the join-package) if I
> >>> set dataset A as the map-reduce input and load the relevant element
> >>> from dataset B directly from HDFS inside the mapper? I could store
> >>> the elements of B in a MapFile for faster random access. In the
> >>> second case, without the join-package, I would not have to
> >>> partition the datasets manually, which would allow a bit more
> >>> flexibility, but I'm wondering whether HDFS access from inside a
> >>> mapper is strictly bad. Also, does Hadoop have a cache for such
> >>> situations by any chance?
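> >>>
> >>> For illustration, the per-mapper HDFS lookup I have in mind would
> >>> be roughly this (just a sketch; the key/value types and the path
> >>> are made up):
> >>>
> >>>   // In the mapper's setup: open the MapFile that holds dataset B.
> >>>   FileSystem fs = FileSystem.get(conf);
> >>>   MapFile.Reader reader = new MapFile.Reader(fs, "/data/B.map", conf);
> >>>
> >>>   // Per element of A: random-access lookup of the matching block of B.
> >>>   BytesWritable bBlock = new BytesWritable();
> >>>   reader.get(new IntWritable(blockId), bBlock);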
> >>>
> >>> I appreciate any comments!
> >>>
> >>> Sigurd
> >>
> >>
> >
>
>
>
> --
> Harsh J
>
