hadoop-hdfs-user mailing list archives

From jay vyas <jayunit100.apa...@gmail.com>
Subject Re: hadoop/yarn and task parallelization on non-hdfs filesystems
Date Fri, 15 Aug 2014 15:38:05 GMT
Your FileSystem implementation should provide its own tuning parameters for
IO.

For example, in the GlusterFileSystem we have a buffer parameter that is
typically set in core-site.xml.

https://github.com/gluster/glusterfs-hadoop/blob/master/src/main/java/org/apache/hadoop/fs/glusterfs/GlusterVolume.java
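
As a rough illustration only (this is not the actual glusterfs-hadoop code,
and the property name below is made up), a Hadoop-compatible FileSystem
usually picks such a parameter up from the Configuration in initialize():

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.RawLocalFileSystem;

    // Sketch of a FileSystem that reads an IO buffer size from core-site.xml.
    // "fs.example.buffer.size" is a hypothetical key, used only to illustrate.
    public class ExampleVolume extends RawLocalFileSystem {
      private int bufferSize;

      @Override
      public void initialize(URI uri, Configuration conf) throws IOException {
        super.initialize(uri, conf);
        bufferSize = conf.getInt("fs.example.buffer.size", 4096);
      }
    }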


Similarly, in HDFS there are tuning parameters that would go in
hdfs-site.xml.

IIRC from your Stack Overflow question, the Hadoop-compatible FileSystem you
are using is backed by a company of some sort, so you should contact the
engineers working on that implementation about how to tune the underlying FS.

Regarding MapReduce and YARN: task parallelization at that level is
independent of the underlying file system.  There are some parameters you can
specify with your job, like setting the minimum number of tasks, which can
increase or decrease the total number of tasks.

From some experience tuning web crawlers with this stuff, I can say that a
high number will increase parallelism but might decrease availability of
your cluster (and locality of individual jobs).
A high number of tasks generally works well when doing something CPU- or
network-intensive.
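
For instance (a minimal sketch with example values, not a recommendation),
the number of map tasks for one job can be bounded by setting the min/max
input split sizes, which is the same mechanism Harsh points at below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Sketch: influencing the number of map tasks for a single job by
    // bounding the input split size (the byte values are only examples).
    public class SplitTuningExample {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-tuning-example");
        // A smaller maximum split size produces more splits, hence more map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        // A larger minimum split size produces fewer, larger splits.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        // ... set mapper, reducer, input/output paths, then submit the job.
      }
    }

The same knobs are exposed as the
mapreduce.input.fileinputformat.split.minsize and
mapreduce.input.fileinputformat.split.maxsize config properties.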


On Fri, Aug 15, 2014 at 11:22 AM, java8964 <java8964@hotmail.com> wrote:

> I believe that Calvin mentioned before that this parallel file system is
> mounted into the local file system.
>
> In this case, will Hadoop just use java.io.File as the local file system,
> treat the input as local files, and not split them?
>
> Just want to understand the logic of how Hadoop handles local files.
>
> One suggestion I can think of is to split the files manually outside of
> Hadoop. For example, generate lots of smaller files of 128M or 256M each.
>
> In this case, each mapper will process one small file, so you can get good
> utilization of your cluster, assuming you have a lot of small files.
>
> Yong
>
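
(As an aside, a minimal sketch of the pre-splitting idea Yong describes
above; it splits a text file on line boundaries into roughly 128 MB pieces,
and the output naming is purely illustrative:)

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    // Splits one large text file into ~128 MB pieces outside of Hadoop so
    // that each piece becomes its own map task. Line boundaries are kept.
    public class PreSplit {
      public static void main(String[] args) throws IOException {
        long targetBytes = 128L * 1024 * 1024;
        int part = 0;
        long written = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
          BufferedWriter out =
              new BufferedWriter(new FileWriter(args[0] + ".part" + part));
          for (String line; (line = in.readLine()) != null; ) {
            if (written >= targetBytes) {   // roll over to a new output file
              out.close();
              part++;
              written = 0;
              out = new BufferedWriter(new FileWriter(args[0] + ".part" + part));
            }
            out.write(line);
            out.newLine();
            written += line.length() + 1;
          }
          out.close();
        }
      }
    }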
> > From: harsh@cloudera.com
> > Date: Fri, 15 Aug 2014 16:45:02 +0530
> > Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
> > To: user@hadoop.apache.org
>
> >
> > Does your non-HDFS filesystem implement the getFileBlockLocations API that
> > MR relies on to know how to split files?
> >
> > The API is at
> > http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus, long, long),
> > and MR calls it at
> > https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L392
> >
> > If not, perhaps you can enforce a manual chunking by asking MR to use
> > custom min/max split size values via config properties:
> > https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L66
> >
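
(To illustrate the getFileBlockLocations point above: a Hadoop-compatible
FileSystem that has no real notion of blocks can still report artificial
block boundaries so that MR produces multiple splits per file. A rough
sketch, with a made-up block-size property:)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.RawLocalFileSystem;

    // Sketch only: advertise fixed-size "blocks" for a filesystem that has
    // none, so FileInputFormat can create one split per reported block.
    public class ChunkedLocalFileSystem extends RawLocalFileSystem {
      @Override
      public BlockLocation[] getFileBlockLocations(FileStatus file,
          long start, long len) throws IOException {
        if (file == null || start >= file.getLen()) {
          return new BlockLocation[0];
        }
        // "fs.example.block.size" is a hypothetical property; 128 MB default.
        long blockSize = getConf().getLong("fs.example.block.size",
            128L * 1024 * 1024);
        String[] names = { "localhost:50010" };
        String[] hosts = { "localhost" };
        List<BlockLocation> blocks = new ArrayList<BlockLocation>();
        long end = Math.min(start + len, file.getLen());
        for (long offset = (start / blockSize) * blockSize; offset < end;
            offset += blockSize) {
          long length = Math.min(blockSize, file.getLen() - offset);
          blocks.add(new BlockLocation(names, hosts, offset, length));
        }
        return blocks.toArray(new BlockLocation[blocks.size()]);
      }
    }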
> > On Fri, Aug 15, 2014 at 10:16 AM, Calvin <iphcalvin@gmail.com> wrote:
> > > I've looked a bit into this problem some more, and from what another
> > > person has written, HDFS is tuned to scale appropriately [1] given the
> > > number of input splits, etc.
> > >
> > > In the case of utilizing the local filesystem (which is really a
> > > network share on a parallel filesystem), the settings might be set
> > > conservatively in order not to thrash the local disks or present a
> > > bottleneck in processing.
> > >
> > > Since this isn't a big concern, I'd rather tune the settings to
> > > efficiently utilize the local filesystem.
> > >
> > > Are there any pointers to where in the source code I could look in
> > > order to tweak such parameters?
> > >
> > > Thanks,
> > > Calvin
> > >
> > > [1]
> > > https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems
> > >
> > > On Tue, Aug 12, 2014 at 12:29 PM, Calvin <iphcalvin@gmail.com> wrote:
> > >> Hi all,
> > >>
> > >> I've instantiated a Hadoop 2.4.1 cluster and I've found that running
> > >> MapReduce applications will parallelize differently depending on what
> > >> kind of filesystem the input data is on.
> > >>
> > >> Using HDFS, a MapReduce job will spawn enough containers to maximize
> > >> use of all available memory. For example, on a 3-node cluster with 172GB
> > >> of memory and each map task allocating 2GB, about 86 application
> > >> containers will be created.
> > >>
> > >> On a filesystem that isn't HDFS (like NFS or in my use case, a
> > >> parallel filesystem), a MapReduce job will only allocate a subset of
> > >> available tasks (e.g., with the same 3-node cluster, about 25-40
> > >> containers are created). Since I'm using a parallel filesystem, I'm
> > >> not as concerned with the bottlenecks one would find if one were to
> > >> use NFS.
> > >>
> > >> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
> > >> configuration that will allow me to effectively maximize resource
> > >> utilization?
> > >>
> > >> Thanks,
> > >> Calvin
> >
> >
> >
> > --
> > Harsh J
>



-- 
jay vyas
