Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of iphcalvin@gmail.com
 designates 209.85.213.169 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CADSD3vhDqkfLNV9VYg8R8KBDjJ4=Dwv2O8psHr2uB7zZgfP6=A@mail.gmail.com>
References: 
 <CADSD3vhDqkfLNV9VYg8R8KBDjJ4=Dwv2O8psHr2uB7zZgfP6=A@mail.gmail.com>
Date: Thu, 14 Aug 2014 22:46:07 -0600
Message-ID: 
 <CADSD3vh5B2Fp5bTFMatcd0p_dUDdHkE_V=4PsoKMfzf1JemPbQ@mail.gmail.com>
Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
From: Calvin <iphcalvin@gmail.com>
To: user@hadoop.apache.org
Content-Type: text/plain; charset=UTF-8

I've looked a bit into this problem some more, and from what another
person has written, HDFS is tuned to scale appropriately [1] given the
number of input splits, etc.

In the case of utilizing the local filesystem (which is really a
network share on a parallel filesystem), the settings might be set
conservatively in order not to thrash the local disks or present a
bottleneck in processing.

Since this isn't a big concern, I'd rather tune the settings to
efficiently utilize the local filesystem.

Are there any pointers to where in the source code I could look in
order to tweak such parameters?

Thanks,
Calvin

[1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems

On Tue, Aug 12, 2014 at 12:29 PM, Calvin <iphcalvin@gmail.com> wrote:
> Hi all,
>
> I've instantiated a Hadoop 2.4.1 cluster and I've found that running
> MapReduce applications will parallelize differently depending on what
> kind of filesystem the input data is on.
>
> Using HDFS, a MapReduce job will spawn enough containers to maximize
> use of all available memory. For example, a 3-node cluster with 172GB
> of memory with each map task allocating 2GB, about 86 application
> containers will be created.
>
> On a filesystem that isn't HDFS (like NFS or in my use case, a
> parallel filesystem), a MapReduce job will only allocate a subset of
> available tasks (e.g., with the same 3-node cluster, about 25-40
> containers are created). Since I'm using a parallel filesystem, I'm
> not as concerned with the bottlenecks one would find if one were to
> use NFS.
>
> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
> configuration that will allow me to effectively maximize resource
> utilization?
>
> Thanks,
> Calvin