Date: Thu, 14 Aug 2014 22:46:07 -0600
Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
From: Calvin
To: user@hadoop.apache.org

I've looked into this problem some more, and from what another person has
written [1], HDFS is tuned to scale appropriately given the number of input
splits, etc.

When the local filesystem is used (in my case it is really a network share on
a parallel filesystem), the settings might be set conservatively so as not to
thrash the local disks or create a bottleneck in processing. Since that isn't
a big concern here, I'd rather tune the settings to use the local filesystem
efficiently.

Are there any pointers to where in the source code I could look in order to
tweak such parameters?

Thanks,
Calvin

[1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems

On Tue, Aug 12, 2014 at 12:29 PM, Calvin wrote:
> Hi all,
>
> I've instantiated a Hadoop 2.4.1 cluster and I've found that running
> MapReduce applications will parallelize differently depending on what
> kind of filesystem the input data is on.
>
> Using HDFS, a MapReduce job will spawn enough containers to maximize
> use of all available memory. For example, on a 3-node cluster with 172GB
> of memory and each map task allocating 2GB, about 86 application
> containers will be created.
>
> On a filesystem that isn't HDFS (like NFS or in my use case, a
> parallel filesystem), a MapReduce job will only allocate a subset of
> available tasks (e.g., with the same 3-node cluster, about 25-40
> containers are created). Since I'm using a parallel filesystem, I'm
> not as concerned with the bottlenecks one would find if one were to
> use NFS.
>
> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
> configuration that will allow me to effectively maximize resource
> utilization?
>
> Thanks,
> Calvin
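
For reference, the container figures quoted above can be checked with a
little arithmetic. This is only a rough sketch: it assumes the 172GB is the
total memory available to YARN across the three nodes, that memory is the
only limiting resource, and it ignores the container taken by the
ApplicationMaster. The function names are made up for illustration and are
not part of any Hadoop API.

```python
def max_containers(total_memory_gb: int, container_memory_gb: int) -> int:
    """Upper bound on concurrently running containers if memory is the only limit.

    Assumption: total_memory_gb is the memory YARN can hand out cluster-wide.
    """
    if container_memory_gb <= 0:
        raise ValueError("container_memory_gb must be positive")
    return total_memory_gb // container_memory_gb


def utilization(observed_containers: int, ceiling: int) -> float:
    """Fraction of the memory-bound ceiling that is actually in use."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return observed_containers / ceiling


if __name__ == "__main__":
    # Figures from the thread: 172GB of memory, 2GB per map task.
    ceiling = max_containers(172, 2)
    print(f"memory-bound ceiling: {ceiling} containers")

    # Range observed on the non-HDFS filesystem, per the thread.
    for observed in (25, 40):
        share = utilization(observed, ceiling)
        print(f"{observed} containers -> {share:.0%} of the ceiling")
```

Under those assumptions the ceiling works out to the 86 containers seen on
HDFS, so the 25-40 containers seen on the other filesystem use well under
half of the memory the cluster could allocate.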