Subject: Re: YARN Pi example job stuck at 0% (No MR tasks are started by ResourceManager)
From: anil gupta
To: common-user@hadoop.apache.org
Date: Fri, 27 Jul 2012 16:22:11 -0700

Hi Harsh,

Thanks a lot for your response. I am going to try your suggestions and let you know the outcome.

I am running the cluster on a VMware hypervisor. I have 3 physical machines, each with 16 GB of RAM and 4 TB of disk (2 HDs of 2 TB each). On every machine I am running 4 VMs, and each VM has 3.2 GB of memory. I built this cluster to try out HA (NN, ZK, HMaster), since we are a little reluctant to deploy anything without HA in prod. This cluster is meant to be used as an HBase cluster, and MR is going to be used only for bulk loading. Also, my data dump is around 10 GB (which is pretty small for Hadoop). I am going to load this data into 4 different schemas, which will be roughly 150 million records in HBase.

So I think I will lower the memory requirements of YARN for my use case rather than reduce the number of data nodes to increase the memory of the remaining data nodes. Do you think this is the right approach for my cluster environment?

Also, on a side note, shouldn't the NodeManager throw an error on this kind of memory problem? It just sat there quietly. Should I file a JIRA for this?

Thanks a lot,
Anil Gupta
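For reference, a minimal sketch of what those lowered settings could look like, using the property names from Harsh's reply quoted below. The values are illustrative assumptions sized against the 1200 MB NodeManager offering, not settings taken from this thread; they would go in the client's mapred-site.xml or the job configuration:

  <!-- Sketch only: assumed values, chosen so that each individual container
       request (AM, map, reduce) fits within a 1200 MB NodeManager offering. -->
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>512</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>512</value>
  </property>

With the AM request at or below what a NodeManager offers, the AM container can actually be scheduled and the job should move past 0%.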
On Fri, Jul 27, 2012 at 3:36 PM, Harsh J wrote:
> Hi,
>
> The 'root' doesn't matter. You may run jobs as any username on an
> unsecured cluster, should be just the same.
>
> The config yarn.nodemanager.resource.memory-mb = 1200 is your issue.
> By default, the tasks will execute with a resource demand of 1 GB, and
> the AM itself demands, by default, 1.5 GB to run. None of your nodes
> are hence able to start your AM (demand=1500mb) and hence if the AM
> doesn't start, your job won't initiate either.
>
> You can do a few things:
>
> 1. Raise yarn.nodemanager.resource.memory-mb to a value close to 4 GB
> perhaps, if you have the RAM? Think of it as the new 'slots' divider.
> The larger the offering (close to total RAM you can offer for
> containers from the machine), the more the tasks that may run on it
> (depending on their own demand, of course). Reboot the NM's one by one
> and this app will begin to execute.
> 2. Lower the AM's requirement, i.e. lower
> yarn.app.mapreduce.am.resource.mb in your client's mapred-site.xml or
> job config from 1500 to 1000 or less, so it fits in the NM's offering.
> Likewise, control the map and reduce's requests via
> mapreduce.map.memory.mb and mapreduce.reduce.memory.mb as needed.
> Resubmit the job with these lowered requirements and things should now
> work.
>
> Optionally, you may also cap the max/min possible requests via
> "yarn.scheduler.minimum-allocation-mb" and
> "yarn.scheduler.maximum-allocation-mb", such that no app/job ends up
> demanding more than a certain limit and hence run into the
> 'forever-waiting' state as in your case.
>
> Hope this helps! For some communication diagrams on how an app (such
> as MR2, etc.) may work on YARN and how the resource negotiation works,
> you can check out this post from Ahmed at
> http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
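As an illustration of the optional caps Harsh mentions above, a minimal sketch follows; these properties would typically go in the ResourceManager's yarn-site.xml, and the values are assumptions matched to the 1200 MB NodeManager offering in this cluster, not values recommended in the thread:

  <!-- Sketch only: assumed bounds so that no application can request a
       container larger than a single NodeManager's offering (1200 MB here). -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>256</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1200</value>
  </property>

The idea, per Harsh's note, is that no app or job can demand more than this limit and so cannot end up in the 'forever-waiting' state described above.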
> On Sat, Jul 28, 2012 at 3:35 AM, anil gupta wrote:
> > Hi Harsh,
> >
> > I have set yarn.nodemanager.resource.memory-mb to 1200 MB. Also, does it
> > matter if I run the jobs as "root" while the RM service and NM service
> > are running as the "yarn" user? However, I have created the /user/root
> > directory for the root user in HDFS.
> >
> > Here is the yarn-site.xml:
> >
> > <configuration>
> >   <property>
> >     <name>yarn.nodemanager.aux-services</name>
> >     <value>mapreduce.shuffle</value>
> >   </property>
> >   <property>
> >     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
> >     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
> >   </property>
> >   <property>
> >     <name>yarn.log-aggregation-enable</name>
> >     <value>true</value>
> >   </property>
> >   <property>
> >     <description>List of directories to store localized files in.</description>
> >     <name>yarn.nodemanager.local-dirs</name>
> >     <value>/disk/yarn/local</value>
> >   </property>
> >   <property>
> >     <description>Where to store container logs.</description>
> >     <name>yarn.nodemanager.log-dirs</name>
> >     <value>/disk/yarn/logs</value>
> >   </property>
> >   <property>
> >     <description>Where to aggregate logs to.</description>
> >     <name>yarn.nodemanager.remote-app-log-dir</name>
> >     <value>/var/log/hadoop-yarn/apps</value>
> >   </property>
> >   <property>
> >     <description>Classpath for typical applications.</description>
> >     <name>yarn.application.classpath</name>
> >     <value>
> >       $HADOOP_CONF_DIR,
> >       $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
> >       $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
> >       $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
> >       $YARN_HOME/*,$YARN_HOME/lib/*
> >     </value>
> >   </property>
> >   <property>
> >     <name>yarn.resourcemanager.resource-tracker.address</name>
> >     <value>ihub-an-l1:8025</value>
> >   </property>
> >   <property>
> >     <name>yarn.resourcemanager.address</name>
> >     <value>ihub-an-l1:8040</value>
> >   </property>
> >   <property>
> >     <name>yarn.resourcemanager.scheduler.address</name>
> >     <value>ihub-an-l1:8030</value>
> >   </property>
> >   <property>
> >     <name>yarn.resourcemanager.admin.address</name>
> >     <value>ihub-an-l1:8141</value>
> >   </property>
> >   <property>
> >     <name>yarn.resourcemanager.webapp.address</name>
> >     <value>ihub-an-l1:8088</value>
> >   </property>
> >   <property>
> >     <name>mapreduce.jobhistory.intermediate-done-dir</name>
> >     <value>/disk/mapred/jobhistory/intermediate/done</value>
> >   </property>
> >   <property>
> >     <name>mapreduce.jobhistory.done-dir</name>
> >     <value>/disk/mapred/jobhistory/done</value>
> >   </property>
> >   <property>
> >     <name>yarn.web-proxy.address</name>
> >     <value>ihub-an-l1:9999</value>
> >   </property>
> >   <property>
> >     <name>yarn.app.mapreduce.am.staging-dir</name>
> >     <value>/user</value>
> >   </property>
> >   <property>
> >     <description>Amount of physical memory, in MB, that can be allocated
> >     for containers.</description>
> >     <name>yarn.nodemanager.resource.memory-mb</name>
> >     <value>1200</value>
> >   </property>
> > </configuration>
> >
> > On Fri, Jul 27, 2012 at 2:23 PM, Harsh J wrote:
> >> Can you share your yarn-site.xml contents? Have you tweaked memory
> >> sizes in there?
> >>
> >> On Fri, Jul 27, 2012 at 11:53 PM, anil gupta wrote:
> >> > Hi All,
> >> >
> >> > I have a Hadoop 2.0 alpha (cdh4) hadoop/hbase cluster running on
> >> > CentOS 6.0. The cluster has 4 admin nodes and 8 data nodes. I have the RM
> >> > and History Server running on one machine. The RM web interface shows that
> >> > 8 nodes are connected to it. I installed this cluster with HA capability and
> >> > I have already tested HA for the Namenodes, ZK, and HBase Master. I am
> >> > running the pi example mapreduce job as user "root" and I have created the
> >> > "/user/root" directory in HDFS.
> >> >
> >> > Last few lines of one of the nodemanagers:
> >> > 2012-07-26 21:58:38,745 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.0.0-cdh4.0.0.jar!/webapps/node to /tmp/Jetty_0_0_0_0_8042_node____19tj0x/webapp
> >> > 2012-07-26 21:58:38,907 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:8042
> >> > 2012-07-26 21:58:38,907 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /node started at 8042
> >> > 2012-07-26 21:58:38,919 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
> >> > 2012-07-26 21:58:38,919 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is started.
> >> > 2012-07-26 21:58:38,919 INFO org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is started.
> >> > 2012-07-26 21:58:38,922 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connected to ResourceManager at ihub-an-l1/172.31.192.151:8025
> >> > 2012-07-26 21:58:38,924 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as ihub-dn-l2:53199 with total resource of memory: 1200
> >> > 2012-07-26 21:58:38,924 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started.
> >> > 2012-07-26 21:58:38,929 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
> >> > *2012-07-26 21:58:38,929 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is stopped.*
> >> >
> >> > Why is the NodeStatusUpdaterImpl stopped?
> >> >
> >> > Here are the last few lines of the RM:
> >> > 2012-07-27 09:38:24,644 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 2
> >> > 2012-07-27 09:38:25,310 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 2 submitted by user root
> >> > 2012-07-27 09:38:25,310 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root IP=172.31.192.51 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1343365114818_0002
> >> > 2012-07-27 09:38:25,310 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1343365114818_0002 State change from NEW to SUBMITTED
> >> > 2012-07-27 09:38:25,311 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1343365114818_0002_000001
> >> > 2012-07-27 09:38:25,311 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1343365114818_0002_000001 State change from NEW to SUBMITTED
> >> > 2012-07-27 09:38:25,311 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Application Submission: application_1343365114818_0002 from root, currently active: 1
> >> > 2012-07-27 09:38:25,311 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1343365114818_0002_000001 State change from SUBMITTED to SCHEDULED
> >> > 2012-07-27 09:38:25,311 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1343365114818_0002 State change from SUBMITTED to ACCEPTED
> >> >
> >> > The Pi example job has been stuck for the last hour. Why is it not trying to start tasks on the NMs?
> >> >
> >> > Here is the command I fired to run the job:
> >> > [root@ihub-nn-a1 hadoop-yarn]# hadoop --config /etc/hadoop/conf/ jar /usr/lib/hadoop-mapreduce/hadoop-*-examples.jar pi 10 100000
> >> > Number of Maps = 10
> >> > Samples per Map = 100000
> >> > Wrote input for Map #0
> >> > Wrote input for Map #1
> >> > Wrote input for Map #2
> >> > Wrote input for Map #3
> >> > Wrote input for Map #4
> >> > Wrote input for Map #5
> >> > Wrote input for Map #6
> >> > Wrote input for Map #7
> >> > Wrote input for Map #8
> >> > Wrote input for Map #9
> >> > Starting Job
> >> > 12/07/27 09:38:27 INFO input.FileInputFormat: Total input paths to process : 10
> >> > 12/07/27 09:38:27 INFO mapreduce.JobSubmitter: number of splits:10
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
> >> > 12/07/27 09:38:27 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
> >> > 12/07/27 09:38:27 INFO mapred.ResourceMgrDelegate: Submitted application application_1343365114818_0002 to ResourceManager at ihub-an-l1/172.31.192.151:8040
> >> > 12/07/27 09:38:27 INFO mapreduce.Job: The url to track the job: http://ihub-an-l1:9999/proxy/application_1343365114818_0002/
> >> > 12/07/27 09:38:27 INFO mapreduce.Job: Running job: job_1343365114818_0002
> >> >
> >> > No Map-Reduce tasks are started by the cluster. I don't see any errors anywhere in the application. Please help me in resolving this problem.
> >> >
> >> > Thanks,
> >> > Anil Gupta
> >>
> >> --
> >> Harsh J
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>
> --
> Harsh J

--
Thanks & Regards,
Anil Gupta