Date: Wed, 21 Sep 2011 22:54:26 +0000 (UTC)
From: "Chris Riccomini (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <661191352.245.1316645666344.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <890063188.52438.1316639589097.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3065) ApplicationMaster killed by NodeManager due to excessive virtual memory consumption

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112196#comment-13112196 ]

Chris Riccomini commented on MAPREDUCE-3065:
--------------------------------------------

It worked! THANKS! That sucked. For the record, I just set export MALLOC_ARENA_MAX=${MALLOC_ARENA_MAX:-4}, and made sure echo $MALLOC_ARENA_MAX printed 4 in my RM shell, NM shell, and AM. pmap is now showing 1.3 megs instead of 2 gigs. WOOT.

Thanks again!
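The fix only helps if every JVM in the chain actually sees the variable, which is why it had to show up in the RM shell, the NM shell, and the AM. An AM that launches containers of its own may want to forward it explicitly. A minimal sketch, assuming the ContainerLaunchContext in this 0.24-era snapshot exposes a setEnvironment(java.util.Map) setter (the method name could differ in this snapshot); it is not part of the code quoted further down:

import java.util.Collections
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext

object MallocArenaEnv {
  // Cap glibc malloc arenas in every container we launch, mirroring the
  // export MALLOC_ARENA_MAX=4 workaround above. A real AM would merge this
  // entry into whatever environment map it already builds for the context.
  def withArenaCap(ctx: ContainerLaunchContext): ContainerLaunchContext = {
    ctx.setEnvironment(Collections.singletonMap("MALLOC_ARENA_MAX", "4"))
    ctx
  }
}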
> ApplicationMaster killed by NodeManager due to excessive virtual memory consumption
> -----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3065
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3065
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.24.0
>            Reporter: Chris Riccomini
>
> > Hey Vinod,
> >
> > OK, so I have a little more clarity into this.
> >
> > When I bump my resource request for my AM to 4096, it runs. The important line in the NM logs is:
> >
> > 2011-09-21 13:43:44,366 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 25656 for container-id container_1316637655278_0001_01_000001 : Virtual 2260938752 bytes, limit : 4294967296 bytes; Physical 120860672 bytes, limit -1 bytes
> >
> > The thing to note is the virtual memory, which is off the charts, even though my physical memory is almost nothing (12 megs).
> >
> > I'm still poking around the code, but I am noticing that there are two checks in the NM, one for virtual mem, and one for physical mem. The virtual memory check appears to be toggle-able, but is presumably defaulted to on.
> >
> > At this point I'm trying to figure out exactly what the VMEM check is for, why YARN thinks my app is taking 2 gigs, and how to fix this.
> >
> > Cheers,
> > Chris
> > ________________________________________
> > From: Chris Riccomini [criccomini@linkedin.com]
> > Sent: Wednesday, September 21, 2011 1:42 PM
> > To: mapreduce-dev@hadoop.apache.org
> > Subject: Re: ApplicationMaster Memory Usage
> >
> > For the record, I bumped to 4096 for the memory resource request, and it works. :(
> >
> >
> > On 9/21/11 1:32 PM, "Chris Riccomini" wrote:
> >
> >> Hey Vinod,
> >>
> >> So, I ran my application master directly from the CLI. I commented out the YARN-specific code. It runs fine without leaking memory.
> >>
> >> I then ran it from YARN, with all YARN-specific code commented out. It again ran fine.
> >>
> >> I then uncommented JUST my registerWithResourceManager call. It then fails with OOM after a few seconds. I call registerWithResourceManager, and then go into a while(true) { println("yeh") sleep(1000) }. Doing this prints:
> >>
> >> yeh
> >> yeh
> >> yeh
> >> yeh
> >> yeh
> >>
> >> At which point, it dies, and, in the NodeManager, I see:
> >>
> >> 2011-09-21 13:24:51,036 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0005_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2192773120
> >> 2011-09-21 13:24:51,037 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=23852,containerID=container_1316626117280_0005_01_000001] is running beyond memory-limits. Current usage : 2192773120bytes. Limit : 2147483648bytes. Killing container.
> >> Dump of the process-tree for container_1316626117280_0005_01_000001 :
> >> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> >> |- 23852 20570 23852 23852 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stdout 2>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stderr
> >> |- 23855 23852 23852 23852 (java) 81 4 2084134912 14772 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1
> >> 2011-09-21 13:24:51,037 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 23852
> >>
> >> Either something is leaking in YARN, or my registerWithResourceManager code (see below) is doing something funky.
> >>
> >> I'm trying to avoid going through all the pain of attaching a remote debugger. Presumably things aren't leaking in YARN, which means it's likely that I'm doing something wrong in my registration code.
> >>
> >> Incidentally, my NodeManager is running with 1000 megs. My application master memory is set to 2048, and my -Xmx setting is 512M.
> >>
> >> Cheers,
> >> Chris
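On the toggle-able virtual-memory check mentioned above: the ContainersMonitor applies a physical-memory check and a virtual-memory check per container, and both are driven by NM configuration. The property names below are the ones later Hadoop 2.x releases document (yarn.nodemanager.vmem-check-enabled, yarn.nodemanager.pmem-check-enabled, yarn.nodemanager.vmem-pmem-ratio); the 0.24.0-era code in this thread may spell them differently, so treat this as an assumption rather than a reading of that code. They would normally be set in the NM's yarn-site.xml; a Configuration object is used here only to keep the sketch in one language:

import org.apache.hadoop.conf.Configuration

object NodeManagerMemoryChecks {
  def main(args: Array[String]) {
    val conf = new Configuration
    // Keep the physical-memory check, relax the virtual-memory check.
    conf.setBoolean("yarn.nodemanager.pmem-check-enabled", true)
    conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false)
    // Or keep both checks but allow more virtual memory per MB of requested
    // physical memory (later releases default this ratio to 2.1).
    conf.setFloat("yarn.nodemanager.vmem-pmem-ratio", 5.0f)
    println(conf.getBoolean("yarn.nodemanager.vmem-check-enabled", true))
  }
}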
> >> ________________________________________
> >> From: Vinod Kumar Vavilapalli [vinodkv@hortonworks.com]
> >> Sent: Wednesday, September 21, 2011 11:52 AM
> >> To: mapreduce-dev@hadoop.apache.org
> >> Subject: Re: ApplicationMaster Memory Usage
> >>
> >> Actually MAPREDUCE-2998 is only related to MRV2, so that isn't related.
> >>
> >> Somehow, your JVM itself is taking so much virtual memory. Are you loading some native libs?
> >>
> >> And how many containers have already been allocated by the time the AM crashes? Maybe you are accumulating some per-container data. You can try dumping the heap via hprof.
> >>
> >> +Vinod
> >>
> >>
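On the hprof suggestion: heap dumps only cover the Java heap, so in a case like this one (where the growth later turned out to be native, from glibc malloc arenas) they mostly serve to rule the heap out. A sketch of how launch flags for that kind of inspection could be folded into an AM command line; the command string is illustrative and simplified, not the one posted below:

object HeapDumpFlags {
  // Standard HotSpot options (JDK 6/7): write an .hprof file when the heap
  // overflows, or attach the built-in hprof agent for allocation sites.
  val onOomDump  = "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
  val hprofAgent = "-agentlib:hprof=heap=sites,depth=8"

  def amCommand(streamerClass: String): String =
    "java -Xmx512M " + onOomDump + " -cp './package/*' " +
      "kafka.yarn.ApplicationMaster " + streamerClass

  def main(args: Array[String]) {
    println(amCommand("com.linkedin.TODO"))
  }
}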
> >> On Wed, Sep 21, 2011 at 11:21 PM, Chris Riccomini wrote:
> >>
> >>> Hey Vinod,
> >>>
> >>> I svn up'd, and rebuilt. My application's task (container) now runs!
> >>>
> >>> Unfortunately, my application master eventually gets killed by the NodeManager anyway, and I'm still not clear as to why. The AM is just running a loop, asking for a container, and executing a command in the container. It keeps doing this over and over again. After a few iterations, it gets killed with something like:
> >>>
> >>> 2011-09-21 10:42:40,869 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 21666 for container-id container_1316626117280_0002_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 77398016 bytes, limit -1 bytes
> >>> 2011-09-21 10:42:40,869 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0002_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
> >>> 2011-09-21 10:42:40,870 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=21666,containerID=container_1316626117280_0002_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
> >>> Dump of the process-tree for container_1316626117280_0002_01_000001 :
> >>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> >>> |- 21669 21666 21666 21666 (java) 105 4 2152300544 18593 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1
> >>> |- 21666 20570 21666 21666 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stdout 2>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stderr
> >>>
> >>> 2011-09-21 10:42:40,870 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 21666
> >>>
> >>> I don't think that my AM is leaking memory. Full code paste after the break.
> >>>
> >>> 1. Do I need to release a container in my AM even if the AM receives it as a finished container in the resource request response?
> >>> 2. Do I need to free any other resources after a resource request (e.g. ResourceRequest, AllocateRequest, etc)?
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>>
> >>> def main(args: Array[String]) {
> >>>   // YARN will always give our ApplicationMaster
> >>>   // the first four parameters as:
> >>>   val packagePath = args(0)
> >>>   val appId = args(1).toInt
> >>>   val attemptId = args(2).toInt
> >>>   val timestamp = args(3).toLong
> >>>
> >>>   // these are our application master's parameters
> >>>   val streamerClass = args(4)
> >>>   val tasks = args(5).toInt
> >>>
> >>>   // TODO log params here
> >>>
> >>>   // start the application master helper
> >>>   val conf = new Configuration
> >>>   val applicationMasterHelper = new ApplicationMasterHelper(appId, attemptId, timestamp, conf)
> >>>     .registerWithResourceManager
> >>>
> >>>   // start and manage the slaves
> >>>   val noReleases = List[ContainerId]()
> >>>   var runningContainers = 0
> >>>
> >>>   // keep going forever
> >>>   while (true) {
> >>>     val nonRunningTasks = tasks - runningContainers
> >>>     val response = applicationMasterHelper.sendResourceRequest(nonRunningTasks, noReleases)
> >>>
> >>>     response.getAllocatedContainers.foreach(container => {
> >>>       new ContainerExecutor(packagePath, container)
> >>>         .addCommand("java -Xmx256M -cp './package/*' kafka.yarn.StreamingTask " + streamerClass + " "
> >>>           + "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
> >>>           + "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr").execute(conf)
> >>>     })
> >>>
> >>>     runningContainers += response.getAllocatedContainers.length
> >>>     runningContainers -= response.getCompletedContainersStatuses.length
> >>>
> >>>     Thread.sleep(1000)
> >>>   }
> >>>
> >>>   applicationMasterHelper.unregisterWithResourceManager("SUCCESS")
> >>> }
> >>>
> >>>
> >>> class ApplicationMasterHelper(iAppId: Int, iAppAttemptId: Int, lTimestamp: Long, conf: Configuration) {
> >>>   val rpc = YarnRPC.create(conf)
> >>>   val appId = Records.newRecord(classOf[ApplicationId])
> >>>   val appAttemptId = Records.newRecord(classOf[ApplicationAttemptId])
> >>>   val rmAddress = NetUtils.createSocketAddr(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS))
> >>>   val resourceManager = rpc.getProxy(classOf[AMRMProtocol], rmAddress, conf).asInstanceOf[AMRMProtocol]
> >>>   var requestId = 0
> >>>
> >>>   appId.setClusterTimestamp(lTimestamp)
> >>>   appId.setId(iAppId)
> >>>   appAttemptId.setApplicationId(appId)
> >>>   appAttemptId.setAttemptId(iAppAttemptId)
> >>>
> >>>   def registerWithResourceManager(): ApplicationMasterHelper = {
> >>>     val req = Records.newRecord(classOf[RegisterApplicationMasterRequest])
> >>>     req.setApplicationAttemptId(appAttemptId)
> >>>     // TODO not sure why these are blank. This is how Spark does it
> >>>     req.setHost("")
> >>>     req.setRpcPort(1)
> >>>     req.setTrackingUrl("")
> >>>     resourceManager.registerApplicationMaster(req)
> >>>     this
> >>>   }
> >>>
> >>>   def unregisterWithResourceManager(state: String): ApplicationMasterHelper = {
> >>>     val finReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
> >>>     finReq.setAppAttemptId(appAttemptId)
> >>>     finReq.setFinalState(state)
> >>>     resourceManager.finishApplicationMaster(finReq)
> >>>     this
> >>>   }
> >>>
> >>>   def sendResourceRequest(containers: Int, release: List[ContainerId]): AMResponse = {
> >>>     // TODO will need to make this more flexible for hostname requests, etc
> >>>     val request = Records.newRecord(classOf[ResourceRequest])
> >>>     val pri = Records.newRecord(classOf[Priority])
> >>>     val capability = Records.newRecord(classOf[Resource])
> >>>     val req = Records.newRecord(classOf[AllocateRequest])
> >>>     request.setHostName("*")
> >>>     request.setNumContainers(containers)
> >>>     pri.setPriority(1)
> >>>     request.setPriority(pri)
> >>>     capability.setMemory(128)
> >>>     request.setCapability(capability)
> >>>     req.setResponseId(requestId)
> >>>     req.setApplicationAttemptId(appAttemptId)
> >>>     req.addAllAsks(Lists.newArrayList(request))
> >>>     req.addAllReleases(release)
> >>>     requestId += 1
> >>>     // TODO we might want to return a list of container executors here instead of AMResponses
> >>>     resourceManager.allocate(req).getAMResponse
> >>>   }
> >>> }
> >>>
> >>>
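On question 1 above the code: a sketch of one way the allocate loop could account for completed containers separately from containers it wants to hand back, so that the release list only ever carries allocations the AM no longer wants. It is written against the same 0.24-era AMResponse record the paste uses, as an illustration of the bookkeeping rather than a statement of what the API requires:

import scala.collection.JavaConversions._
import org.apache.hadoop.yarn.api.records.{AMResponse, ContainerId}

object ContainerBookkeeping {
  // Returns the updated running-container count and the container ids to
  // send back on the next allocate call. Completed containers only adjust
  // the count; releases are reserved for surplus allocations beyond `tasks`.
  def reconcile(running: Int, tasks: Int, response: AMResponse): (Int, List[ContainerId]) = {
    val allocated = response.getAllocatedContainers.toList
    val completed = response.getCompletedContainersStatuses.size
    val wanted    = allocated.take(math.max(0, tasks - running))
    val surplus   = allocated.drop(wanted.size)
    (running + wanted.size - completed, surplus.map(_.getId))
  }
}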
> >>> ________________________________________
> >>> From: Vinod Kumar Vavilapalli [vinodkv@hortonworks.com]
> >>> Sent: Wednesday, September 21, 2011 10:08 AM
> >>> To: mapreduce-dev@hadoop.apache.org
> >>> Subject: Re: ApplicationMaster Memory Usage
> >>>
> >>> Yes, the process-dump clearly tells that this is MAPREDUCE-2998.
> >>>
> >>> +Vinod
> >>> (With a smirk to see his container-memory-monitoring code in action)
> >>>
> >>>
> >>> On Wed, Sep 21, 2011 at 10:26 PM, Arun C Murthy wrote:
> >>>
> >>>> I'll bet you are hitting MR-2998.
> >>>>
> >>>> From the changelog:
> >>>>
> >>>> MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.
> >>>>
> >>>> Arun
> >>>>
> >>>> On Sep 21, 2011, at 9:52 AM, Chris Riccomini wrote:
> >>>>
> >>>>> Hey Guys,
> >>>>>
> >>>>> My ApplicationMaster is being killed by the NodeManager because of memory consumption, and I don't understand why. I'm using -Xmx512M, and setting my resource request to 2048.
> >>>>>
> >>>>> .addCommand("java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster " ...
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> private var memory = 2048
> >>>>>
> >>>>> resource.setMemory(memory)
> >>>>> containerCtx.setResource(resource)
> >>>>> containerCtx.setCommands(cmds.toList)
> >>>>> containerCtx.setLocalResources(Collections.singletonMap("package", packageResource))
> >>>>> appCtx.setApplicationId(appId)
> >>>>> appCtx.setUser(user.getShortUserName)
> >>>>> appCtx.setAMContainerSpec(containerCtx)
> >>>>> request.setApplicationSubmissionContext(appCtx)
> >>>>> applicationsManager.submitApplication(request)
> >>>>>
> >>>>> When this runs, I see (in my NodeManager's logs):
> >>>>>
> >>>>> 2011-09-21 09:35:19,112 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 28134 for container-id container_1316559026783_0003_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 71540736 bytes, limit -1 bytes
> >>>>> 2011-09-21 09:35:19,112 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316559026783_0003_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
> >>>>> 2011-09-21 09:35:19,113 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=28134,containerID=container_1316559026783_0003_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
> >>>>> Dump of the process-tree for container_1316559026783_0003_01_000001 :
> >>>>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> >>>>> |- 28134 25886 28134 28134 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1 1>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stdout 2>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stderr
> >>>>> |- 28137 28134 28134 28134 (java) 92 3 2152300544 17163 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1
> >>>>>
> >>>>> 2011-09-21 09:35:19,113 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 28134
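The java child in that dump has a 512 MB heap but over 2 GB of virtual memory, which is consistent with the glibc malloc arena behaviour that the MALLOC_ARENA_MAX=4 setting in the comment at the top works around: 64-bit glibc 2.10+ creates up to 8 arenas per core, and each arena reserves a 64 MB mapping. A back-of-the-envelope sketch; the 8-per-core and 64 MB figures are assumed glibc defaults, not something read out of YARN:

object ArenaMath {
  // Assumed glibc >= 2.10 defaults on 64-bit: up to 8 malloc arenas per
  // core, each reserving a 64 MB virtual-memory mapping.
  val ArenasPerCore = 8
  val ArenaSizeMb   = 64

  def main(args: Array[String]) {
    val cores = Runtime.getRuntime.availableProcessors
    val arenaReservationMb = cores * ArenasPerCore * ArenaSizeMb
    val heapMb = 512 // -Xmx512M from the AM launch command
    // On a 4-core box: 4 * 8 * 64 = 2048 MB of arena reservations alone,
    // which plus heap, permgen, and thread stacks lands near the ~2 GB
    // virtual size the ContainersMonitor reports for the java process.
    println("cores=" + cores +
      " arenaReservationMb=" + arenaReservationMb +
      " heapMb=" + heapMb)
  }
}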
> >>>>>
> >>>>> It appears that YARN is honoring my 2048 command, yet my process is somehow taking 2260938752 bytes. I don't think that I'm using nearly that much in permgen, and my heap is limited to 512. I don't have any JNI stuff running (that I know of), so it's unclear to me what's going on here. The only thing that I can think of is that Java's Runtime exec is forking, and copying its entire JVM memory footprint for the fork.
> >>>>>
> >>>>> Has anyone seen this? Am I doing something dumb?
> >>>>>
> >>>>> Thanks!
> >>>>> Chris
> >>>>
> >>>>
> >>>
> >

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira