Subject: Re: virtual memory consumption
From: Jakub Stransky <stransky.ja@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 11 Sep 2014 12:24:24 +0200

Hi,

thanks for the reply. The machine is pretty small: it has 4 GB of memory in total. We reserved 1 GB for the OS and 1 GB for HBase (per the recommendation), so 2 GB remain, which is what the NodeManager claims.

It is actually a cluster of 5 machines, 2 name nodes and 3 data nodes. All machines have similar parameters, so the stronger ones are used for the NNs and the rest for the DNs. I know the hardware is far from ideal, but it is a small cluster for a POC and for gaining some experience.

Back to the problem. At the time this happens no other job is running on the cluster. All mappers (3) have already finished, and the single reduce task fails at ~70% of its progress on virtual memory consumption. The dataset being processed is a 500 MB compressed Avro data file. The reducer doesn't intentionally cache anything; it just distributes the records into various folders dynamically (see the sketch below). From the RM console I can clearly see that there are free, unused resources (memory).
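For reference, the reducer is essentially doing the following. This is only a rough sketch: the class name, key/value types and output names are made up, and I am assuming the standard MultipleOutputs helper here since that is how we split output into folders; the real code does nothing beyond this.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Illustrative sketch only: route every record to an output folder derived from its key.
public class RoutingReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // A '/' in baseOutputPath puts the file into a per-key subfolder.
            out.write(NullWritable.get(), value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close(); // flush and close the per-folder record writers
    }
}

One thing I only noticed while writing this down: MultipleOutputs keeps one record writer open per distinct baseOutputPath until close(), and with compressed output each writer has its own codec buffers, so the footprint grows with the number of output folders. Whether that alone can account for 2.1 GB of virtual memory I don't know.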
Is there a way to detect what consumed the assigned virtual memory? For a smaller amount of input data (~120 MB compressed) the job finishes just fine within 3 minutes, so we obviously have a problem in scaling the task out. Could someone provide some hints? It seems we are missing something fundamental here.
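One thing I was considering, to at least see the numbers from inside the task, is to log the process's own virtual and resident size from /proc at a few points in the reducer. A minimal sketch (Linux only, purely for debugging, and the class name is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Debug-only helper: print the current JVM's memory counters from /proc/self/status (Linux).
public final class ProcMemLogger {

    public static void log(String label) {
        try {
            List<String> lines =
                    Files.readAllLines(Paths.get("/proc/self/status"), StandardCharsets.UTF_8);
            for (String line : lines) {
                // VmPeak/VmSize = virtual memory, VmRSS = resident (physical) memory.
                if (line.startsWith("VmPeak") || line.startsWith("VmSize") || line.startsWith("VmRSS")) {
                    System.err.println(label + ": " + line);
                }
            }
        } catch (IOException e) {
            System.err.println("could not read /proc/self/status: " + e);
        }
    }
}

Since stderr ends up in the container log (as can be seen from the redirection in the container dump in my original mail below), calling this from setup(), periodically from reduce(), and from cleanup() should at least show whether VmSize climbs gradually or jumps at some point; pmap -x <pid> on the node should then show which mappings are growing. Does that sound like a sensible approach, or is there a more standard way?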
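Also, regarding the recommendation quoted in my original mail below about disabling the check: if I understand it correctly, that would mean setting the following in yarn-site.xml (the check is enabled by default), or alternatively raising yarn.nodemanager.vmem-pmem-ratio above its default of 2.1:

yarn.nodemanager.vmem-check-enabled : false

But I'd prefer to first understand where the virtual memory actually goes rather than just switching the check off.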
Thanks for helping me out,
Jakub

On 11 September 2014 11:34, Susheel Kumar Gadalay wrote:
> Your physical memory is 1 GB on this node.
>
> What are the other containers (map tasks) running on this?
>
> You have given map memory as 768M, reduce memory as 1024M, and AM as 1024M.
>
> With the AM and a single map task that is 1.7G, and it cannot start another
> container for the reducer.
> Reduce these values and check.
>
> On 9/11/14, Jakub Stransky wrote:
> > Hello hadoop users,
> >
> > I am facing the following issue when running an M/R job, during the reduce phase:
> >
> > Container [pid=22961,containerID=container_1409834588043_0080_01_000010] is
> > running beyond virtual memory limits. Current usage: 636.6 MB of 1 GB
> > physical memory used; 2.1 GB of 2.1 GB virtual memory used.
> > Killing container. Dump of the process-tree for
> > container_1409834588043_0080_01_000010 :
> > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> >    SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> > |- 22961 16896 22961 22961 (bash) 0 0 9424896 312 /bin/bash -c
> > /usr/java/default/bin/java -Djava.net.preferIPv4Stack=true
> > -Dhadoop.metrics.log.level=WARN -Xmx768m
> > -Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_000010/tmp
> > -Dlog4j.configuration=container-log4j.properties
> > -Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010
> > -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
> > org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184
> > attempt_1409834588043_0080_r_000000_0 10
> > 1>/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010/stdout
> > 2>/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010/stderr
> > |- 22970 22961 22961 22961 (java) 24692 1165 2256662528 162659
> > /usr/java/default/bin/java -Djava.net.preferIPv4Stack=true
> > -Dhadoop.metrics.log.level=WARN -Xmx768m
> > -Djava.io.tmpdir=/home/hadoop/yarn/local/usercache/jobsubmit/appcache/application_1409834588043_0080/container_1409834588043_0080_01_000010/tmp
> > -Dlog4j.configuration=container-log4j.properties
> > -Dyarn.app.container.log.dir=/home/hadoop/yarn/logs/application_1409834588043_0080/container_1409834588043_0080_01_000010
> > -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
> > org.apache.hadoop.mapred.YarnChild 153.87.47.116 47184
> > attempt_1409834588043_0080_r_000000_0 10 Container killed on request. Exit
> > code is 143
> >
> >
> > I have the following settings, with the default virtual-to-physical memory ratio of 2.1:
> > # hadoop - yarn-site.xml
> > yarn.nodemanager.resource.memory-mb  : 2048
> > yarn.scheduler.minimum-allocation-mb : 256
> > yarn.scheduler.maximum-allocation-mb : 2048
> >
> > # hadoop - mapred-site.xml
> > mapreduce.map.memory.mb            : 768
> > mapreduce.map.java.opts            : -Xmx512m
> > mapreduce.reduce.memory.mb         : 1024
> > mapreduce.reduce.java.opts         : -Xmx768m
> > mapreduce.task.io.sort.mb          : 100
> > yarn.app.mapreduce.am.resource.mb  : 1024
> > yarn.app.mapreduce.am.command-opts : -Xmx768m
> >
> > I have the following questions:
> > - Is it possible to track down the virtual memory consumption and find what
> >   caused it to be so high?
> > - What is the best way to solve this kind of problem?
> > - I found the following recommendation on the internet: "We actually recommend
> >   disabling this check by setting yarn.nodemanager.vmem-check-enabled to false
> >   as there is reason to believe the virtual/physical ratio is exceptionally high
> >   with some versions of Java / Linux." Is it a good way to go?
> >
> > My reduce task doesn't perform any heavy processing - it just classifies the data:
> > for a given input key it chooses the appropriate output folder and writes the
> > data out.
> >
> > Thanks for any advice
> > Jakub
> >

--
Jakub Stransky
cz.linkedin.com/in/jakubstransky