Subject: Re: Please help me with heartbeat storm
From: Roland von Herget
To: user@hadoop.apache.org
Date: Thu, 30 May 2013 15:00:10 +0200

Hi Philippe,

thanks a lot, that's the solution. I've disabled
*mapreduce.tasktracker.outofband.heartbeat* and now everything is fine!

Thanks again,
Roland

On Wed, May 29, 2013 at 4:00 PM, Philippe Signoret <
philippe.signoret@gmail.com> wrote:

> This might be relevant:
> https://issues.apache.org/jira/browse/MAPREDUCE-4478
>
> "There are two configuration items to control the TaskTracker's heartbeat
> interval. One is *mapreduce.tasktracker.outofband.heartbeat*. The other is
> *mapreduce.tasktracker.outofband.heartbeat.damper*. If we set
> *mapreduce.tasktracker.outofband.heartbeat* to true and leave
> *mapreduce.tasktracker.outofband.heartbeat.damper* at its default value
> (1000000), the TaskTracker may send heartbeats without any interval."
>
> Philippe
>
> -------------------------------
> *Philippe Signoret*
>
>
> On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan <
> rajesh.balamohan@gmail.com> wrote:
>
>> The default value of CLUSTER_INCREMENT is 100, so Math.max(1000 * 29/100, 3000)
>> = 3000 always. This is why you are seeing so many heartbeats.
>> *You might want to set it to 1 or 5.* This would increase the time taken
>> to send the heartbeat from TT to JT.
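Rajesh's arithmetic above can be sketched as follows. This is only an illustration of the formula as he describes it (the actual JobTracker source may compute the interval slightly differently); `cluster_increment` stands for the CLUSTER_INCREMENT knob he mentions, and all values are in milliseconds:

```python
def heartbeat_interval_ms(cluster_size, cluster_increment=100, floor_ms=3000):
    """Heartbeat interval as described in Rajesh's message: the JT scales
    the interval with cluster size, but never lets it drop below the
    3000 ms floor."""
    return max(1000 * cluster_size // cluster_increment, floor_ms)

# A 29-node cluster with the default increment of 100 stays pinned at the
# floor, since 1000 * 29 / 100 = 290 < 3000:
print(heartbeat_interval_ms(29))                       # 3000

# Lowering the increment to 5, as suggested, actually lengthens the
# interval between heartbeats:
print(heartbeat_interval_ms(29, cluster_increment=5))  # 5800
```

With an increment of 1 the same 29-node cluster would heartbeat only every 29 seconds, which is why Rajesh suggests 1 or 5 as a way to slow the TT-to-JT traffic down.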
>>
>>
>> ~Rajesh.B
>>
>>
>> On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <
>> a.eremihin@corp.badoo.com> wrote:
>>
>>> Hi!
>>>
>>> I tried 5 seconds. Fewer nodes get into the storm, but some still do.
>>> Additionally, updating the ntp service helped a little.
>>>
>>> Initially almost 50% got into storming on each MR job, but after the ntp
>>> update and the increase of the heartbeat to 5 seconds the level is around 10%.
>>>
>>>
>>> On 26/05/13 10:43, murali adireddy wrote:
>>>
>>> Hi,
>>>
>>> Just try this one.
>>>
>>> In the file "hdfs-site.xml", try adding the property
>>> "dfs.heartbeat.interval", with the value in seconds.
>>>
>>> The default value is '3' seconds. In your case, increase the value.
>>>
>>> <property>
>>>   <name>dfs.heartbeat.interval</name>
>>>   <value>3</value>
>>> </property>
>>>
>>> You can find more properties and their default values at the link below.
>>>
>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>>
>>> Please let me know if the above solution worked for you.
>>>
>>>
>>>
>>> On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
>>> a.eremihin@corp.badoo.com> wrote:
>>>
>>>> Hi all,
>>>> I have a 29-server Hadoop cluster in an almost default configuration.
>>>> After installing Hadoop 1.0.4 I've noticed that the JT and some TTs
>>>> waste CPU.
>>>> I started stracing their behaviour and found that some TTs send
>>>> heartbeats at an unlimited rate, meaning hundreds per second.
>>>>
>>>> A daemon restart solves the issue, but even the simplest Hive MR job
>>>> brings the issue back.
>>>>
>>>> Here is the filtered strace of the heartbeating process:
>>>>
>>>> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write
>>>>
>>>> [pid 6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 284) = 284
>>>> [pid 6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 284 <unfinished ...>
>>>> [pid 6065] 13:07:34.808080 <... write resumed> ) = 284
>>>> [pid 6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 284 <unfinished ...>
>>>> [pid 6065] 13:07:34.814595 <... write resumed> ) = 284
>>>> [pid 6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 284 <unfinished ...>
>>>>
>>>> Please help me to stop this storming 8(
>>>>
>>>
>>
>> --
>> ~Rajesh.B
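For reference, the fix Roland reports applying boils down to one entry in mapred-site.xml on each TaskTracker. This is a sketch assuming the property name cited from MAPREDUCE-4478 and a restart of the TT daemons afterwards:

```xml
<!-- mapred-site.xml on each TaskTracker: disable out-of-band
     heartbeats so the default damper value (1000000) can no longer
     produce a zero-interval heartbeat loop (see MAPREDUCE-4478). -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>
```

Note that murali's dfs.heartbeat.interval suggestion earlier in the thread is an HDFS DataNode setting in hdfs-site.xml, separate from the MapReduce TaskTracker heartbeat discussed here.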