Subject: Re: Please help me with heartbeat storm
From: Roland von Herget
To: user@hadoop.apache.org
Date: Thu, 30 May 2013 15:00:10 +0200

Hi Philippe,

thanks a lot, that's the solution. I've disabled
*mapreduce.tasktracker.outofband.heartbeat* and now everything is fine!

Thanks again,
Roland

On Wed, May 29, 2013 at 4:00 PM, Philippe Signoret <
philippe.signoret@gmail.com> wrote:

> This might be relevant:
> https://issues.apache.org/jira/browse/MAPREDUCE-4478
>
> "There are two configuration items to control the TaskTracker's heartbeat
> interval. One is *mapreduce.tasktracker.outofband.heartbeat*. The other is
> *mapreduce.tasktracker.outofband.heartbeat.damper*. If we set
> *mapreduce.tasktracker.outofband.heartbeat* to true and leave
> *mapreduce.tasktracker.outofband.heartbeat.damper* at its default value
> (1000000), the TaskTracker may send heartbeats without any interval."
>
> Philippe
>
> -------------------------------
> *Philippe Signoret*
>
>
> On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan <
> rajesh.balamohan@gmail.com> wrote:
>
>> The default value of CLUSTER_INCREMENT is 100, so Math.max(1000 * 29/100, 3000)
>> = 3000 always. This is why you are seeing so many heartbeats.
>> *You might want to set it to 1 or 5.* This would increase the time taken
>> to send the heartbeat from TT to JT.
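Rajesh's arithmetic above can be sketched as follows. This is only an illustration of the formula as he describes it (the actual JobTracker source may compute the interval slightly differently); `cluster_increment` stands for the CLUSTER_INCREMENT knob he mentions, and all values are in milliseconds:

```python
def heartbeat_interval_ms(cluster_size, cluster_increment=100, floor_ms=3000):
    """Heartbeat interval as described in Rajesh's message: the JT scales
    the interval with cluster size, but never lets it drop below the
    3000 ms floor."""
    return max(1000 * cluster_size // cluster_increment, floor_ms)

# A 29-node cluster with the default increment of 100 stays pinned at the
# floor, since 1000 * 29 / 100 = 290 < 3000:
print(heartbeat_interval_ms(29))                       # 3000

# Lowering the increment to 5, as suggested, actually lengthens the
# interval between heartbeats:
print(heartbeat_interval_ms(29, cluster_increment=5))  # 5800
```

With an increment of 1 the same 29-node cluster would heartbeat only every 29 seconds, which is why Rajesh suggests 1 or 5 as a way to slow the TT-to-JT traffic down.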
>>
>>
>> ~Rajesh.B
>>
>>
>> On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <
>> a.eremihin@corp.badoo.com> wrote:
>>
>>> Hi!
>>>
>>> I tried 5 seconds. Fewer nodes get into the storm, but some still do.
>>> Additionally, updating the ntp service helped a little.
>>>
>>> Initially almost 50% got into storming on each MR job, but after the ntp
>>> update and the increase of the heartbeat to 5 seconds the level is around 10%.
>>>
>>>
>>> On 26/05/13 10:43, murali adireddy wrote:
>>>
>>> Hi,
>>>
>>> Just try this one.
>>>
>>> In the file "hdfs-site.xml", try adding the property
>>> "dfs.heartbeat.interval", with the value in seconds.
>>>
>>> The default value is '3' seconds. In your case, increase the value.
>>>
>>> <property>
>>>   <name>dfs.heartbeat.interval</name>
>>>   <value>3</value>
>>> </property>
>>>
>>> You can find more properties and their default values at the link below.
>>>
>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>>
>>> Please let me know if the above solution worked for you.
>>>
>>>
>>>
>>> On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
>>> a.eremihin@corp.badoo.com> wrote:
>>>
>>>> Hi all,
>>>> I have a 29-server Hadoop cluster in an almost default configuration.
>>>> After installing Hadoop 1.0.4 I've noticed that the JT and some TTs
>>>> waste CPU.
>>>> I started stracing their behaviour and found that some TTs send
>>>> heartbeats at an unlimited rate, meaning hundreds per second.
>>>>
>>>> A daemon restart solves the issue, but even the simplest Hive MR job
>>>> brings the issue back.
>>>>
>>>> Here is the filtered strace of the heartbeating process:
>>>>
>>>> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write
>>>>
>>>> [pid 6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 284) = 284
>>>> [pid 6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 284 <unfinished ...>
>>>> [pid 6065] 13:07:34.808080 <... write resumed> ) = 284
>>>> [pid 6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 284 <unfinished ...>
>>>> [pid 6065] 13:07:34.814595 <... write resumed> ) = 284
>>>> [pid 6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 284 <unfinished ...>
>>>>
>>>> Please help me to stop this storming 8(
>>>>
>>>
>>
>> --
>> ~Rajesh.B
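For reference, the fix Roland reports applying boils down to one entry in mapred-site.xml on each TaskTracker. This is a sketch assuming the property name cited from MAPREDUCE-4478 and a restart of the TT daemons afterwards:

```xml
<!-- mapred-site.xml on each TaskTracker: disable out-of-band
     heartbeats so the default damper value (1000000) can no longer
     produce a zero-interval heartbeat loop (see MAPREDUCE-4478). -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>
```

Note that murali's dfs.heartbeat.interval suggestion earlier in the thread is an HDFS DataNode setting in hdfs-site.xml, separate from the MapReduce TaskTracker heartbeat discussed here.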