hadoop-common-user mailing list archives

From Eremikhin Alexey <a.eremi...@corp.badoo.com>
Subject Re: Please help me with heartbeat storm
Date Thu, 30 May 2013 19:02:40 GMT
The same has helped me.
Thanks a lot!!
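
For the record, the fix that worked here is a TaskTracker-side change in mapred-site.xml (a sketch; it assumes the property is set back to its off state and the TaskTrackers are restarted afterwards):

```xml
<!-- Disable out-of-band heartbeats so the TaskTracker keeps the
     regular heartbeat interval (see MAPREDUCE-4478, quoted below). -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>
```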

On 30.05.2013 17:00, Roland von Herget wrote:
> Hi Philippe,
>
> thanks a lot, that's the solution. I've disabled
> *mapreduce.tasktracker.outofband.heartbeat* and now everything is fine!
>
> Thanks again,
> Roland
>
>
> On Wed, May 29, 2013 at 4:00 PM, Philippe Signoret
> <philippe.signoret@gmail.com> wrote:
>
>     This might be relevant:
>     https://issues.apache.org/jira/browse/MAPREDUCE-4478
>
>         "There are two configuration items to control the
>         TaskTracker's heartbeat interval. One is
>         *mapreduce.tasktracker.outofband.heartbeat*. The other is
>         *mapreduce.tasktracker.outofband.heartbeat.damper*. If we
>         set *mapreduce.tasktracker.outofband.heartbeat* with true and
>         set *mapreduce.tasktracker.outofband.heartbeat.damper* with
>         default value (1000000), TaskTracker may send heartbeat
>         without any interval."
>
>
>     Philippe
>
>     -------------------------------
>     *Philippe Signoret*
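
In mapred-site.xml terms, the problematic combination the JIRA comment describes looks like this (values taken from the quote above; shown only to illustrate the interaction):

```xml
<!-- Out-of-band heartbeats enabled while the damper stays at its
     default lets the TaskTracker heartbeat with almost no interval. -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat.damper</name>
  <value>1000000</value>
</property>
```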
>
>
>     On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan
>     <rajesh.balamohan@gmail.com> wrote:
>
>         The default value of CLUSTER_INCREMENT is 100, so
>         Math.max(1000 * 29/100, 3000) = 3000 always. This is why you
>         are seeing so many heartbeats. *You might want to set it to 1
>         or 5.* That would increase the interval between heartbeats
>         sent from the TT to the JT.
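
The interval calculation Rajesh describes can be sketched as follows (a simplified illustration; the class, constant, and method names here are assumed for the sketch, not the exact Hadoop 1.x source):

```java
// Sketch of the JobTracker heartbeat-interval calculation discussed above.
// Names are illustrative; the real Hadoop 1.x logic lives in JobTracker.
public class HeartbeatIntervalSketch {
    // 3-second floor on the interval, in milliseconds.
    static final int HEARTBEAT_INTERVAL_MIN = 3000;

    // One extra second of interval per `clusterIncrement` nodes,
    // never below the 3000 ms floor.
    static int intervalMillis(int clusterSize, int clusterIncrement) {
        int scaled = (int) (1000 * Math.ceil((double) clusterSize / clusterIncrement));
        return Math.max(scaled, HEARTBEAT_INTERVAL_MIN);
    }

    public static void main(String[] args) {
        // 29 nodes with the default CLUSTER_INCREMENT of 100: the floor wins.
        System.out.println(intervalMillis(29, 100)); // 3000 ms
        // With the increment set to 1, as suggested above: 29 s between heartbeats.
        System.out.println(intervalMillis(29, 1));   // 29000 ms
    }
}
```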
>
>
>         ~Rajesh.B
>
>
>         On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey
>         <a.eremihin@corp.badoo.com> wrote:
>
>             Hi!
>
>             Tried 5 seconds. Fewer nodes get into the storm, but
>             some still do.
>             Additionally, an update of the ntp service helped a little.
>
>             Initially, almost 50% got into storming on each MR job,
>             but after the ntp update and increasing the heartbeat to
>             5 seconds the level is around 10%.
>
>
>             On 26/05/13 10:43, murali adireddy wrote:
>>             Hi,
>>
>>             Just try this one.
>>
>>             In the file "hdfs-site.xml", add the property
>>             "dfs.heartbeat.interval" with a value in seconds.
>>
>>             The default value is 3 seconds. In your case, increase it.
>>
>>             <property>
>>              <name>dfs.heartbeat.interval</name>
>>              <value>3</value>
>>             </property>
>>
>>             You can find more properties and default values in the
>>             below link.
>>
>>             http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>
>>
>>             Please let me know if the above solution worked for you.
>>
>>
>>
>>
>>             On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey
>>             <a.eremihin@corp.badoo.com> wrote:
>>
>>                 Hi all,
>>                 I have a 29-server Hadoop cluster in an almost
>>                 default configuration.
>>                 After installing Hadoop 1.0.4 I noticed that the JT
>>                 and some TTs waste CPU.
>>                 I started stracing their behaviour and found that
>>                 some TTs send heartbeats at an unlimited rate,
>>                 meaning hundreds per second.
>>
>>                 Restarting the daemon solves the issue, but even the
>>                 simplest Hive MR job brings it back.
>>
>>                 Here is the filtered strace of the heartbeating process:
>>
>>                 hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032
>>                 2>&1  | grep 6065 | grep write
>>
>>
>>                 [pid  6065] 13:07:34.801106 write(70,
>>                 "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 \fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>>                 284) = 284
>>                 [pid  6065] 13:07:34.807968 write(70,
>>                 "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 \fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>>                 284 <unfinished ...>
>>                 [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>>                 [pid  6065] 13:07:34.814473 write(70,
>>                 "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 \fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>>                 284 <unfinished ...>
>>                 [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>>                 [pid  6065] 13:07:34.820960 write(70,
>>                 "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 \fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>>                 284 <unfinished ...>
>>
>>
>>                 Please help me to stop this storming 8(
>>
>>
>
>
>
>
>         -- 
>         ~Rajesh.B
>
>
>

