hadoop-mapreduce-user mailing list archives

From Eremikhin Alexey <a.eremi...@corp.badoo.com>
Subject Re: Please help me with heartbeat storm
Date Sat, 25 May 2013 19:27:37 GMT
Hi Roland

Here is my configuration:
SLES11 SP1
hadoop 1.0.4
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

It seems we have nothing in common except the Hadoop version 8)

On 25.05.2013 19:44, Roland von Herget wrote:
> Hi Alexey,
>
> I don't know the solution to this problem, but I can second it; I'm
> seeing nearly the same thing:
> My TaskTrackers are flooding the JobTracker with heartbeats. This
> starts after the first mapred job and can be cleared by restarting
> the TaskTracker.
> The TT nodes show high system CPU usage; the JT is not suffering
> from this.
>
> my environment:
> debian 6.0.7
> hadoop 1.0.4
> java version "1.7.0_15"
> Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
>
> What's your environment?
>
> --Roland
>
>
> On Fri, May 24, 2013 at 3:10 PM, Eremikhin Alexey
> <a.eremihin@corp.badoo.com> wrote:
>
>     Hi all,
>     I have a 29-server Hadoop cluster in an almost default configuration.
>     After installing Hadoop 1.0.4 I noticed that the JT and some TTs
>     waste CPU.
>     I straced their behaviour and found that some TTs send
>     heartbeats without any rate limit,
>     i.e. hundreds per second.
>
>     Restarting the daemon solves the issue, but even the simplest Hive MR
>     job brings it back.
>
>     Here is the filtered strace of the heartbeating process:
>
>     hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write
>
>
>     [pid  6065] 13:07:34.801106 write(70,
>     "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>     284) = 284
>     [pid  6065] 13:07:34.807968 write(70,
>     "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>     284 <unfinished ...>
>     [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>     [pid  6065] 13:07:34.814473 write(70,
>     "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>     284 <unfinished ...>
>     [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>     [pid  6065] 13:07:34.820960 write(70,
>     "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>     284 <unfinished ...>
>
>
>     Please help me stop this heartbeat storm 8(
>
>
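
To put a number on the heartbeat rate, the timestamps that strace -tt prints can be differenced. The following is a minimal Python sketch and is not part of the original report: the function name heartbeat_rate and the regular expression are assumptions made for illustration, and it assumes the trace does not cross midnight. It counts only lines that start a write() call, so the "<... write resumed>" continuation lines are ignored.

```python
import re
from datetime import datetime

# Matches strace -tt lines that begin a write() call, for example:
#   [pid  6065] 13:07:34.801106 write(70,
# Continuation lines such as "<... write resumed>" do not match.
WRITE_START = re.compile(
    r"\[pid\s+(\d+)\]\s+(\d{2}:\d{2}:\d{2}\.\d{6})\s+write\("
)


def heartbeat_rate(lines, pid):
    """Return (count, writes_per_second) for write() calls started by pid.

    writes_per_second is None when fewer than two calls are seen.
    Hypothetical helper: assumes all timestamps fall on the same day.
    """
    stamps = []
    for line in lines:
        match = WRITE_START.search(line)
        if match and int(match.group(1)) == pid:
            stamps.append(datetime.strptime(match.group(2), "%H:%M:%S.%f"))
    if len(stamps) < 2:
        return len(stamps), None
    span = (stamps[-1] - stamps[0]).total_seconds()
    rate = (len(stamps) - 1) / span if span > 0 else None
    return len(stamps), rate


# Timestamps copied from the trace above.
sample = [
    "[pid  6065] 13:07:34.801106 write(70,",
    "[pid  6065] 13:07:34.807968 write(70,",
    "[pid  6065] 13:07:34.808080 <... write resumed> ) = 284",
    "[pid  6065] 13:07:34.814473 write(70,",
    "[pid  6065] 13:07:34.820960 write(70,",
]

count, rate = heartbeat_rate(sample, 6065)
print(count, round(rate))
```

On the four writes shown in the trace this gives roughly 150 heartbeats per second from a single TaskTracker thread, which agrees with the "hundreds per second" observation.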

