hadoop-mapreduce-user mailing list archives

From Roland von Herget <roland.von.her...@gmail.com>
Subject Re: Please help me with heartbeat storm
Date Sat, 25 May 2013 15:44:16 GMT
Hi Alexey,

I don't know the solution to this problem, but I can second it; I'm seeing
nearly the same thing:
My TaskTrackers are flooding the JobTracker with heartbeats. This starts
after the first MapReduce job and can be cleared by restarting the
TaskTracker.
The TT nodes show high system CPU usage; the JT is not suffering from this.
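
For what it's worth, something along these lines (just a rough check; the
pid is a placeholder for the affected TaskTracker JVM, and the timestamp
grep is crude) buckets the heartbeat writes per second, building on the
strace command from Alexey's mail below:

# count heartbeat RPC writes per second on a suspect TaskTracker
# (<tasktracker-jvm-pid> is a placeholder; -e trace=write cuts down the noise)
sudo strace -tt -f -e trace=write -s 200 -p <tasktracker-jvm-pid> 2>&1 \
  | grep heartbeat \
  | grep -o '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]' \
  | uniq -c

A healthy TT should only show a heartbeat every few seconds here.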

my environment:
debian 6.0.7
hadoop 1.0.4
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

What's your environment?

--Roland


On Fri, May 24, 2013 at 3:10 PM, Eremikhin Alexey <a.eremihin@corp.badoo.com> wrote:

> Hi all,
> I have a 29-server Hadoop cluster in an almost-default configuration.
> After installing Hadoop 1.0.4 I've noticed that the JT and some TTs waste CPU.
> I started stracing their behaviour and found that some TTs send heartbeats
> at an unbounded rate, i.e. hundreds per second.
>
> Restarting the daemon solves the issue, but even the simplest Hive MR job
> brings it back.
>
> Here is the filtered strace of the heartbeating process:
>
> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1 | grep 6065 | grep write
>
>
> [pid  6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 284) = 284
> [pid  6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 284 <unfinished ...>
> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
> [pid  6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 284 <unfinished ...>
> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
> [pid  6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 284 <unfinished ...>
>
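> (fd 70 in the writes above is presumably the TT's IPC connection to the
> JobTracker; if in doubt, the endpoint behind that descriptor can be checked
> with something like the following, reusing the pid from the strace and
> assuming lsof is installed:
>
> hadoop9.mlan:~$ sudo lsof -a -p 6032 -d 70
>
> which should print that single descriptor as a TCP connection with the
> JobTracker's host:port as the remote end.)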
>
> Please help me stop this heartbeat storm 8(
>
>
