From: Eremikhin Alexey <a.eremihin@corp.badoo.com>
Date: Thu, 30 May 2013 23:02:40 +0400
To: user@hadoop.apache.org
Subject: Re: Please help me with heartbeat storm

The same has helped me.
Thanks a lot!!

On 30.05.2013 17:00, Roland von Herget wrote:
Hi Philippe,

thanks a lot, that's the solution. I've disabled mapreduce.tasktracker.outofband.heartbeat and now everything is fine!

Thanks again,
Roland


On Wed, May 29, 2013 at 4:00 PM, Philippe Signoret <philippe.signoret@gmail.com> wrote:
This might be relevant: https://issues.apache.org/jira/browse/MAPREDUCE-4478

"There are two configuration items to control the TaskTracker's heartbeat interval. One is mapreduce.tasktracker.outofband.heartbeat. The other ismapreduce.tasktracker.outofband.heartbeat.damper. If we set mapreduce.tasktracker.outofband.heartbeat with true and setmapreduce.tasktracker.outofband.heartbeat.damper with default value (1000000), TaskTracker may send heartbeat without any interval."

Philippe

-------------------------------
Philippe Signoret


On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan <rajesh.balamohan@gmail.com> wrote:
The default value of CLUSTER_INCREMENT is 100, so Math.max(1000 * 29 / 100, 3000) = 3000 always. This is why you are seeing so many heartbeats. You might want to set it to 1 or 5. This would increase the interval between heartbeats sent from the TT to the JT.
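If I remember correctly, the setting behind CLUSTER_INCREMENT is mapred.heartbeats.in.second on the JobTracker side; that property name is an assumption on my part, so please double-check it in your mapred-default.xml. A sketch with the suggested value:

<!-- mapred-site.xml on the JobTracker (sketch). With 29 TaskTrackers this
     gives Math.max(1000 * 29 / 1, 3000) = 29000 ms between heartbeats
     instead of the 3000 ms floor computed above. -->
<property>
 <name>mapred.heartbeats.in.second</name>
 <value>1</value>
</property>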


~Rajesh.B


On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <a.eremihin@corp.badoo.com> wrote:
Hi!

Tried 5 seconds. Fewer nodes get into the storm, but some still do.
Additionally, updating the ntp service helped a little.

Initially almost 50% got into storming on each MR job. But after the ntp update and increasing the heartbeat to 5 seconds, the level is around 10%.


On 26/05/13 10:43, murali adireddy wrote:
Hi,

Just try this one.

In the file "hdfs-site.xml", try adding the property "dfs.heartbeat.interval" with a value in seconds.

The default value is '3' seconds. In your case, increase the value.

<property>
 <name>dfs.heartbeat.interval</name>
 <value>3</value>
</property>

You can find more properties and default values at the link below:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

Please let me know if the above solution worked for you.




On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <a.eremihin@corp.badoo.com> wrote:
Hi all,
I have a 29-server Hadoop cluster in an almost default configuration.
After installing Hadoop 1.0.4, I noticed that the JT and some TTs waste CPU.
I started stracing their behaviour and found that some TTs send heartbeats without any limit.
That means hundreds per second.

A daemon restart solves the issue, but even the simplest Hive MR job brings it back.

Here is the filtered strace of the heartbeating process:

hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 | grep write


[pid  6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 284) = 284
[pid  6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 284 <unfinished ...>
[pid  6065] 13:07:34.808080 <... write resumed> ) = 284
[pid  6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 284 <unfinished ...>
[pid  6065] 13:07:34.814595 <... write resumed> ) = 284
[pid  6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 284 <unfinished ...>


Please help me to stop this storming 8(






--
~Rajesh.B


