From: Chris Nauroth
To: user@hadoop.apache.org
Subject: Re: Time until a datanode is marked as dead
Date: Mon, 26 Jan 2015 10:53:03 -0800
References: <54C1F603.8020304@sql-ag.de> <54C60B3C.9080702@sql-ag.de>

I believe all properties related to stale datanode configuration are
already covered in hdfs-default.xml, but
dfs.namenode.heartbeat.recheck-interval is definitely missing.

Frank, if you file the jira, then a nice benefit is that you'll get signed
up automatically for notifications on it when someone makes progress on it.

Chris Nauroth
Hortonworks
http://hortonworks.com/

On Mon, Jan 26, 2015 at 8:00 AM, Nicolas Liochon wrote:

> Note that there is a difference between being dead and being stale. Stale
> means "avoid as much as possible" while dead means "avoid absolutely AND
> initiate a recovery, i.e. copy all the data (typically 1 or more TB)".
>
> There is some info on this blog entry:
> http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/
>
> Cheers,
>
> Nicolas
>
> On Mon, Jan 26, 2015 at 10:46 AM, Azuryy Yu wrote:
>
>> Hi Frank,
>>
>> can you file an issue to add this configuration to the hdfs-default.xml?
>>
>> On Mon, Jan 26, 2015 at 5:39 PM, Frank Lanitz wrote:
>>
>>> Hi,
>>>
>>> On 23.01.2015 at 19:23, Chris Nauroth wrote:
>>> > The time period for determining if a datanode is dead is calculated
>>> > as a function of a few different configuration properties. The
>>> > current implementation in DatanodeManager.java does it like this:
>>> >
>>> >     final long heartbeatIntervalSeconds = conf.getLong(
>>> >         DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY,
>>> >         DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_DEFAULT);
>>> >     final int heartbeatRecheckInterval = conf.getInt(
>>> >         DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY,
>>> >         DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_DEFAULT);
>>> >         // 5 minutes
>>> >     this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval
>>> >         + 10 * 1000 * heartbeatIntervalSeconds;
>>>
>>> Good to know.
>>>
>>> > Under default configuration, dfs.namenode.heartbeat.recheck-interval
>>> > is 5 minutes and dfs.heartbeat.interval is 3 seconds. If we plug
>>> > those values into the formula, we get 10.5 minutes, which agrees with
>>> > your observation. If you change
>>> > dfs.namenode.heartbeat.recheck-interval to 2.5 minutes, then you'll
>>> > achieve an effective timeout of 5.5 minutes before a datanode is
>>> > marked dead.
>>> >
>>> > dfs.namenode.heartbeat.recheck-interval is not documented in
>>> > hdfs-default.xml, though I don't recall if that's an intentional
>>> > choice or just an oversight. The value of the property must be
>>> > expressed in milliseconds.
>>>
>>> This did the trick. Thank you very much. For testing purposes I've set
>>> it to 10000, and after approx. 45 s the node was marked as dead.
>>>
>>> Any chance of getting this into a documented property, so that possible
>>> behavior changes in future releases can be spotted before they reach
>>> the staging area?
>>>
>>> cheers,
>>> Frank
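
As a worked check of the formula quoted in this thread, here is a minimal Java sketch. It does not use Hadoop at all: the class name HeartbeatExpiry and the method expireIntervalMillis are hypothetical, and the two settings are passed in directly instead of being read from a Configuration object. It only reproduces the arithmetic for the three recheck-interval values mentioned above, with dfs.heartbeat.interval assumed to stay at its 3-second default.

```java
// Hypothetical helper, not part of Hadoop: reproduces the arithmetic of the
// DatanodeManager.java snippet quoted in this thread.
class HeartbeatExpiry {

    // recheckIntervalMillis   : dfs.namenode.heartbeat.recheck-interval (milliseconds)
    // heartbeatIntervalSeconds: dfs.heartbeat.interval (seconds)
    static long expireIntervalMillis(long recheckIntervalMillis,
                                     long heartbeatIntervalSeconds) {
        return 2 * recheckIntervalMillis + 10 * 1000 * heartbeatIntervalSeconds;
    }

    public static void main(String[] args) {
        long heartbeatSeconds = 3; // assumed default for dfs.heartbeat.interval

        // Default recheck interval of 5 minutes: 630000 ms = 10.5 minutes
        System.out.println(expireIntervalMillis(300_000L, heartbeatSeconds));

        // Recheck interval of 2.5 minutes: 330000 ms = 5.5 minutes
        System.out.println(expireIntervalMillis(150_000L, heartbeatSeconds));

        // Recheck interval of 10000 ms, as in Frank's test: 50000 ms = 50 seconds
        System.out.println(expireIntervalMillis(10_000L, heartbeatSeconds));
    }
}
```

The last value, 50 seconds, is in the same range as the roughly 45 seconds Frank reported for a recheck interval of 10000. The formula gives the expiry threshold only; when the namenode actually notices an expired datanode also depends on when its periodic check runs, which this sketch does not model.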