hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6180) dead node count / listing is very broken in JMX and old GUI
Date Thu, 17 Apr 2014 22:33:15 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973513#comment-13973513 ]

Colin Patrick McCabe commented on HDFS-6180:
--------------------------------------------

Hi Haohui,

See the discussion at HDFS-5237 for some background.  Basically, there is a configuration
key called {{dfs.datanode.hostname}} which specifies a datanode's "registration name."  This may
be different from the first hostname you get by doing a reverse lookup on the DataNode's IP
address.

That's why DatanodeID has three fields instead of two:
{code}
public class DatanodeID implements Comparable<DatanodeID> {
  public static final DatanodeID[] EMPTY_ARRAY = {};

  private String ipAddr;     // IP address
  private String hostName;   // hostname claimed by datanode
  private String peerHostName; // hostname from the actual connection
  // ... remaining fields and methods omitted
}
{code}

The field named {{hostName}} is actually not the hostname, but the "registration name," which
is what the datanode was configured to say its name was, via {{dfs.datanode.hostname}}.  {{peerHostName}}
is the hostname you get by doing a reverse DNS lookup on {{ipAddr}}.
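
As a rough illustration of how the three values can disagree, here is a minimal sketch.  {{DatanodeNames}} is a hypothetical stand-in written for this comment, not the real DatanodeID class, and the addresses and hostnames are made up.

```java
// Hypothetical stand-in for illustration only; this is NOT
// org.apache.hadoop.hdfs.protocol.DatanodeID.
final class DatanodeNames {
  final String ipAddr;        // IP address
  final String hostName;      // registration name (what dfs.datanode.hostname says)
  final String peerHostName;  // hostname from a reverse lookup on ipAddr

  DatanodeNames(String ipAddr, String hostName, String peerHostName) {
    this.ipAddr = ipAddr;
    this.hostName = hostName;
    this.peerHostName = peerHostName;
  }

  // True when the configured registration name is not what reverse DNS reports.
  // Hostnames are compared case-insensitively.
  boolean registrationNameDiffers() {
    return !hostName.equalsIgnoreCase(peerHostName);
  }

  public static void main(String[] args) {
    // Made-up values: the DN registers under its internal name, while a
    // reverse lookup on its IP yields the external name.
    DatanodeNames dn = new DatanodeNames(
        "10.0.0.5", "dn1.internal.host.name", "dn1.external.host.name");
    System.out.println("registration name differs: " + dn.registrationNameDiffers());
  }
}
```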

One use for registration names is in unit tests, where creating a new hostname is
not practical.  Another use is in dealing with multi-homing setups.

bq. The reason why I removed this test is that -registration-name- is not a valid DNS name.

The point of the test was to ensure that we could specify registration names in the exclude
and include files and have them work.  We should make sure that this functionality is still
working.
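
To make the expected behavior concrete, here is a minimal sketch of the kind of matching I mean.  {{HostListSketch}} and its matching rule are assumptions made for illustration; this is not the NameNode's actual include/exclude handling, which may differ in details such as port handling.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of include/exclude matching; not the real NameNode code.
final class HostListSketch {
  private final Set<String> entries = new HashSet<>();

  // Each argument stands for one line of an include or exclude file.
  HostListSketch(String... lines) {
    for (String line : lines) {
      entries.add(line.trim().toLowerCase(Locale.ROOT));
    }
  }

  // Assumed rule: an entry matches a datanode if it equals the IP address,
  // the registration name, or the reverse-lookup hostname.
  boolean matches(String ipAddr, String registrationName, String peerHostName) {
    return contains(ipAddr) || contains(registrationName) || contains(peerHostName);
  }

  private boolean contains(String name) {
    return name != null && entries.contains(name.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    // Made-up exclude file that lists a DN by its registration name.
    HostListSketch excludes = new HostListSketch("dn1.internal.host.name");
    System.out.println(excludes.matches(
        "10.0.0.5", "dn1.internal.host.name", "dn1.external.host.name"));
  }
}
```

The property the removed test was guarding is the middle clause: an entry that only matches the registration name should still take effect.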

This is a real problem for some people.  For example, suppose you have an AWS instance
with both an external and an internal hostname.  You might configure your DNs to use {{dn1.internal.host.name}}
(or whatever) rather than {{dn1.external.host.name}}.  This avoids the issue where the NN
does a reverse DNS lookup on the IP, comes up with {{dn1.external.host.name}}, and starts
sending traffic over the wrong interface.  This sort of thing is very important on AWS, because
people are actually charged money for sending traffic to the external hostname (rather than
the internal one).

If you like, the test could be configured to use a valid but non-default loopback IP (such
as 127.0.5.1) rather than an invalid string.  But in any case, I think we need a JIRA to restore
it.  Will file one shortly.
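
For what it's worth, here is a small sketch showing that an address like 127.0.5.1 is treated as loopback by the JDK.  {{LoopbackCheck}} is a hypothetical helper name; the only library calls are {{InetAddress.getByName}} and {{isLoopbackAddress}}.  Whether the OS actually lets a test bind to that address depends on the platform, so treat this as a sanity check rather than a guarantee.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration; not part of the HDFS test suite.
final class LoopbackCheck {
  // Returns true if the given IP literal is a loopback address.
  // With an IP literal, getByName only validates the format of the address.
  static boolean isLoopback(String ipLiteral) {
    try {
      return InetAddress.getByName(ipLiteral).isLoopbackAddress();
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException("not a valid address: " + ipLiteral, e);
    }
  }

  public static void main(String[] args) {
    System.out.println("127.0.5.1 loopback? " + isLoopback("127.0.5.1"));
  }
}
```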

> dead node count / listing is very broken in JMX and old GUI
> -----------------------------------------------------------
>
>                 Key: HDFS-6180
>                 URL: https://issues.apache.org/jira/browse/HDFS-6180
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Travis Thompson
>            Assignee: Haohui Mai
>            Priority: Blocker
>             Fix For: 2.5.0
>
>         Attachments: HDFS-6180.000.patch, HDFS-6180.001.patch, HDFS-6180.002.patch, HDFS-6180.003.patch, HDFS-6180.004.patch, dn.log
>
>
> After bringing up a 578 node cluster with 13 dead nodes, 0 were reported on the new GUI, but showed up properly in the datanodes tab.  Some nodes are also being double reported in the deadnode and inservice section (22 show up dead, 565 show up alive, 9 duplicated nodes).

> From /jmx (confirmed that it's the same in jconsole):
> {noformat}
> {
>     "name" : "Hadoop:service=NameNode,name=FSNamesystemState",
>     "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
>     "CapacityTotal" : 5477748687372288,
>     "CapacityUsed" : 24825720407,
>     "CapacityRemaining" : 5477723861651881,
>     "TotalLoad" : 565,
>     "SnapshotStats" : "{\"SnapshottableDirectories\":0,\"Snapshots\":0}",
>     "BlocksTotal" : 21065,
>     "MaxObjects" : 0,
>     "FilesTotal" : 25454,
>     "PendingReplicationBlocks" : 0,
>     "UnderReplicatedBlocks" : 0,
>     "ScheduledReplicationBlocks" : 0,
>     "FSState" : "Operational",
>     "NumLiveDataNodes" : 565,
>     "NumDeadDataNodes" : 0,
>     "NumDecomLiveDataNodes" : 0,
>     "NumDecomDeadDataNodes" : 0,
>     "NumDecommissioningDataNodes" : 0,
>     "NumStaleDataNodes" : 1
>   },
> {noformat}
> I'm not going to include deadnode/livenodes because the list is huge, but I've confirmed there are 9 nodes showing up in both deadnodes and livenodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
