hadoop-hdfs-issues mailing list archives

From "Sanjay Radia (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-599) Improve Namenode robustness by prioritizing datanode heartbeats over client requests
Date Tue, 18 May 2010 17:20:49 GMT

    [ https://issues.apache.org/jira/browse/HDFS-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868745#action_12868745 ]

Sanjay Radia commented on HDFS-599:
-----------------------------------

Having all protocols served on all ports is strange and not very standard practice.
However, I do agree with the use cases: during startup, or during a period of high load
on the NN, an Admin may want to issue a standard NN operation and ensure that it gets served
promptly, and perhaps with priority.
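
To make the separate-endpoint idea concrete, here is a minimal sketch of pointing
datanode/admin traffic at a second NN port. The dfs.namenode.servicerpc-address key is an
assumption here (borrowed from where this line of work appears to be heading), not a
committed design:

{code}
// Sketch only: regular clients keep the default port, while datanodes and
// admin tools use a dedicated service port so their RPCs are never queued
// behind a flood of client requests.
import org.apache.hadoop.conf.Configuration;

public class SeparateServicePort {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Client-facing endpoint (unchanged).
    conf.set("fs.default.name", "hdfs://nn.example.com:8020");
    // Assumed key: a second RPC server bound to its own port for
    // DatanodeProtocol and admin traffic.
    conf.set("dfs.namenode.servicerpc-address", "nn.example.com:8021");
  }
}
{code}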

I agree with Dhruba that breaking the client protocol into two parts is questionable and,
IMHO, architecturally not clean (imagine explaining to someone why we split the client
protocol into two parts).

There are two solutions here. One is to give priority to certain users (this is very complex
and I don't recommend doing it). The other is to extend Hadoop's existing Service ACL: the
service ACL specifies the protocols and the list of users and groups that are allowed to access
each protocol. I suggest a separate JIRA to extend the Service ACL to optionally specify a
port in addition to the protocol name. Dmytro, I request that you also complete this other
JIRA independently, in the spirit of providing a clean, comprehensive solution to the problem
of multiple protocols on multiple ports.
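
For reference, a sketch of what the extension might look like next to today's form. The
port-qualified key below is invented purely for illustration; the existing keys are the real
ones from hadoop-policy.xml:

{code}
// Sketch: today's service-level authorization, plus a hypothetical
// port-qualified variant. Configuration.set is used only to keep the
// example self-contained; in practice these entries live in hadoop-policy.xml.
import org.apache.hadoop.conf.Configuration;

public class ServiceAclSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("hadoop.security.authorization", true);
    // Existing form: "user1,user2 group1,group2" allowed on ClientProtocol.
    conf.set("security.client.protocol.acl", "hdfs supergroup");
    // Hypothetical extension: the same ACL restricted to one listener port
    // (key name invented for illustration).
    conf.set("security.client.protocol.acl.port.8021", "hdfs supergroup");
  }
}
{code}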



> Improve Namenode robustness by prioritizing datanode heartbeats over client requests
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-599
>                 URL: https://issues.apache.org/jira/browse/HDFS-599
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Dmytro Molkov
>         Attachments: HDFS-599.patch
>
>
> The namenode processes RPC requests from clients that are reading/writing to files, as
> well as heartbeats/block reports from datanodes.
> Sometimes, for various reasons (Java GC runs, inconsistent performance of the NFS filer
> that stores HDFS transaction logs, etc.), the namenode encounters transient slowness. For
> example, if the device that stores the HDFS transaction logs becomes sluggish, the Namenode's
> ability to process RPCs slows down to a certain extent. During this time, the RPCs from
> clients as well as the RPCs from datanodes suffer in similar fashion. If the underlying
> problem becomes worse, the NN's ability to process a heartbeat from a DN is severely
> impacted, thus causing the NN to declare that the DN is dead. Then the NN starts replicating
> blocks that used to reside on the now-declared-dead datanode. This adds extra load to the
> NN. Then the now-declared-dead datanode finally re-establishes contact with the NN and sends
> a block report. The block report processing on the NN is another heavyweight activity, thus
> causing more load on the already overloaded namenode.
> My proposal is that the NN should try its best to continue processing RPCs from datanodes
> and give lower priority to serving client requests. The Datanode RPCs are integral to the
> consistency and performance of the Hadoop file system, and it is better to protect them at
> all costs. This will ensure that the NN recovers from the hiccup much faster than it does now.
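
A minimal sketch of the bias the description asks for, assuming a priority queue in front of
the handler threads. The Call class and protocolName field below are stand-ins for
illustration, not Hadoop's actual ipc.Server internals:

{code}
// Sketch: drain DatanodeProtocol calls (heartbeats, block reports) before
// ClientProtocol calls, preserving FIFO order within each class.
import java.util.concurrent.PriorityBlockingQueue;

public class PrioritizedCallQueue {
  static class Call implements Comparable<Call> {
    final String protocolName;   // e.g. "DatanodeProtocol", "ClientProtocol"
    final long arrivalNanos = System.nanoTime();
    Call(String protocolName) { this.protocolName = protocolName; }

    int priority() {
      // Lower value = served first. Datanode traffic outranks clients.
      return "DatanodeProtocol".equals(protocolName) ? 0 : 1;
    }

    @Override
    public int compareTo(Call other) {
      int byPriority = Integer.compare(priority(), other.priority());
      // Within a priority class, preserve arrival (FIFO) order.
      return byPriority != 0 ? byPriority
                             : Long.compare(arrivalNanos, other.arrivalNanos);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    PriorityBlockingQueue<Call> queue = new PriorityBlockingQueue<Call>();
    queue.put(new Call("ClientProtocol"));
    queue.put(new Call("DatanodeProtocol"));
    // The datanode call reaches a handler thread first.
    System.out.println(queue.take().protocolName); // DatanodeProtocol
  }
}
{code}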

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

