hadoop-common-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
Date Sat, 25 Jan 2014 00:27:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881556#comment-13881556 ]

Daryn Sharp commented on HADOOP-9640:
-------------------------------------

Agreed, this needs subtasks.  General comments/requests:
# Please make the default callq a {{BlockingQueue}} again, and have your custom implementations
conform to the interface.
# The default callq should remain a {{LinkedBlockingQueue}}, not a {{FIFOCallQueue}}.  You're
doing some pretty tricky locking and I'd rather trust the JDK.
# Call.getRemoteUser() would be much cleaner to get the UGI than an interface + enum to get
user and group.
# Using the literal string "unknown!" for a user or group is not a good idea.
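
To illustrate the first two requests, here is a minimal sketch, not code from any of the attached patches. It assumes a hypothetical {{MiniServer}} and {{Call}} (stand-ins, not Hadoop's classes) and uses the JDK's {{PriorityBlockingQueue}} as a placeholder for a custom implementation; it is not the FairCallQueue algorithm. The point is only that the server holds a plain {{BlockingQueue}}, defaults to {{LinkedBlockingQueue}}, and lets anything conforming to the interface be swapped in.

```java
import java.util.Comparator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch: none of these classes are Hadoop's own.
class CallQueueSketch {

    /** Stand-in for an RPC call; the fields are illustrative only. */
    static final class Call {
        final String user;
        final int priority; // lower value is served first in the priority variant

        Call(String user, int priority) {
            this.user = user;
            this.priority = priority;
        }
    }

    /** Stand-in for the server: it depends only on the BlockingQueue interface. */
    static final class MiniServer {
        private final BlockingQueue<Call> callQueue;

        /** Default: a bounded JDK LinkedBlockingQueue (capacity 100 is arbitrary). */
        MiniServer() {
            this(new LinkedBlockingQueue<Call>(100));
        }

        /** Alternate implementations plug in without further changes to the server. */
        MiniServer(BlockingQueue<Call> callQueue) {
            this.callQueue = callQueue;
        }

        /** Returns false instead of blocking when a bounded queue is full. */
        boolean enqueue(Call call) {
            return callQueue.offer(call);
        }

        /** Non-blocking read for the demo; a handler thread would call take(). */
        Call poll() {
            return callQueue.poll();
        }
    }

    /** Placeholder for a custom queue: an unbounded JDK queue ordered by priority. */
    static BlockingQueue<Call> priorityQueue() {
        return new PriorityBlockingQueue<Call>(
                16, Comparator.comparingInt((Call c) -> c.priority));
    }

    public static void main(String[] args) {
        MiniServer fifo = new MiniServer();
        fifo.enqueue(new Call("alice", 2));
        fifo.enqueue(new Call("bob", 1));
        System.out.println("default queue serves first: " + fifo.poll().user);

        MiniServer prio = new MiniServer(priorityQueue());
        prio.enqueue(new Call("alice", 2));
        prio.enqueue(new Call("bob", 1));
        System.out.println("priority queue serves first: " + prio.poll().user);
    }
}
```

With this shape the only server-side change is the constructor argument, so the default path keeps relying on the JDK's locking.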

The more I think about it, multiple queues will exacerbate the congestion problem, as Kihwal points
out.  For that reason, I'd like to see minimal invasiveness in the Server class - that way I'll feel
safe and you'll be free to experiment with alternate implementations.

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, faircallqueue7_with_runtime_swapping.patch, rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was overloaded and failed to respond.
> We can improve quality of service for users during namenode peak loads by replacing the FIFO call queue with a [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. (this plan supersedes rpc-congestion-control-draft-plan).
> Excerpted from the communication of one incident, “The map task of a user was creating huge number of small files in the user directory. Due to the heavy load on NN, the JT also was unable to communicate with NN...The cluster became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
