hadoop-common-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
Date Tue, 06 May 2014 05:37:15 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990308#comment-13990308 ]

Ming Ma commented on HADOOP-9640:
---------------------------------

Thanks, Chris.

1. The current approach drops the call when the RPC queue is full, and the client relies
on the RPC timeout. It will be interesting to confirm whether it is useful to have the RPC
server throw an exception back to the client so the client can do exponential back-off, or
perhaps to just block the RPC reader thread instead (a sketch of the back-off follows point 2 below).

2. The RPC-based approach doesn't account for HTTP requests such as WebHDFS. Based on some test
results, Jetty uses around 250 threads, which is small compared to the thousands of RPC handler
threads. a) Bad application traffic arriving via WebHDFS still impacts RPC latency, though not
as severely as in the pure RPC case. b) If there are SLA jobs that go through WebHDFS, RPC
throttling won't help them much.
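
To make the back-off idea in point 1 concrete, here is a minimal client-side sketch. It is
only an illustration: the exception type, retry bound, and delay constants are assumptions
for this sketch, not taken from the attached patches.

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical "call queue full" rejection from the server; illustrative only. */
class CallQueueFullException extends Exception {}

public class BackoffClient {
  private static final int MAX_RETRIES = 5;
  private static final long BASE_BACKOFF_MS = 100;

  /** Retry an RPC with exponential back-off instead of waiting out the RPC timeout. */
  public static <T> T invokeWithBackoff(Callable<T> rpcCall) throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        return rpcCall.call();
      } catch (CallQueueFullException e) {
        if (attempt >= MAX_RETRIES) {
          throw e; // give up after a bounded number of retries
        }
        // Delay doubles each attempt (100ms, 200ms, 400ms, ...) with random
        // jitter so backed-off clients don't all retry at the same instant.
        long backoff = BASE_BACKOFF_MS << attempt;
        Thread.sleep(backoff + ThreadLocalRandom.current().nextLong(backoff));
      }
    }
  }
}
{code}

The alternative of blocking the RPC reader thread would apply back-pressure at the connection
level instead, but the server has only a handful of reader threads multiplexing many
connections, so a blocked reader can stall unrelated clients.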

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>            Assignee: Chris Li
>              Labels: hdfs, qos, rpc
>         Attachments: FairCallQueue-PerformanceOnCluster.pdf, MinorityMajorityPerformance.pdf,
> NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch,
> faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, faircallqueue7_with_runtime_swapping.patch,
> rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was overloaded
> and failed to respond.
> We can improve quality of service for users during namenode peak loads by replacing the
> FIFO call queue with a [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
> (This plan supersedes rpc-congestion-control-draft-plan.)
> Excerpted from the communication of one incident, “The map task of a user was creating
> huge number of small files in the user directory. Due to the heavy load on NN, the JT also
> was unable to communicate with NN...The cluster became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation
> requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo
> for a nonexistent file in an endless loop). All other requests to namenode were also affected
> by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually
> killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode,
> for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”
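
As a rough illustration of the multi-level idea in the plan above, here is a toy sketch:
calls from heavy callers are scheduled into lower-priority levels, and a weighted
round-robin multiplexer drains the levels. The class name, the demotion rule, and the
weights are made up for this sketch and are far simpler than the attached patches
(which, among other things, support runtime swapping).

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Toy multi-level call queue. Heavy callers are scheduled into lower-priority
 * levels, and a weighted round-robin multiplexer drains the levels so light
 * callers keep getting served even when one caller floods the server.
 */
public class ToyFairCallQueue<E> {
  private final BlockingQueue<E>[] levels;  // index 0 = highest priority
  private final int[] weights;              // calls drained per level per round
  private final ConcurrentHashMap<String, AtomicLong> callCounts = new ConcurrentHashMap<>();
  private int currentLevel = 0;
  private int drawsLeft;

  @SuppressWarnings("unchecked")
  public ToyFairCallQueue(int numLevels, int capacityPerLevel, int[] weights) {
    levels = new BlockingQueue[numLevels];
    for (int i = 0; i < numLevels; i++) {
      levels[i] = new ArrayBlockingQueue<>(capacityPerLevel);
    }
    this.weights = weights;
    this.drawsLeft = weights[0];
  }

  /** Enqueue a call; callers with many recent calls land in lower levels. */
  public boolean offer(String caller, E call) {
    long n = callCounts.computeIfAbsent(caller, k -> new AtomicLong()).incrementAndGet();
    // Toy scheduler: demote a caller one level per 100 calls. A real scheduler
    // would decay the counts over time so callers can regain priority.
    int level = (int) Math.min(levels.length - 1, n / 100);
    return levels[level].offer(call);  // false = level full; caller should back off
  }

  /** Weighted round-robin poll across the levels. */
  public synchronized E poll() {
    for (int scanned = 0; scanned <= levels.length; scanned++) {
      if (drawsLeft == 0) {
        advanceLevel();
      }
      E call = levels[currentLevel].poll();
      if (call != null) {
        drawsLeft--;
        return call;
      }
      advanceLevel();  // level empty: move on so lower levels are not starved
    }
    return null;  // all levels empty
  }

  private void advanceLevel() {
    currentLevel = (currentLevel + 1) % levels.length;
    drawsLeft = weights[currentLevel];
  }
}
{code}

With, say, three levels and weights {4, 2, 1}, a caller that floods the queue is demoted,
yet the handlers still spend most of their pulls on well-behaved callers in the top level.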



--
This message was sent by Atlassian JIRA
(v6.2#6252)
