hadoop-common-issues mailing list archives

From "Chris Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue
Date Tue, 10 Dec 2013 22:20:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844743#comment-13844743 ]

Chris Li commented on HADOOP-9640:
----------------------------------

bq. Add a new configuration in common called "hadoop.application.context" to HDFS. Other services
that want to do the same thing can either use this same configuration or find another way
to configure it. This information should be marshalled from the client to the server. The
congestion control can be built based on that.

Just to be clear, would an example be:
1. Cluster operator specifies ipc.8020.application.context = hadoop.yarn
2. Namenode sees this, knows to load the class that generates job IDs from the Connection/Call?
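
To make this first reading concrete, here is a minimal sketch in plain Java. It is not Hadoop code: the per-port key format ipc.<port>.application.context is taken from the example above, while ContextIdExtractor, the registry, and the "jobId" metadata field are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: none of these types are real Hadoop classes.
final class ContextLookupSketch {

  /** Hypothetical hook that derives an accounting id from call metadata. */
  interface ContextIdExtractor {
    String extract(Map<String, String> callMetadata);
  }

  // Hypothetical registry from a configured context name to an extractor.
  private static final Map<String, ContextIdExtractor> REGISTRY = new HashMap<>();

  static {
    // Assumes, for illustration only, that a job id is available under "jobId".
    REGISTRY.put("hadoop.yarn", meta -> meta.getOrDefault("jobId", "unknown"));
  }

  /** Returns the extractor configured for a port, or null if none is set or known. */
  static ContextIdExtractor forPort(Properties conf, int port) {
    String name = conf.getProperty("ipc." + port + ".application.context");
    return name == null ? null : REGISTRY.get(name);
  }
}
```

With ipc.8020.application.context = hadoop.yarn set, forPort(conf, 8020) would return the YARN extractor, and the server would derive the id itself rather than trusting a value sent by the client.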

Or were you thinking of physically adding the ID into the RPC call itself? That would make
the RPC call larger, but it is a cleaner solution (albeit one that the client could spoof).

bq. Let's also make identities used for accounting configurable. They can be either based on
"context", "user", "token", or "default". That way people who do not like the default configuration
can make changes.

Sounds like a good idea.
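
A rough sketch of what a configurable accounting identity might look like, again in plain Java rather than against any existing Hadoop API. The four scheme names come from the quote above; CallInfo, parse, and identityOf are hypothetical, and treating "default" as the user name (and falling back to the user when a field is missing) is my assumption, not something decided in this thread.

```java
import java.util.Locale;

// Hypothetical sketch: illustrates the idea only, not an existing Hadoop API.
final class AccountingIdentitySketch {

  /** The four identity schemes named in the proposal. */
  enum Scheme { CONTEXT, USER, TOKEN, DEFAULT }

  /** Minimal stand-in for what the server knows about a call. */
  static final class CallInfo {
    final String user;
    final String context;
    final String tokenId;

    CallInfo(String user, String context, String tokenId) {
      this.user = user;
      this.context = context;
      this.tokenId = tokenId;
    }
  }

  /** Parses a configured value; missing or unrecognized values map to DEFAULT. */
  static Scheme parse(String configured) {
    if (configured == null) {
      return Scheme.DEFAULT;
    }
    try {
      return Scheme.valueOf(configured.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      return Scheme.DEFAULT;
    }
  }

  /** Picks the identity to charge for a call under the given scheme. */
  static String identityOf(CallInfo call, Scheme scheme) {
    String id;
    switch (scheme) {
      case CONTEXT: id = call.context; break;
      case TOKEN:   id = call.tokenId; break;
      case USER:
      default:      id = call.user;    break; // assumption: DEFAULT means the user name
    }
    // Assumption: fall back to the user when the chosen field is absent.
    return id != null ? id : call.user;
  }
}
```

An operator who prefers per-job accounting would configure "context"; everyone else keeps the default without changing anything.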

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf,
faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, faircallqueue5.patch,
rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was overloaded
and failed to respond. 
> We can improve quality of service for users during namenode peak loads by replacing the
FIFO call queue with a [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf].
(This plan supersedes rpc-congestion-control-draft-plan.)
> Excerpted from the communication of one incident, “The map task of a user was creating
huge number of small files in the user directory. Due to the heavy load on NN, the JT also
was unable to communicate with NN...The cluster became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation
requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo
for a nonexistent file in an endless loop). All other requests to namenode were also affected
by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually
killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode,
for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
