Date: Tue, 10 Dec 2013 22:20:12 +0000 (UTC)
From: "Chris Li (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844743#comment-13844743 ]

Chris Li commented on HADOOP-9640:
----------------------------------

bq. Add a new configuration in common called "hadoop.application.context" to HDFS. Other services that want to do the same thing can either use this same configuration or find another way to configure it. This information should be marshalled from the client to the server. The congestion control can be built based on that.

Just to be clear, would an example be:

1.
Cluster operator specifies ipc.8020.application.context = hadoop.yarn
2. Namenode sees this, and knows to load the class that generates job IDs from the Connection/Call?

Or were you thinking of physically adding the id into the RPC call itself? That would make the RPC call size larger, but it is a cleaner solution (albeit one that the client could spoof).

bq. Lets also make identities used for accounting configurable. They can be either based on "context", "user", "token", or "default". That way people who do not like the default configuration can make changes.

Sounds like a good idea.

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, faircallqueue5.patch, rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was overloaded and failed to respond.
> We can improve quality of service for users during namenode peak loads by replacing the FIFO call queue with a [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. (This plan supersedes rpc-congestion-control-draft-plan.)
> Excerpted from the communication of one incident: “The map task of a user was creating huge number of small files in the user directory.
> Due to the heavy load on NN, the JT also was unable to communicate with NN... The cluster became responsive only once the job was killed.”
> Excerpted from the communication of another incident: “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. The job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt… Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945: “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
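The configurable accounting identity discussed in the comment above ("context", "user", "token", or "default") could be sketched roughly as below. This is a hypothetical illustration only: the names IdentityProvider, UserIdentityProvider, and ContextIdentityProvider are assumptions for this sketch, not the API of the actual HADOOP-9640 patches.

```java
import java.util.Map;

/**
 * Hypothetical sketch of pluggable accounting identities for a fair call
 * queue. Class and method names here are illustrative assumptions, not
 * the actual HADOOP-9640 API.
 */
public class IdentitySketch {

    /** Maps an incoming RPC call's metadata to a scheduling identity. */
    interface IdentityProvider {
        String makeIdentity(Map<String, String> callMetadata);
    }

    /** The "user" policy: account by the calling user. */
    static class UserIdentityProvider implements IdentityProvider {
        public String makeIdentity(Map<String, String> call) {
            // Fall back to a shared bucket when no user is known.
            return call.getOrDefault("user", "default");
        }
    }

    /**
     * The "context" policy: account by a client-marshalled application
     * context (e.g. set via ipc.8020.application.context = hadoop.yarn).
     */
    static class ContextIdentityProvider implements IdentityProvider {
        public String makeIdentity(Map<String, String> call) {
            return call.getOrDefault("context", "default");
        }
    }

    public static void main(String[] args) {
        IdentityProvider byUser = new UserIdentityProvider();
        IdentityProvider byContext = new ContextIdentityProvider();
        // A call carrying both user and application-context metadata:
        Map<String, String> call = Map.of("user", "alice",
                                          "context", "hadoop.yarn");
        System.out.println(byUser.makeIdentity(call));     // alice
        System.out.println(byContext.makeIdentity(call));  // hadoop.yarn
    }
}
```

A scheduler would then bucket calls by whatever string the configured provider returns, which is what lets operators change the accounting granularity without changing the queue itself.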