Date: Tue, 10 Dec 2013 22:20:12 +0000 (UTC)
From: "Chris Li (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

[ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844743#comment-13844743 ]

Chris Li commented on HADOOP-9640:
----------------------------------

bq. Add a new configuration in common called "hadoop.application.context" to HDFS. Other services that want to do the same thing can either use this same configuration or find another way to configure it. This information should be marshalled from the client to the server. The congestion control can be built based on that.

Just to be clear, would an example be:

1.
Cluster operator specifies ipc.8020.application.context = hadoop.yarn
2. Namenode sees this, and knows to load the class that generates job IDs from the Connection/Call?

Or were you thinking of physically adding the id into the RPC call itself? That would make the RPC call size larger, but it is a cleaner solution (albeit one that the client could spoof).

bq. Lets also make identities used for accounting configurable. They can be either based on "context", "user", "token", or "default". That way people who do not like the default configuration can make changes.

Sounds like a good idea.

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, faircallqueue5.patch, rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was overloaded and failed to respond.
> We can improve quality of service for users during namenode peak loads by replacing the FIFO call queue with a [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. (This plan supersedes rpc-congestion-control-draft-plan.)
> Excerpted from the communication of one incident: “The map task of a user was creating huge number of small files in the user directory.
> Due to the heavy load on NN, the JT also was unable to communicate with NN... The cluster became responsive only once the job was killed.”
> Excerpted from the communication of another incident: “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. The job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt… Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945: “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
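The configurable accounting identity discussed in the comment above ("context", "user", "token", or "default") could be sketched roughly as below. This is a hypothetical illustration only: the names IdentityProvider, UserIdentityProvider, and ContextIdentityProvider are assumptions for this sketch, not the API of the actual HADOOP-9640 patches.

```java
import java.util.Map;

/**
 * Hypothetical sketch of pluggable accounting identities for a fair call
 * queue. Class and method names here are illustrative assumptions, not
 * the actual HADOOP-9640 API.
 */
public class IdentitySketch {

    /** Maps an incoming RPC call's metadata to a scheduling identity. */
    interface IdentityProvider {
        String makeIdentity(Map<String, String> callMetadata);
    }

    /** The "user" policy: account by the calling user. */
    static class UserIdentityProvider implements IdentityProvider {
        public String makeIdentity(Map<String, String> call) {
            // Fall back to a shared bucket when no user is known.
            return call.getOrDefault("user", "default");
        }
    }

    /**
     * The "context" policy: account by a client-marshalled application
     * context (e.g. set via ipc.8020.application.context = hadoop.yarn).
     */
    static class ContextIdentityProvider implements IdentityProvider {
        public String makeIdentity(Map<String, String> call) {
            return call.getOrDefault("context", "default");
        }
    }

    public static void main(String[] args) {
        IdentityProvider byUser = new UserIdentityProvider();
        IdentityProvider byContext = new ContextIdentityProvider();
        // A call carrying both user and application-context metadata:
        Map<String, String> call = Map.of("user", "alice",
                                          "context", "hadoop.yarn");
        System.out.println(byUser.makeIdentity(call));     // alice
        System.out.println(byContext.makeIdentity(call));  // hadoop.yarn
    }
}
```

A scheduler would then bucket calls by whatever string the configured provider returns, which is what lets operators change the accounting granularity without changing the queue itself.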