Date: Mon, 16 Dec 2013 19:32:09 +0000 (UTC)
From: "Chris Li (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Subject: [jira] [Updated] (HADOOP-9640) RPC Congestion Control with FairCallQueue

     [ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-9640:
-----------------------------

    Attachment: faircallqueue6.patch

Uploaded new patch that adds configurable Call identity used for scheduling.
Config: ipc.8020.call.identity = USER or GROUP

In the future, this can be extended with more options.

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Xiaobo Peng
>              Labels: hdfs, qos, rpc
>         Attachments: MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, rpc-congestion-control-draft-plan.pdf
>
>
> Several production Hadoop cluster incidents occurred where the Namenode was overloaded and failed to respond.
> We can improve quality of service for users during namenode peak loads by replacing the FIFO call queue with a [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. (this plan supersedes rpc-congestion-control-draft-plan).
> Excerpted from the communication of one incident, “The map task of a user was creating huge number of small files in the user directory. Due to the heavy load on NN, the JT also was unable to communicate with NN...The cluster became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
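
To make the idea of a call identity used for scheduling concrete, here is a minimal sketch of how a USER or GROUP identity could key a fairer alternative to a FIFO call queue. It is not the code in faircallqueue6.patch: the names `call_identity` and `RoundRobinCallQueue`, the dictionary fields `user`, `group`, and `op`, and the plain round-robin policy are all assumptions chosen for illustration, and the patch's actual scheduling policy may differ.

```python
from collections import OrderedDict, deque


def call_identity(call, mode):
    """Return the scheduling identity of a call (hypothetical helper).

    `mode` mirrors the two values named in the update: USER or GROUP.
    """
    if mode == "USER":
        return call["user"]
    if mode == "GROUP":
        return call["group"]
    raise ValueError(f"unsupported identity mode: {mode}")


class RoundRobinCallQueue:
    """Illustrative queue that serves identities in turn instead of FIFO.

    Assumption: plain round-robin across identities, used here only to
    show why the choice of identity matters for fairness.
    """

    def __init__(self, mode="USER"):
        self.mode = mode
        # identity -> pending calls, ordered by whose turn comes next
        self.queues = OrderedDict()

    def put(self, call):
        ident = call_identity(call, self.mode)
        self.queues.setdefault(ident, deque()).append(call)

    def take(self):
        """Return the next call, or None when nothing is pending."""
        if not self.queues:
            return None
        ident, pending = next(iter(self.queues.items()))
        call = pending.popleft()
        # Move this identity to the back so other identities get a turn.
        del self.queues[ident]
        if pending:
            self.queues[ident] = pending
        return call


queue = RoundRobinCallQueue(mode="USER")
for i in range(3):
    queue.put({"user": "heavy", "group": "etl", "op": f"create-{i}"})
queue.put({"user": "light", "group": "adhoc", "op": "getFileInfo"})

served = [queue.take()["op"] for _ in range(4)]
print(served)
```

With a FIFO queue the single call from the second user would wait behind all three calls from the first user. In this sketch it is served second, which is the kind of isolation the incidents quoted above were missing.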