Date: Thu, 1 Aug 2019 10:23:00 +0000 (UTC)
From: "Jinglun (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

[ https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897945#comment-16897945 ]

Jinglun commented on HADOOP-16403:
----------------------------------

About the shadedclient error: I searched [patch-shadedclient.txt|https://builds.apache.org/job/PreCommit-HADOOP-Build/16437/artifact/out/patch-shadedclient.txt] and found this:
{quote}[ERROR] Found artifact with unexpected contents: '/testptch/hadoop/hadoop-client-modules/hadoop-client-api/target/hadoop-client-api-3.3.0-SNAPSHOT.jar'
Please check the following and either correct the build or update the allowed list with reasoning.
core-default.xml.orig
{quote}
There is a jar check in *_./hadoop-client-modules/hadoop-client-check-invariants/src/test/resources/ensure-jars-have-correct-contents.sh_*, and it seems core-default.xml.orig was packaged into hadoop-client-api-3.3.0-SNAPSHOT.jar.

I'm not sure how this happened. I made a new patch from the latest trunk and fixed the checkstyle issues. I have uploaded patch-005 to see whether the shadedclient error still occurs.

> Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable
> ---------------------------------------------------------------------------------------------------
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, HADOOP-16403.004.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite large, so after the active NameNode dies, it takes the standby more than 40s to become active. Many requests (tcp connect requests and rpc requests) from DataNodes, clients and zkfc timed out and started retrying.
> The sudden request flood lasts for the next 2 minutes, until all requests have either been handled or have run out of retries.
> Adjusting the rpc-related settings might strengthen the NameNode and solve this problem; the key point is finding the bottleneck. The rpc server can be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got ConnectTimeoutException. It is caused by a tcp connect request that went unanswered for 20s. I think the reader queue may be full and blocking the listener from handling new connections. Both slow handlers and slow readers can block the whole processing pipeline, and I need to know which one it is. I think *a queue that computes the qps, writes a log when the queue is full, and can be replaced easily* will help.
> I found the nice work in HADOOP-10302, which implements a runtime-swappable queue. Using it for the Reader's queue makes the reader queue runtime-swappable automatically. The qps computation could be done by implementing a subclass of LinkedBlockingQueue that does the counting whenever put/take/... happens. The qps data will be exposed via jmx.
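
The queue idea in the issue description (a LinkedBlockingQueue subclass that counts while put/take happen) might look roughly like the Java sketch below. This is not the code from the attached patches: the class name CountingQueue, its counters, and the ratePerSecond helper are hypothetical names chosen for illustration, and the sketch leaves out the jmx export and the full-queue logging.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch of a counting queue; not the HADOOP-16403 patch.
 * Only put/offer(e)/take/poll() are counted. The timed offer/poll
 * variants and drainTo are not overridden here and so are not counted.
 */
class CountingQueue<E> extends LinkedBlockingQueue<E> {
  private final LongAdder added = new LongAdder();
  private final LongAdder removed = new LongAdder();
  private final LongAdder rejected = new LongAdder();

  CountingQueue(int capacity) {
    super(capacity);
  }

  @Override
  public void put(E e) throws InterruptedException {
    super.put(e);            // blocks while the queue is full
    added.increment();
  }

  @Override
  public boolean offer(E e) {
    boolean accepted = super.offer(e);
    if (accepted) {
      added.increment();
    } else {
      rejected.increment();  // queue was full; a real version might log here
    }
    return accepted;
  }

  @Override
  public E take() throws InterruptedException {
    E e = super.take();      // blocks while the queue is empty
    removed.increment();
    return e;
  }

  @Override
  public E poll() {
    E e = super.poll();
    if (e != null) {
      removed.increment();
    }
    return e;
  }

  long addedCount() { return added.sum(); }
  long removedCount() { return removed.sum(); }
  long rejectedCount() { return rejected.sum(); }

  /** Converts a counter delta over an elapsed interval into a per-second rate. */
  static double ratePerSecond(long countDelta, long elapsedNanos) {
    if (elapsedNanos <= 0) {
      return 0.0;
    }
    return countDelta * 1e9 / elapsedNanos;
  }
}
```

A caller could sample addedCount() and removedCount() periodically and pass the deltas to ratePerSecond to get the enqueue and dequeue qps. Comparing the two rates, together with rejectedCount(), would hint at whether the readers or the handlers are the slow stage.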