Date: Mon, 3 Jun 2019 23:00:00 +0000 (UTC)
From: "Erik Krogen (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-14090) RBF: Improved isolation for
 downstream name nodes.


    [ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855129#comment-16855129 ]

Erik Krogen commented on HDFS-14090:
------------------------------------

[~crh], I took a look at the design document and think your approach is very sensible. One issue I considered: if many clients start posting requests to subcluster A, the call queue on the router may fill up with A requests, degrading service to subcluster B. However, the queue should continue to drain quickly, because cluster B handlers will still be available to read those requests and reject them with {{StandbyException}}. So it would seem this should not be an issue.
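
A minimal sketch of why the queue keeps draining, assuming each subcluster gets a fixed number of handler permits and that acquisition never blocks. The class name {{SubclusterPermits}} and its methods are hypothetical and are not taken from the design document or the patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical per-subcluster permit table; not the actual router code. */
class SubclusterPermits {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
    private final int permitsPerSubcluster;

    SubclusterPermits(int permitsPerSubcluster) {
        this.permitsPerSubcluster = permitsPerSubcluster;
    }

    /** Non-blocking: returns false at once when all permits for the subcluster are taken. */
    boolean tryAcquire(String subcluster) {
        return permits
            .computeIfAbsent(subcluster, k -> new Semaphore(permitsPerSubcluster))
            .tryAcquire();
    }

    /** Must be called exactly once for every successful tryAcquire. */
    void release(String subcluster) {
        Semaphore s = permits.get(subcluster);
        if (s != null) {
            s.release();
        }
    }
}
```

Because {{tryAcquire}} returns immediately, a handler that dequeues a request for a saturated subcluster A can reject it right away instead of waiting, so requests for subcluster B behind it in the shared queue are still reached promptly.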

One thing I would prefer to see is an exception other than {{StandbyException}}. Though in practice it accomplishes the right purpose, it is semantically incorrect; a backoff exception is closer to the correct semantics.
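
As a rough illustration of what a dedicated backoff exception could look like, here is a sketch. The class name {{SubclusterBackoffException}} and the retry hint field are hypothetical; nothing in this thread says the patch defines such a type:

```java
import java.io.IOException;

/** Hypothetical exception; the name and the retry hint are illustrative only. */
class SubclusterBackoffException extends IOException {
    private static final long serialVersionUID = 1L;

    private final String subcluster;
    private final long retryAfterMillis;

    SubclusterBackoffException(String subcluster, long retryAfterMillis) {
        super("Subcluster " + subcluster + " is overloaded; retry after "
            + retryAfterMillis + " ms");
        this.subcluster = subcluster;
        this.retryAfterMillis = retryAfterMillis;
    }

    /** The overloaded subcluster the request was addressed to. */
    String getSubcluster() {
        return subcluster;
    }

    /** Suggested minimum wait before the client retries. */
    long getRetryAfterMillis() {
        return retryAfterMillis;
    }
}
```

The point of a separate type is that the message and fields say "overloaded, retry later" rather than "this node is in standby", which is what the client would otherwise be told.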

> RBF: Improved isolation for downstream name nodes.
> --------------------------------------------------
>
>                 Key: HDFS-14090
>                 URL: https://issues.apache.org/jira/browse/HDFS-14090
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: CR Hota
>            Assignee: CR Hota
>            Priority: Major
>         Attachments: HDFS-14090-HDFS-13891.001.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to the underlying name nodes. Gateway architectures should help minimize the impact of unhealthy clusters on clients connecting to healthy clusters.
> For example, if there are 2 name nodes downstream and one of them is heavily loaded, with calls spiking RPC queue times, back pressure will cause the same to start reflecting on the router. As a result, clients connecting to healthy/faster name nodes will also slow down, because the same RPC queue is maintained for all calls at the router layer. Essentially, the router uses the same IPC thread pool to connect to all name nodes.
> Currently the router uses a single RPC queue for all calls. Let's discuss how we can change the architecture and add some throttling logic for unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify the downstream name node, and maintain a separate queue for each underlying name node. Another, simpler way is to maintain some sort of rate limiter configured for each name node and let routers drop or reject requests, or return an error, after a certain threshold.
> This won’t be a simple change, as the router’s ‘Server’ layer would need redesign and implementation. Currently this layer is the same as the name node’s.
> Opening this ticket to discuss, design and implement this feature.
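
The first option in the description above (a separate queue per underlying name node) can be sketched roughly as follows. This is an illustration under assumptions, not the attached patch: the class {{PerNameNodeQueues}}, its methods, and the fixed per-queue capacity are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical dispatcher: one bounded queue per downstream name node. */
class PerNameNodeQueues<T> {
    private final Map<String, BlockingQueue<T>> queues = new ConcurrentHashMap<>();
    private final int capacity;

    PerNameNodeQueues(int capacity) {
        this.capacity = capacity;
    }

    /** Returns false (the caller rejects the call) when that name node's queue is full. */
    boolean dispatch(String nameNode, T call) {
        return queues
            .computeIfAbsent(nameNode, k -> new ArrayBlockingQueue<>(capacity))
            .offer(call);
    }

    /** Next queued call for the name node, or null if none is waiting. */
    T poll(String nameNode) {
        BlockingQueue<T> q = queues.get(nameNode);
        return q == null ? null : q.poll();
    }
}
```

With bounded queues, a slow name node can fill only its own queue; calls for other name nodes are still accepted, which is the isolation the ticket asks for.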

