From: "Vahid S Hashemian"
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-152 - Improve diagnostics for SASL authentication failures
Date: Fri, 18 Aug 2017 09:56:36 -0700
Hi Rajini,

Thanks for the KIP. It looks good to me.
It would be great if it can make it to 1.0.0.

--Vahid


From: Rajini Sivaram
To: dev
Date: 08/15/2017 01:23 PM
Subject: Re: [DISCUSS] KIP-152 - Improve diagnostics for SASL authentication failures

I have updated the KIP based on the discussions so far. It will be good if
we can get some more feedback so that this can be implemented for 1.0.0.

Thanks,

Rajini

On Thu, May 4, 2017 at 10:22 PM, Ismael Juma wrote:

> Hi Rajini,
>
> I think we were talking about slightly different things. I was just
> referring to the fact that there are cases where we throw an
> AuthorizationException back to the user without retrying from various
> methods (poll, commitSync, etc.).
>
> As you said, my initial preference was for not retrying at all because it
> is what you want in the common case of a misconfigured application. I
> hadn't considered credential updates for authenticators that rely on
> eventual consistency. Thinking about it some more, it seems like this
> should be solved by the authenticator implementation as well. For
> example, it could refresh the cached data for a user if authentication
> failed (a good implementation would be a bit more involved, to avoid
> going to the underlying data source too often).
>
> Given that, not retrying sounds good to me.
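[Editorial note: the refresh-on-failure idea Ismael describes above can be sketched as a small broker-side credential cache. This is an illustrative sketch only, under the assumption of a hypothetical `CredentialStore` backing interface; the names here are made up and this is not Kafka's actual implementation.]

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a credential cache that reloads a user's entry when that user
// fails authentication, but rate-limits reloads so a misbehaving client
// cannot hammer the underlying data source. Illustrative only.
public class CredentialCache {
    /** Hypothetical backing store, e.g. an LDAP or database lookup. */
    public interface CredentialStore {
        String fetch(String user);
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, Long> lastRefresh = new ConcurrentHashMap<>();
    private final CredentialStore store;
    private final long minRefreshIntervalMs;

    public CredentialCache(CredentialStore store, long minRefreshIntervalMs) {
        this.store = store;
        this.minRefreshIntervalMs = minRefreshIntervalMs;
    }

    /** Normal lookup path: load lazily, then serve from cache. */
    public String credentialsFor(String user) {
        return cache.computeIfAbsent(user, store::fetch);
    }

    /**
     * Called when a client fails authentication: go back to the source in
     * case the credentials were recently updated, but at most once per
     * interval. Returns true if a refresh was actually performed.
     */
    public boolean onAuthenticationFailure(String user, long nowMs) {
        Long last = lastRefresh.get(user);
        if (last != null && nowMs - last < minRefreshIntervalMs)
            return false; // too soon; keep the cached entry
        lastRefresh.put(user, nowMs);
        cache.put(user, store.fetch(user));
        return true;
    }
}
```

The rate limit is the "bit more involved" part: without it, a client looping on bad credentials would turn every failed handshake into a query against the underlying store.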
>
> Ismael
>
> On Thu, May 4, 2017 at 4:04 PM, Rajini Sivaram wrote:
>
> > Hi Ismael,
> >
> > I thought the blocking waits in the producer and consumer are always
> > related to retrying for metadata. So an authorization exception that
> > impacts this wait can only be due to a Describe authorization failure -
> > and that always retries?
> >
> > I agree that connecting to different brokers when authentication fails
> > with one is not desirable. But I am not keen on retrying with a
> > suitable backoff until timeout either, because that has the same
> > problem as the scenario that you described: the next metadata request
> > could be to broker-1, to which authentication succeeds, and a
> > subsequent produce/consume to broker-0 could still fail.
> >
> > How about we just fail fast if one authentication fails - I think that
> > is what you were suggesting in the first place? We don't need to black
> > out any nodes beyond the reconnect backoff interval. Applications can
> > still retry if they want to. In the case of credential updates, it
> > will be up to the application to retry. During regular operation, a
> > misconfigured application fails fast with a meaningful exception.
> > What do you think?
> >
> > Regards,
> >
> > Rajini
> >
> > On Thu, May 4, 2017 at 3:01 PM, Ismael Juma wrote:
> >
> > > Hi Rajini,
> > >
> > > Comments inline.
> > >
> > > On Thu, May 4, 2017 at 2:29 PM, Rajini Sivaram
> > > <rajinisivaram@gmail.com> wrote:
> > >
> > > > Hi Ismael,
> > > >
> > > > Thank you for reviewing the KIP.
> > > >
> > > > An authenticated client that is not authorized to access a topic
> > > > is never told that the operation was not authorized. This is to
> > > > prevent the client from finding out if the topic exists by sending
> > > > an unauthorized request. So in this case, the client will retry
> > > > metadata requests with the configured backoff until it times out.
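[Editorial note: the fail-fast contract Rajini proposes above - the client surfaces an authentication failure immediately and any retry is left to the application - can be sketched as follows. The exception and callback types are stand-ins, not the real Kafka classes.]

```java
// Sketch: the client fails fast on an authentication error instead of
// retrying internally; the application decides whether to retry (e.g. to
// give updated credentials a chance to propagate). Illustrative only.
public class FailFastRetry {
    public static class AuthenticationFailedException extends RuntimeException {
        public AuthenticationFailedException(String msg) { super(msg); }
    }

    /** Models a client call that may fail authentication, e.g. a poll. */
    public interface ClientCall<T> {
        T call() throws AuthenticationFailedException;
    }

    /**
     * Application-side retry: attempt the call, and on an authentication
     * failure retry up to maxRetries times before rethrowing. The client
     * itself never retries; this loop is entirely the application's choice.
     */
    public static <T> T callWithRetry(ClientCall<T> op, int maxRetries) {
        AuthenticationFailedException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (AuthenticationFailedException e) {
                last = e; // meaningful exception; caller chose to retry
            }
        }
        throw last;
    }
}
```

A misconfigured application that never succeeds simply sees the exception after its own retry budget, rather than blocking forever inside the client.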
> > >
> > > This is true if the user does not have Describe permission. If the
> > > user has Describe access and no Read or Write access, then the user
> > > is informed that the operation was not authorized.
> > >
> > > > Another important distinction for authorization failures is that
> > > > the connection is not terminated.
> > > >
> > > > For unauthenticated clients, we do want to inform the client that
> > > > authentication failed. The connection is terminated by the broker.
> > > > Especially if the client is using SASL_SSL, we really do want to
> > > > avoid reconnections that result in unnecessary expensive
> > > > handshakes. So we want to return an exception to the user with
> > > > minimal retries.
> > >
> > > Agreed.
> > >
> > > > I was thinking that it may be useful to try more than one broker
> > > > for the case where brokers are being upgraded and some brokers
> > > > haven't yet seen the latest credentials. I suppose I was thinking
> > > > that at the moment we keep on retrying every broker forever in
> > > > the consumer, and if we suddenly stop retrying altogether, it
> > > > could potentially lead to some unforeseen timing issues. Hence
> > > > the suggestion to try every broker once.
> > >
> > > I see. Retrying forever is a side-effect of auto-topic creation,
> > > but it's something we want to move away from. As mentioned, we
> > > actually don't retry at all if the user has Describe permission.
> > >
> > > Broker upgrades could be fixed by ensuring that the latest
> > > credentials are loaded before the broker starts serving requests.
> > > More problematic is dealing with credential updates. This is
> > > another distinction when compared to authorization.
> > >
> > > I am not sure if trying different brokers really helps us though.
> > > Say, we fail to authenticate with broker 0 and then we succeed with
> > > broker 1.
> > This > > > helps with metadata requests, but we will be in trouble when we try=20 to > > > produce or consume to broker 0 (because it's the leader of some > > > partitions). So maybe we just want to retry with a suitable backoff > > until a > > > timeout? > > > > > > Yes, I agree that blacking out nodes forever isn't a good idea. When = we > > > > throw AuthenticationFailedException for the current operation or=20 if > > > > authentication to another broker succeeds, we can clear the=20 blackout > so > > > > that any new request from the client can attempt reconnection=20 after > the > > > > reconnect backoff period as they do now. > > > > > > > > > > Yes, that would be better if we decide that connecting to different > > brokers > > > is worthwhile for the requests that can be sent to any broker. > > > > > > Ismael > > > > > > On Thu, May 4, 2017 at 2:29 PM, Rajini Sivaram < > rajinisivaram@gmail.com> > > > wrote: > > > > > > > Hi Ismael, > > > > > > > > Thank you for reviewing the KIP. > > > > > > > > An authenticated client that is not authorized to access a topic=20 is > > never > > > > told that the operation was not authorized. This is to prevent the > > client > > > > from finding out if the topic exists by sending an unauthorized > > request. > > > So > > > > in this case, the client will retry metadata requests with the > > configured > > > > backoff until it times out. Another important distinction for > > > authorization > > > > failures is that the connection is not terminated. > > > > > > > > For unauthenticated clients, we do want to inform the client that > > > > authentication failed. The connection is terminated by the broker. > > > > Especially if the client is using SASL=5FSSL, we really do want to > avoid > > > > reconnections that result in unnecessary expensive handshakes. So=20 we > > want > > > > to return an exception to the user with minimal retries. 
> > > >
> > > > I was thinking that it may be useful to try more than one broker
> > > > for the case where brokers are being upgraded and some brokers
> > > > haven't yet seen the latest credentials. I suppose I was thinking
> > > > that at the moment we keep on retrying every broker forever in
> > > > the consumer, and if we suddenly stop retrying altogether, it
> > > > could potentially lead to some unforeseen timing issues. Hence
> > > > the suggestion to try every broker once.
> > > >
> > > > Yes, I agree that blacking out nodes forever isn't a good idea.
> > > > When we throw AuthenticationFailedException for the current
> > > > operation, or if authentication to another broker succeeds, we
> > > > can clear the blackout so that any new request from the client
> > > > can attempt reconnection after the reconnect backoff period, as
> > > > they do now.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > > On Thu, May 4, 2017 at 12:51 PM, Ismael Juma wrote:
> > > >
> > > > > Thanks Rajini. This is a good improvement. One question: the
> > > > > proposal states:
> > > > >
> > > > > > Producer waitForMetadata and consumer ensureCoordinatorReady
> > > > > > will be updated to throw AuthenticationFailedException if
> > > > > > connections to all available brokers fail authentication.
> > > > >
> > > > > Can you elaborate on the reason why we would treat
> > > > > authentication failures differently from authorization
> > > > > failures? It would be good to understand under which scenario
> > > > > it would be beneficial to try all the brokers (it seems that
> > > > > the proposal also suggests blacking out brokers permanently if
> > > > > we fail authentication, so that could also eventually cause
> > > > > issues).
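[Editorial note: the quoted proposal - throw AuthenticationFailedException only when connections to *all* available brokers fail authentication - can be sketched as a simple loop. The types here are stand-ins for illustration, not the internals of waitForMetadata or ensureCoordinatorReady.]

```java
import java.util.List;

// Sketch: try each available broker in turn and surface an authentication
// error only when every broker rejects our credentials. Illustrative only.
public class AllBrokersAuth {
    public static class AuthenticationFailedException extends RuntimeException {
        public AuthenticationFailedException(String msg) { super(msg); }
    }

    /** Stand-in for one broker and an authentication attempt against it. */
    public static class Broker {
        private final String id;
        private final boolean acceptsCredentials;

        public Broker(String id, boolean acceptsCredentials) {
            this.id = id;
            this.acceptsCredentials = acceptsCredentials;
        }

        public boolean authenticate() { return acceptsCredentials; }

        public String id() { return id; }
    }

    /** Returns the first broker we authenticate to; throws if all fail. */
    public static Broker firstAuthenticated(List<Broker> brokers) {
        for (Broker b : brokers) {
            if (b.authenticate())
                return b; // one success is enough; no exception surfaced
        }
        throw new AuthenticationFailedException(
            "Authentication failed to all " + brokers.size() + " available brokers");
    }
}
```

This also makes Ismael's concern concrete: succeeding against broker 1 hides a failure against broker 0 here, even though a later produce or fetch may still need broker 0 as a partition leader.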
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Thu, May 4, 2017 at 12:37 PM, Rajini Sivaram
> > > > > <rajinisivaram@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have created a KIP to improve diagnostics for SASL
> > > > > > authentication failures and reduce retries and blocking when
> > > > > > authentication fails:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-152+-+Improve+diagnostics+for+SASL+authentication+failures
> > > > > >
> > > > > > Comments and suggestions are welcome.
> > > > > >
> > > > > > Thank you...
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Rajini