hadoop-common-issues mailing list archives

From "Xiao Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14044) Synchronization issue in delegation token cancel functionality
Date Thu, 02 Feb 2017 22:54:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850688#comment-15850688 ]

Xiao Chen commented on HADOOP-14044:

Thanks for the patch [~hgadre] and the manual tests, looks pretty good to me. I agree it's
not worth writing a unit test, since that would introduce a whole lot of test controls (or
too many mocks...).

I suggest adding the {{synchronized}} keyword to the new {{syncLocalCacheWithZkState}} method,
since we're handling {{currentTokens}} there. I understand the current code is fine because the
caller method {{cancelToken}} is synchronized, but adding it would be more future-proof.
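
To illustrate the suggestion, here is a minimal sketch, not the actual patch. The class {{TokenManagerSketch}}, the map standing in for ZooKeeper state, and the method bodies are all assumptions made for illustration; only the names {{cancelToken}}, {{syncLocalCacheWithZkState}} and {{currentTokens}} come from the discussion above. It shows that marking both methods {{synchronized}} is safe, because Java's intrinsic locks are reentrant.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a ZooKeeper-backed token manager; not Hadoop's class.
class TokenManagerSketch {
    // Assumed stand-in for the shared ZooKeeper state (token id -> owner).
    private final Map<String, String> zkState;
    // Local in-memory cache, guarded by this object's monitor.
    private final Map<String, String> currentTokens = new HashMap<>();

    TokenManagerSketch(Map<String, String> zkState) {
        this.zkState = zkState;
    }

    // Synchronized on its own, so it stays safe even if a future caller
    // is not synchronized.
    synchronized void syncLocalCacheWithZkState(String tokenId) {
        String owner = zkState.get(tokenId);
        if (owner == null) {
            currentTokens.remove(tokenId);
        } else {
            currentTokens.put(tokenId, owner);
        }
    }

    // The monitor is reentrant: calling another synchronized method of the
    // same object from here does not deadlock.
    synchronized boolean cancelToken(String tokenId) {
        syncLocalCacheWithZkState(tokenId);
        if (currentTokens.remove(tokenId) == null) {
            return false; // token unknown or already canceled
        }
        zkState.remove(tokenId);
        return true;
    }
}

class TokenManagerSketchDemo {
    public static void main(String[] args) {
        Map<String, String> zk = new ConcurrentHashMap<>();
        zk.put("DT", "alice");
        TokenManagerSketch manager = new TokenManagerSketch(zk);
        System.out.println(manager.cancelToken("DT")); // first cancel
        System.out.println(manager.cancelToken("DT")); // repeated cancel
    }
}
```

The extra keyword only adds a reentrant lock acquisition for a thread that already holds the monitor, so the cost on the existing {{cancelToken}} path should be small.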

And a super trivial nit: I think we can just name the new method {{syncLocalCacheWithZk}}.

> Synchronization issue in delegation token cancel functionality
> --------------------------------------------------------------
>                 Key: HADOOP-14044
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14044
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Hrishikesh Gadre
>            Assignee: Hrishikesh Gadre
>         Attachments: dt_fail.log, dt_success.log, HADOOP-14044-001.patch, HADOOP-14044-002.patch
> We are using the Hadoop delegation token authentication functionality in Apache Solr. As part of integration testing, I found the following issue with delegation token cancelation.
> Consider a setup with 2 Solr servers (S1 and S2) that are configured to use delegation token functionality backed by ZooKeeper. Now perform the following steps:
> [Step 1] Send a request to S1 to create a delegation token.
>   (Delegation token DT is created successfully)
> [Step 2] Send a request to cancel DT to S2
>   (DT is canceled successfully. client receives HTTP 200 response)
> [Step 3] Send a request to cancel DT to S2 again
>   (DT cancelation fails. client receives HTTP 404 response)
> [Step 4] Send a request to cancel DT to S1
> At this point we get one of two different responses:
> - DT cancelation fails. client receives HTTP 404 response
> - DT cancelation succeeds. client receives HTTP 200 response
> Also, as per the current implementation, each server maintains an in-memory cache of current tokens, which is updated using the ZK watch mechanism. For example, the ZK watch on S1 will ensure that S1's in-memory cache is synchronized after step 2.
> After investigation, I found that the root cause of this behavior is a race condition between step 4 and the firing of the ZK watch on S1. When the watch fires before step 4, we get an HTTP 404 response (as expected). When it does not, we get an HTTP 200 response along with the following ERROR message in the log:
> {noformat}
> Attempted to remove a non-existing znode /ZKDTSMTokensRoot/DT_XYZ
> {noformat}
> From the client's perspective, the server *should* return an HTTP 404 error when a cancel request is sent for an invalid token.
> Ref: the relevant Solr unit test:
> https://github.com/apache/lucene-solr/blob/746786636404cdb8ce505ed0ed02b8d9144ab6c4/solr/core/src/test/org/apache/solr/cloud/TestSolrCloudWithDelegationTokens.java#L285
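
The race in the quoted description can be modeled deterministically with a small sketch. Everything here is hypothetical: {{ServerSketch}}, {{RaceDemo}}, the shared set standing in for the ZK znodes, and the integer status codes are illustrative assumptions, not Hadoop or Solr code. The {{syncBeforeCancel}} flag models one plausible reading of the fix, namely refreshing the local cache from the shared state before canceling.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of one server; names and status codes are illustrative.
class ServerSketch {
    private final Set<String> sharedStore;                   // stand-in for ZK znodes
    private final Set<String> localCache = new HashSet<>();  // per-server cache
    private final boolean syncBeforeCancel;                  // models the assumed fix

    ServerSketch(Set<String> sharedStore, boolean syncBeforeCancel) {
        this.sharedStore = sharedStore;
        this.syncBeforeCancel = syncBeforeCancel;
    }

    synchronized void createToken(String id) {
        sharedStore.add(id);
        localCache.add(id);
    }

    // Simulates the ZK watch firing: refresh one cache entry from the store.
    synchronized void onWatchFired(String id) {
        if (sharedStore.contains(id)) {
            localCache.add(id);
        } else {
            localCache.remove(id);
        }
    }

    // Returns 200 if this server believes it canceled the token, else 404.
    synchronized int cancelToken(String id) {
        if (syncBeforeCancel) {
            onWatchFired(id); // refresh instead of waiting for the watch
        }
        if (!localCache.remove(id)) {
            return 404;
        }
        // Without the refresh this can be a no-op on an already removed entry,
        // which corresponds to the "non-existing znode" ERROR case.
        sharedStore.remove(id);
        return 200;
    }
}

class RaceDemo {
    // Runs steps 1-4 with S1's watch not yet fired and returns S1's status.
    static int step4Status(boolean syncBeforeCancel) {
        Set<String> store = ConcurrentHashMap.newKeySet();
        ServerSketch s1 = new ServerSketch(store, syncBeforeCancel);
        ServerSketch s2 = new ServerSketch(store, syncBeforeCancel);
        s1.createToken("DT");        // step 1
        s2.onWatchFired("DT");       // S2 learns about DT
        s2.cancelToken("DT");        // step 2
        s2.cancelToken("DT");        // step 3
        return s1.cancelToken("DT"); // step 4, before S1's watch fires
    }

    public static void main(String[] args) {
        System.out.println(step4Status(false)); // stale cache: 200
        System.out.println(step4Status(true));  // refreshed cache: 404
    }
}
```

In this model, step 4 no longer depends on whether the watch has fired once the cache is refreshed inside the synchronized cancel path.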

