hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7380) Add client failover functionality to o.a.h.io.(ipc|retry)
Date Wed, 17 Aug 2011 19:01:28 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086517#comment-13086517
] 

Aaron T. Myers commented on HADOOP-7380:
----------------------------------------

As requested, here's a design doc:

h2. Client Retry/Fail Over in IPC Overview

The goal of HDFS-1973 is to provide a facility for clients to fail over and retry an RPC to
a different NN in the event of failure, in support of HDFS HA. Since the HDFS RPC mechanisms
are built on top of the Common IPC library, this is a natural place to implement this functionality.
Furthermore, since client fail over in general has challenges which are not HDFS-specific,
we may be able to reuse this mechanism for other HA services in the future (e.g. perhaps the
MRv2 Resource Manager.)

h3. Goals

This mechanism should be able to support the following:

# A way to support multiple distinct ways for determining which object an RPC should be attempted
against.
# A way of specifying customized failover strategies. For example, some HA service may only
support a pair of machines to serve requests, while another might support an arbitrary number.
The strategy for failing over to a new machine might be different for these cases.
# The method for specifying a retry/failover strategy should be able to control both retry
and failover logic. For example, a strategy may want to try a request to server A once, attempt
a failover to server B immediately, fail back to server A and then retry several times with
some amount of backoff.
# A way for a remote process to indicate that it is not the appropriate process to serve the
request, and that an attempt should be made to another process.
# A way of specifying that some operations in an IPC interface are not safe to be failed over
and retried.

h3. Approach

The Common IPC library already supports retrying an IPC against the same remote process. This
is done via the classes in the o.a.h.io.retry package. The exact desired retry semantics can
be specified using an implementation of the {{RetryPolicy}} interface. An implementation of
this interface is responsible for determining whether or not a particular method call should
be retried given the number of times its already been tried and the particular exception which
caused the method to fail. An implementer of this interface can either choose to have the
failed method retried, or the exception re-thrown.

In a sense, client-side failover is exactly the same as the existing retry mechanism, except
method calls can be retried against a different proxy object. Thus, this JIRA proposes to
extend the existing retry facility to also support client failover.

Presently, the {{RetryPolicy.shouldRetry}} method only returns a boolean to indicate whether
a method invocation should be retried or considered to have failed. This can be augmented
to return an enum value to indicate whether a particular method invocation should fail, be
retried against the same object, or retried on another object. In order to determine what
object a method should be tried against, this JIRA introduces the concept of a {{FailoverProxyProvider}}.
An implementer of this interface is capable of getting a proxy object and initiating a client
failover when the result of {{RetryPolicy.shouldRetry}} indicates a failover should be performed
for the particular {{RetryPolicy}} the {{RetryInvocationHandler}} is configured to use. This
addresses goals 1, 2, and 3 from above.

To address goal 4, this JIRA introduces a new exception type - {{StandbyException}} - which
can be thrown by remote processes to indicate that it is not the appropriate process to handle
requests for a given service at this time. {{RetryPolicy}} implementations may choose to handle
this exception differently than other exception types when determining whether or not to retry
or fail over an operation.

Though there may be circumstances in which a client may desire a more complex retry/failover
strategy, most clients will want to failover only on network exceptions in which an RPC is
guaranteed to have not reached the remote process, or in cases in which the particular method
to be retried will have no mutative ill-effects if retried (e.g. read operations or idempotent
write operations.) This JIRA thus introduces a new {{RetryPolicy}} implementation - {{FailoverOnNetworkExceptionRetry}}.
This retry policy fails over whenever a method call fails in such a way as to guarantee that
it did not reach the original remote process, or when retrying a method which can be safely
retried. In order to know which methods of an IPC interface can be safely retried, this JIRA
introduces an {{@Idempotent}} method annotation which, if present, will be passed on to the
retry policy by the {{RetryInvocationHandler}} when determining whether or not to retry a
method. This addresses goal 5.

> Add client failover functionality to o.a.h.io.(ipc|retry)
> ---------------------------------------------------------
>
>                 Key: HADOOP-7380
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7380
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: ipc
>    Affects Versions: 0.23.0
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>             Fix For: 0.23.0
>
>         Attachments: hadoop-7380-hdfs-example.patch, hadoop-7380.0.patch, hadoop-7380.1.patch,
hadoop-7380.2.patch, hdfs-7380.3.patch
>
>
> Implementing client failover will likely require changes to {{o.a.h.io.ipc}} and/or {{o.a.h.io.retry}}.
This JIRA is to track those changes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message