hadoop-common-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7472) RPC client should deal with the IP address changes
Date Wed, 27 Jul 2011 19:28:09 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071934#comment-13071934 ]

Suresh Srinivas commented on HADOOP-7472:
-----------------------------------------

bq. Either way it's not clean as long as the upper layer keeps using the old resolved address.
My preference is for the upper layer to not care about the lower layer connection semantics.

bq. Re-resolving part is fine. But the fact that the callers keep the old address in their immutable InetSocketAddress object can cause problems. Since we cannot afford to check the mapping in every RPC invocation by re-resolving, the connection will go through if NN is started with the old address by mistake. Also the connection cache in RPC Client uses the address as the key. Do we keep the old key? Or update with the new one? Either way it's not clean as long as the upper layer keeps using the old resolved address.
Upper layers pass the InetSocketAddress down; they do not hold on to it. Can you point to where it is held on to? I am not sure that reconnecting to the old address after an NN restart is a critical problem we need to deal with; given the scenario you are solving, it is unlikely.
One of the things I was considering is replacing the InetSocketAddress passed to the underlying layers with a wrapper that allows updating it with the newly resolved address.
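For illustration, the wrapper idea might look like the following sketch. The class and method names (UpdatableAddress, refresh) are hypothetical, not from any patch; it only shows that upper layers could hold a stable handle while the IPC layer re-resolves after a failure.

```java
import java.net.InetSocketAddress;

// Hypothetical sketch of the wrapper idea: upper layers keep this handle,
// and the IPC layer calls refresh() after a connection failure to pick up
// a changed DNS mapping. Not actual Hadoop code.
public class UpdatableAddress {
    private final String host;
    private final int port;
    private volatile InetSocketAddress resolved;

    public UpdatableAddress(String host, int port) {
        this.host = host;
        this.port = port;
        // Constructing an InetSocketAddress from a hostname resolves it now.
        this.resolved = new InetSocketAddress(host, port);
    }

    /** Current resolved address; may be stale if DNS has changed since. */
    public InetSocketAddress get() {
        return resolved;
    }

    /** Re-resolve the hostname, e.g. after a connect failure. */
    public synchronized InetSocketAddress refresh() {
        resolved = new InetSocketAddress(host, port); // fresh DNS lookup
        return resolved;
    }
}
```

The connection cache could then key on the wrapper (host:port identity) rather than on the resolved InetSocketAddress, sidestepping the stale-key question raised above.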

Some comments for the patch:
# DFSClient did not have any notion of addresses; that was only in the layer below. The current code handles the exception in every RPC call, and this repetitive code should be avoided. Also, as you noted, this only works for DFS and not for all the RPCs.
# IPC Client now introduces a new exception. The implementations that currently use IPC/RPC do not handle this exception gracefully.


I also think we should create a JIRA to ensure this is tested in 0.23 and does not break. I am not sure whether this should be a blocker for 0.22.


> RPC client should deal with the IP address changes
> --------------------------------------------------
>
>                 Key: HADOOP-7472
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7472
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 0.20.205.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Minor
>             Fix For: 0.20.205.0
>
>         Attachments: addr_change_dfs-1.patch.txt, addr_change_dfs.patch.txt
>
>
> The current RPC client implementation and the client-side callers assume that the hostname-to-address mappings of servers never change. The resolved address is stored in an immutable InetSocketAddress object above/outside RPC, and the reconnect logic in the RPC Connection implementation also trusts the resolved address that was passed down.
> If the NN suffers a failure that requires migration, it may be started on a different node with a different IP address. In this case, even if the name-to-address mapping is updated in DNS, the cluster is stuck trying the old address until the whole cluster is restarted.
> The RPC client side should detect this situation and exit or try to recover.
> Updating the ConnectionId within the Client implementation may get the system working for the moment, but there is always a risk of the cached address:port unintentionally becoming connectable again. The real solution is to notify the upper layers of the address change so that they can re-resolve and retry, or to re-architect the system as discussed in HDFS-34.
> For the 0.20 lines, some type of compromise may be acceptable. For example, raise a custom exception that well-defined, high-impact upper layers can catch to re-resolve and retry, while others will have to restart. For TRUNK, the HA work will most likely determine what needs to be done, so this Jira won't cover the solutions for TRUNK.
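The detection step described in the issue could be sketched as follows. This is illustrative code under the assumption that a fresh lookup is acceptable at reconnect time; `AddressCheck` and `addressChanged` are hypothetical names, while `isUnresolved` and `getAddress` are standard `java.net.InetSocketAddress` methods.

```java
import java.net.InetSocketAddress;

// Sketch only: re-resolve the cached hostname and report whether the
// underlying IP has changed. Hypothetical helper, not the patch's code.
public class AddressCheck {
    public static boolean addressChanged(InetSocketAddress cached) {
        if (cached.isUnresolved()) {
            return false; // no resolved IP to compare against
        }
        // Constructing a new InetSocketAddress performs a fresh DNS lookup.
        InetSocketAddress fresh =
            new InetSocketAddress(cached.getHostName(), cached.getPort());
        return !fresh.isUnresolved()
            && !fresh.getAddress().equals(cached.getAddress());
    }
}
```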

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
