hadoop-hdfs-issues mailing list archives

From "James Clampffer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11028) libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel pending connections
Date Wed, 04 Jan 2017 16:11:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-11028:
-----------------------------------
    Attachment: HDFS-11028.HDFS-8707.001.patch

Updated patch:
-Isolated the connection cancel logic from the general RPC cancel logic; this patch only handles connection cancels.
-Cleaned up an example that can also be used as a simple test for cancel.

To test:
1) Build libhdfs++ and set $HADOOP_CONF_DIR to valid configs for a running cluster (preferably an HA cluster).  It should connect to the cluster.
2) Now copy the good config and replace all of the NN port numbers with something invalid so that libhdfs++ keeps getting connection refused or timeout errors.  You should be able to quit early with Control-C.

Everything should be fairly clean under valgrind.  There are a few statically initialized objects that make noise, but none of it should come from inside libhdfs++.

Todo:
-Simple C binding to set up an hdfsFS without connecting, so it can be passed to an hdfsCancelPendingConnect function.

> libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel pending connections
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11028
>                 URL: https://issues.apache.org/jira/browse/HDFS-11028
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-11028.HDFS-8707.000.patch, HDFS-11028.HDFS-8707.001.patch
>
>
> Cancel support is now reasonably robust except in the case where a FileHandle operation ends up causing the RpcEngine to try to create a new RpcConnection.  In HA configs it's common to have something like 10-20 failovers and a 20 second failover delay (no exponential backoff just yet).  This means that all of the functions with synchronous interfaces can still block for many minutes after an operation has been canceled, and often the cause is something trivial like a bad config file.
> The current design makes this sort of thing tricky to do because the FileHandles need to be individually cancelable via CancelOperations, but they share the RpcEngine that does the async magic.
> Updated design:
> The original design would end up forcing lots of reconnects.  That is not a huge issue on an unauthenticated cluster, but on a kerberized cluster it is a recipe for Kerberos thinking we're attempting a replay attack.
> User-visible cancellation and internal resource cleanup are separable issues.  The former can be implemented by atomically swapping the callback of the operation to be canceled with a no-op callback.  The original callback is then posted to the IoService with an OperationCanceled status and the user is no longer blocked.  For RPC cancels this is sufficient: it's not expensive to keep a request around a little bit longer, and when it's eventually invoked or timed out it invokes the no-op callback and is ignored (other than a trace-level log notification).  Connect cancels push a flag down into the RPC engine to kill the connection and make sure it doesn't attempt to reconnect.
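
The callback swap described above could look roughly like the following. This is only a minimal sketch and not the code in the patch: `PendingOp`, `Status`, `Cancel`, `Complete`, and `DemoCancelThenLateCompletion` are hypothetical names made up for illustration, a mutex stands in for whatever atomic mechanism libhdfs++ actually uses, and the original callback is invoked directly where the real library would post it to the IoService.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical status type for this sketch; libhdfs++ has its own Status class.
enum class Status { kOk, kOperationCanceled };

using Callback = std::function<void(Status)>;

// Hypothetical pending operation whose completion callback can be swapped out.
class PendingOp {
 public:
  explicit PendingOp(Callback cb) : cb_(std::move(cb)) {}

  // Replace the stored callback with a no-op and return the original so the
  // caller can deliver it with a canceled status.
  Callback Cancel() {
    std::lock_guard<std::mutex> lock(mu_);
    Callback original = std::move(cb_);
    cb_ = [](Status) {};  // later completions are silently ignored
    return original;
  }

  // Called when the request eventually finishes or times out.
  void Complete(Status s) {
    Callback cb;
    {
      std::lock_guard<std::mutex> lock(mu_);
      cb = cb_;  // copy under the lock, invoke outside it
    }
    cb(s);
  }

 private:
  std::mutex mu_;
  Callback cb_;
};

// Simulates a user cancel followed by a late completion of the stale request.
// Returns how many times the user's callback ran.
int DemoCancelThenLateCompletion() {
  int user_calls = 0;
  Status seen = Status::kOk;
  PendingOp op([&](Status s) { ++user_calls; seen = s; });

  Callback original = op.Cancel();
  original(Status::kOperationCanceled);  // the real code would post this to the IoService
  op.Complete(Status::kOk);              // late completion hits the no-op

  assert(seen == Status::kOperationCanceled);
  return user_calls;
}
```

In this sketch the user's callback runs exactly once, with the canceled status, and the late completion only reaches the no-op. That matches the idea that the request can be kept around a little longer without the user staying blocked.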



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

