hadoop-hdfs-issues mailing list archives

From "James Clampffer (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-11028) libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel pending connections
Date Tue, 18 Oct 2016 21:23:59 GMT
James Clampffer created HDFS-11028:

             Summary: libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel pending connections
                 Key: HDFS-11028
                 URL: https://issues.apache.org/jira/browse/HDFS-11028
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: hdfs-client
            Reporter: James Clampffer
            Assignee: James Clampffer

Cancel support is now reasonably robust except for the case where a FileHandle operation ends
up causing the RpcEngine to try to create a new RpcConnection.  In HA configs it's common
to have something like 10-20 failovers and a 20 second failover delay (no exponential backoff
just yet). This means that all of the functions with synchronous interfaces can still block
for many minutes after an operation has been canceled, and often the cause is something
trivial like a bad config file.
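
To put a number on "many minutes", the arithmetic can be sketched as below. FailoverWait is a hypothetical helper, not a libhdfs++ function, and it only counts the fixed delay between attempts; connect timeouts would add to it. With the figures above it comes to 200-400 seconds, i.e. roughly 3 to 7 minutes.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not part of libhdfs++: time spent purely waiting
// between failover attempts when every attempt fails and the delay is
// fixed (no exponential backoff). Connect timeouts are not included.
inline std::chrono::seconds FailoverWait(int max_failovers,
                                         std::chrono::seconds delay) {
  return delay * max_failovers;
}
```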

The current design makes this sort of thing tricky to do because the FileHandles need to be
individually cancelable via CancelOperations, but they share the RpcEngine that does the async

A non-exhaustive list of design assumptions:
1) Multiple users will be working on the same FS in the same process, and some users might
be a lot more impatient than others.  This means it's possible that progress is slow and an
impatient user wants to give up, even though nothing has stalled and other users are still
able to make progress.  Side effects of a FileHandle::CancelOperations call should only be
visible to the owner of that FH.
2) In most use cases the library spends more time in the read path than in namenode metadata
operations.  At any given time it's unlikely that there is a large number of pending RPC
requests, though this certainly can happen (see [~anatoli.shein]'s awesome tools).

Some sparse design plans to help out reviewers:
1a) RPC Request objects get something analogous to the ReaderGroup to track all pending requests
associated with a FileHandle.  As long as there is a transitive dependency on the FH from
the request a flag can be pushed down.
1b) FileSystem operations also need the same support.  Since they return their result directly
there isn't an object to call a cancel method on.  One approach here would be to pass in an
optional flag (CancelHandle object).
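
A minimal sketch of how 1a and 1b might fit together, under stated assumptions. Only the name CancelHandle comes from the plan above; RequestGroup, PendingRequest, OpStatus and StatWithCancel are hypothetical names made up for illustration and are not existing libhdfs++ APIs. Each FileHandle would own one group, so setting the flag is only visible to requests created through that handle.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Shared flag: the caller keeps one reference, the library polls another.
class CancelHandle {
 public:
  void Cancel() { canceled_.store(true); }
  bool IsCanceled() const { return canceled_.load(); }
 private:
  std::atomic<bool> canceled_{false};
};

// Hypothetical RPC request that carries a reference to its owner's flag.
struct PendingRequest {
  std::string method;
  std::shared_ptr<const CancelHandle> cancel;
  bool IsCanceled() const { return cancel && cancel->IsCanceled(); }
};

// Hypothetical per-FileHandle tracker (1a), loosely analogous to ReaderGroup.
class RequestGroup {
 public:
  RequestGroup() : handle_(std::make_shared<CancelHandle>()) {}
  PendingRequest MakeRequest(std::string method) const {
    return PendingRequest{std::move(method), handle_};
  }
  // Pushes the flag down to every request created by this group.
  void CancelAll() { handle_->Cancel(); }
 private:
  std::shared_ptr<CancelHandle> handle_;
};

enum class OpStatus { kOk, kCanceled };

// Hypothetical synchronous FileSystem-style call (1b) that takes an optional
// CancelHandle. A real call would also poll the flag while it waits.
inline OpStatus StatWithCancel(
    const std::string& path,
    const std::shared_ptr<CancelHandle>& cancel = nullptr) {
  (void)path;  // the path is unused in this sketch
  if (cancel && cancel->IsCanceled()) return OpStatus::kCanceled;
  return OpStatus::kOk;
}
```

Passing the handle as an optional argument keeps existing callers unchanged, while a caller that wants to give up can hold on to the handle and call Cancel() from another thread.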

2) Based on assumption 2, it's generally not prohibitively expensive to cancel and resend async
RPC calls.  Since the RpcConnection is shared by all pending requests, it needs to be wiped
out.  This will cause all pending and in-flight requests to return asio::operation_aborted
status.  If the Request object doesn't have its flag set to canceled, it gets placed back
in line using the same mechanism as common RPC errors.  This retry does not count against
the retry_count or failover_count since it's a side effect of the cancel.  Nor should this
cause the RpcEngine to attempt to fail over.
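
The requeue rule in 2) could look roughly like the sketch below. Everything here is hypothetical and only illustrates the bookkeeping: Status::kOperationAborted stands in for asio::operation_aborted, and RetryQueue and Request are made-up types, not the actual RpcEngine classes. A failover_count would be treated the same way as retry_count and is left out for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the error codes a request can come back with.
enum class Status { kOperationAborted, kConnectionError };

struct Request {
  Request(int id_in, std::shared_ptr<std::atomic<bool>> flag)
      : id(id_in), canceled(std::move(flag)) {}
  int id;
  std::shared_ptr<std::atomic<bool>> canceled;  // set by the owning FileHandle
  int retry_count = 0;
};

class RetryQueue {
 public:
  // Called for every request that comes back with an error after the shared
  // connection has been torn down or has failed.
  void OnFailure(Request req, Status status) {
    if (req.canceled->load()) {
      // The owner gave up: report the cancel and do not resend.
      canceled_ids_.push_back(req.id);
      return;
    }
    if (status != Status::kOperationAborted) {
      // Genuine RPC error: counts against the retry budget.
      ++req.retry_count;
    }
    // Innocent bystanders of a cancel are resent without being charged.
    pending_.push_back(std::move(req));
  }

  const std::deque<Request>& pending() const { return pending_; }
  const std::vector<int>& canceled_ids() const { return canceled_ids_; }

 private:
  std::deque<Request> pending_;
  std::vector<int> canceled_ids_;
};
```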

