hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7915) The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient about it because of a network error
Date Thu, 12 Mar 2015 20:38:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359351#comment-14359351 ]

Colin Patrick McCabe commented on HDFS-7915:
--------------------------------------------

bq. Here bld is set to SUCCESS status, without checking whether fis is null or not. However,
down in the code below:.... success is set to true only when fis is not null. I saw a bit
of an inconsistency here. Is it a success when fis is null? If not, then the first section has an
issue. If yes, then we can probably change success to isFisObtained.

There is no inconsistency.  {{DataNode#requestShortCircuitFdsForRead}} cannot return null.
It can only throw an exception or return some fds.  There is a difference between attempting
to send a SUCCESS response to the DFSClient and the whole function being successful.  Just
because we attempted to send a SUCCESS response doesn't mean it was actually delivered.  We must
actually send both the fds and the response for the function to succeed.

I will add a Precondition check to make it clearer that {{fis}} cannot be null when a SUCCESS
response is being sent.
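
The distinction between "attempted to send SUCCESS" and "the function succeeded" can be
sketched roughly as follows. This is a simplified, hypothetical model and not the actual
{{DataXceiver}} code: {{ShortCircuitSketch}}, {{Sender}}, {{registeredSlots}} and {{requestFds}}
are illustrative names, the fd values are placeholders, and {{Objects.requireNonNull}} merely
stands in for the precondition check mentioned above.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class ShortCircuitSketch {
    /** Hypothetical stand-in for the network send to the DFSClient. */
    interface Sender {
        void send(String status, int[] fds) throws IOException;
    }

    /** Hypothetical stand-in for the DataNode's slot bookkeeping. */
    static final Set<Integer> registeredSlots = new HashSet<>();

    /** Returns true only if the response and fds were actually sent. */
    static boolean requestFds(int slotId, Sender sender) {
        boolean success = false;
        registeredSlots.add(slotId);            // part one: mark the slot as used
        try {
            int[] fds = {3, 4};                 // placeholder fds; the real call throws rather than returning null
            Objects.requireNonNull(fds, "fds must not be null when sending SUCCESS");
            sender.send("SUCCESS", fds);        // part two: may fail with a network error
            success = true;                     // set only after the send has completed
        } catch (IOException e) {
            // We attempted to send SUCCESS, but the client never received it.
        } finally {
            if (!success) {
                registeredSlots.remove(slotId); // undo part one on any failure
            }
        }
        return success;
    }
}
```

Because the cleanup sits in the finally block and keys off {{success}}, it covers a failure in
either part, including unchecked exceptions that the catch block does not handle.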

bq. The reason that we have to unregister a slot could be an exception recorded in bld, or
an exception not currently caught in this method. I think we can add code to capture
the currently uncaught exception, remember it, then re-throw it, so that when we do the
logging above in the finally block, we can report this exception as the reason why we are un-registering
the slot in this log.

I think this would add too much complexity.  If we catch Throwable, we can't re-throw Throwable.
 So we'd have to have separate catch blocks for RuntimeException, IOException, and probably
another block to catch other things.
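
For illustration, the extra catch blocks being described might look something like the
sketch below. It is hypothetical and not taken from any patch: {{RethrowSketch}}, {{Step}},
{{runAndLogReason}} and {{lastLoggedReason}} are made-up names, and a static string stands in
for the log line.

```java
import java.io.IOException;

class RethrowSketch {
    /** Hypothetical unit of work that may fail with a checked exception. */
    interface Step {
        void run() throws IOException;
    }

    /** Stands in for the log message written when a slot is unregistered. */
    static String lastLoggedReason = null;

    static void runAndLogReason(Step step) throws IOException {
        Throwable reason = null;
        boolean success = false;
        try {
            step.run();
            success = true;
        } catch (IOException e) {        // checked exception declared by this method
            reason = e;
            throw e;
        } catch (RuntimeException e) {   // unchecked exceptions
            reason = e;
            throw e;
        } catch (Error e) {              // "other things"
            reason = e;
            throw e;
        } finally {
            if (!success) {
                lastLoggedReason = "unregistering slot: " + reason;
            }
        }
    }
}
```

Each block exists only to remember the exception before re-throwing it with its original
type, which is the added complexity in question.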

> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient
> about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient
> about it because of a network error.  In {{DataXceiver#requestShortCircuitFds}}, the DataNode
> can succeed at the first part (mark the slot as used) and fail at the second part (tell the
> DFSClient what it did). The "try" block for unregistering the slot only covers a failure in
> the first part, not the second part. In this way, a divergence can form between the views
> of which slots are allocated on DFSClient and on server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
