hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7915) The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient about it because of a network error
Date Thu, 12 Mar 2015 04:44:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358092#comment-14358092 ]

Yongjun Zhang commented on HDFS-7915:
-------------------------------------

Hi Colin,

Thanks for the new rev. I found that I made a mistake in my earlier test: I need to include
-Pnative as a build switch to enable the test. After doing that, I can see the test fail even
with rev 001 after reverting DataXceiver.java. Did you suspect this problem when making rev
002?

Some additional comments:

{code}
      fis = datanode.requestShortCircuitFdsForRead(blk, token, maxVersion);
      bld.setStatus(SUCCESS);
      bld.setShortCircuitAccessVersion(DataNode.CURRENT_BLOCK_FORMAT_VERSION);
{code}
Here the status of {{bld}} is set to SUCCESS without checking whether {{fis}} is null. However,
further down in the code:
{code}
      if (fis != null) {
        FileDescriptor fds[] = new FileDescriptor[fis.length];
        ......
        success = true;
      }
{code}
{{success}} is set to true only when {{fis}} is not null. I see a bit of an inconsistency here.
Is it a success when {{fis}} is null? If not, then the first section has an issue. If yes, then
we can probably rename {{success}} to {{isFisObtained}}.

It seems that when we do the logging below
{code}
      if ((!success) && (registeredSlotId != null)) {
        LOG.info("Unregistering " + registeredSlotId + " because the " +
            "requestShortCircuitFdsForRead operation failed.");
        datanode.shortCircuitRegistry.unregisterSlot(registeredSlotId);
      }
{code}
the reason that we have to unregister a slot could be either an error recorded in {{bld}} or
an exception that is not currently caught in this method.

I think we can add code to capture the currently uncaught exception, remember it, and then
re-throw it, so that when we do the logging above in the finally block, we can report this
exception as the reason why we are unregistering the slot.
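
As a rough sketch of that idea (not the actual DataXceiver code): the class {{SlotCleanupSketch}}, the {{Step}} interface, and the in-memory {{registered}} and {{log}} lists are all hypothetical stand-ins for the slot registry and LOG, and the multi-catch is just one way the exception could be remembered before being re-thrown.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the slot bookkeeping; NOT the real ShortCircuitRegistry.
final class SlotCleanupSketch {
    static final List<String> registered = new ArrayList<>(); // slots marked as used
    static final List<String> log = new ArrayList<>();        // stands in for LOG.info

    /** Hypothetical step that may fail, e.g. sending the response to the client. */
    interface Step {
        void run() throws IOException;
    }

    static void requestFds(String slotId, Step sendResponse) throws IOException {
        boolean success = false;
        Throwable failure = null;   // remembered so the finally block can report it
        registered.add(slotId);     // first part: mark the slot as used
        try {
            sendResponse.run();     // second part: tell the client what was done
            success = true;
        } catch (IOException | RuntimeException e) {
            failure = e;            // remember the exception ...
            throw e;                // ... and re-throw it unchanged
        } finally {
            if (!success) {
                // The log line can now name the cause of the unregistration.
                log.add("Unregistering " + slotId
                    + " because the operation failed: " + failure);
                registered.remove(slotId);
            }
        }
    }

    public static void main(String[] args) {
        try {
            requestFds("slot-demo", () -> { throw new IOException("connection reset"); });
        } catch (IOException e) {
            System.out.println("caller still sees: " + e.getMessage());
        }
        log.forEach(System.out::println);
    }
}
```

The caller still sees the original exception; the only difference is that the log message in the finally block carries the cause. An {{Error}} would not be caught here, so in that case the logged cause would be null.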

What do you think?

Thanks.






> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient
> about it because of a network error.  In {{DataXceiver#requestShortCircuitFds}}, the DataNode
> can succeed at the first part (mark the slot as used) and fail at the second part (tell the
> DFSClient what it did). The "try" block for unregistering the slot only covers a failure in
> the first part, not the second part. In this way, a divergence can form between the views
> of which slots are allocated on DFSClient and on server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
