hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7915) The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient about it because of a network error
Date Sat, 14 Mar 2015 01:38:39 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-7915:
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.7.0
           Status: Resolved  (was: Patch Available)

committed. thanks, guys.

I will file a follow-up to look into whether we can do more logging.  Note that in the specific
case where we caught this bug (writeArray failing), we actually got as much logging as possible
from the DataNode.  Everything we needed was logged there, including the failed domain socket
I/O stack traces.  Similarly, I can't think of any DFSClient logs we needed and didn't get.
We got the domain socket I/O stack traces there as well.  What we don't know is why the
write failed, but we logged as much information as the kernel gave us (it returned EAGAIN,
which in this case means the write timed out).

In general, socket reads and writes can fail, and HDFS needs to be able to handle that.  The
cause of the timeout in the case we saw is outside the scope of this JIRA.

> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch, HDFS-7915.004.patch, HDFS-7915.005.patch, HDFS-7915.006.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient
> about it because of a network error.  In {{DataXceiver#requestShortCircuitFds}}, the DataNode
> can succeed at the first part (marking the slot as used) and fail at the second part (telling
> the DFSClient what it did). The "try" block for unregistering the slot only covers a failure
> in the first part, not the second part. In this way, the DFSClient and the DataNode can end up
> with diverging views of which slots are allocated.
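
The failure mode described above can be illustrated with a small, self-contained Java sketch. It is not the HDFS code and not the committed patch: SlotRegistrySketch, ResponseSender, requestBuggy and requestFixed are hypothetical names invented for this example, and a plain set of integers stands in for the shared memory slots. It only assumes what the description says, that step 1 marks a slot as used and step 2 tells the client, and that step 2 can fail like any socket write.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the server-side slot bookkeeping.
// None of these names come from the HDFS code base.
class SlotRegistrySketch {
    /** Step 2: tells the client which slot it got. May fail like any socket write. */
    interface ResponseSender {
        void send(int slotId) throws IOException;
    }

    private final Set<Integer> allocated = new HashSet<>();
    private int nextSlotId = 0;

    int allocatedCount() {
        return allocated.size();
    }

    /** Step 1: mark a slot as used on the server side. */
    private int registerSlot() {
        int id = nextSlotId++;
        allocated.add(id);
        return id;
    }

    private void unregisterSlot(int id) {
        allocated.remove(id);
    }

    /** Buggy shape: nothing undoes step 1 when step 2 fails, so the slot leaks. */
    int requestBuggy(ResponseSender sender) throws IOException {
        int slotId = registerSlot();
        sender.send(slotId); // a failure here leaves slotId marked as used
        return slotId;
    }

    /** One possible fix: the cleanup covers the send as well. */
    int requestFixed(ResponseSender sender) throws IOException {
        int slotId = registerSlot();
        boolean success = false;
        try {
            sender.send(slotId);
            success = true;
            return slotId;
        } finally {
            if (!success) {
                // The client never learned about the slot, so release it.
                unregisterSlot(slotId);
            }
        }
    }

    public static void main(String[] args) {
        // Simulates the write to the client timing out.
        ResponseSender failing = id -> {
            throw new IOException("simulated write timeout");
        };

        SlotRegistrySketch buggy = new SlotRegistrySketch();
        try {
            buggy.requestBuggy(failing);
        } catch (IOException e) {
            // expected in this simulation
        }

        SlotRegistrySketch fixed = new SlotRegistrySketch();
        try {
            fixed.requestFixed(failing);
        } catch (IOException e) {
            // expected in this simulation
        }

        System.out.println("buggy: slots still allocated = " + buggy.allocatedCount()); // 1
        System.out.println("fixed: slots still allocated = " + fixed.allocatedCount()); // 0
    }
}
```

In the buggy variant the server still counts the slot as allocated after the failed send, while the client never heard of it; that is the divergence the description refers to. The success flag with a finally block is just one way to make the cleanup cover both steps.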



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
