hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7915) The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient about it because of a network error
Date Fri, 13 Mar 2015 22:27:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361208#comment-14361208 ]

Colin Patrick McCabe commented on HDFS-7915:
--------------------------------------------

bq. 1. I think we should look harder in logging a reason when having to unregister a slot
for better supportability (e.g., we want to find out the root cause). I agree that to make
it 100% right would result in too complex logic though. I would propose the following:

I understand your concerns, but every log I've looked at does show why the fd passing
failed, including the full exception.  It is simply logged in a catch block further up in
the DataXceiver.  Logging it again in this function would just be redundant.  Sorry if that
was unclear.

bq. 2. question: change in BlockReaderFactory.java to move  "return new ShortCircuitReplicaInfo(replica);"
to within the try block is not important, I mean, it's ok not to move it, correct?

Yes, it is OK not to move it, because currently the ShortCircuitReplicaInfo constructor
can't fail (it never throws).  But it is better to have it in the try block in case someone
later adds a throw to the constructor.  It is safer.

bq. suggest to change sock.getOutputStream().write((byte).. to sock.getOutputStream().write((int),
since we are using {{DomainSocket#public void write(int val) throws IOException }} API.

OK

bq. Should we define "0" as a constant somewhere and check equivalence instead of "val <
0" at the reader?

It's not necessary.  We don't care what the value is.  Checking for a specific value is
actually bad because it means we can't later decide to use the byte for some other purpose.
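
To illustrate the point, here is a minimal sketch of a reader that only distinguishes "a byte arrived" from "the stream ended". It uses a plain java.io.InputStream as a stand-in for the DomainSocket stream, and the class and method names are hypothetical, not the ones in the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch; not the code from the HDFS-7915 patch.
final class ReceiptByteSketch {
    /** Waits for one receipt byte and fails only if the stream ended first. */
    static void readReceiptByte(InputStream in) throws IOException {
        int val = in.read(); // -1 at end of stream, otherwise 0..255
        if (val < 0) {
            throw new EOFException("stream closed before the receipt byte arrived");
        }
        // The value itself is deliberately ignored, so the sender stays free
        // to give it a meaning later without breaking this reader.
    }

    /** Test helper: true if a receipt byte could be read from the given bytes. */
    static boolean receiptSeen(byte[] wire) {
        try {
            readReceiptByte(new ByteArrayInputStream(wire));
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(receiptSeen(new byte[] {0}));
        System.out.println(receiptSeen(new byte[] {42}));
        System.out.println(receiptSeen(new byte[0]));
    }
}
```

Because the reader accepts any byte, a sender that writes 0 today and some other value later is handled the same way.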

bq. Looks to me that the message should be "Reading receipt byte for ...". right?

thanks, fixed

> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient
> about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch, HDFS-7915.004.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell the DFSClient
> about it because of a network error.  In {{DataXceiver#requestShortCircuitFds}}, the DataNode
> can succeed at the first part (mark the slot as used) and fail at the second part (tell the
> DFSClient what it did). The "try" block for unregistering the slot only covers a failure in
> the first part, not the second part. In this way, a divergence can form between the views
> of which slots are allocated on DFSClient and on server.
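
The failure mode described above can be sketched as follows. This is only an illustration under assumptions: SlotRegistry, ClientNotifier and requestFds are hypothetical stand-ins, not the real DataNode classes, and the sketch shows the general shape of a fix (cleanup that covers both parts), not the actual patch.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these types exist in HDFS under these names.
final class SlotCleanupSketch {
    /** Stand-in for the server-side record of which slots are in use. */
    static final class SlotRegistry {
        private final Set<Integer> used = new HashSet<>();
        void register(int slot) { used.add(slot); }
        void unregister(int slot) { used.remove(slot); }
        boolean isUsed(int slot) { return used.contains(slot); }
    }

    /** Stand-in for sending the response to the client over the network. */
    interface ClientNotifier {
        void notifyClient(int slot) throws IOException;
    }

    static void requestFds(SlotRegistry registry, int slot, ClientNotifier notifier)
            throws IOException {
        boolean success = false;
        try {
            registry.register(slot);     // first part: mark the slot as used
            notifier.notifyClient(slot); // second part: tell the client what we did
            success = true;
        } finally {
            // Cleanup covers BOTH parts, so a failed notification cannot
            // leave the slot marked as used on the server only.
            if (!success) {
                registry.unregister(slot);
            }
        }
    }

    /** Test helper: does the slot stay registered after a simulated network error? */
    static boolean slotLeaksWhenNotifyFails() {
        SlotRegistry registry = new SlotRegistry();
        try {
            requestFds(registry, 7, s -> { throw new IOException("simulated network error"); });
        } catch (IOException expected) {
            // The error still propagates; only the slot state is rolled back.
        }
        return registry.isUsed(7);
    }

    /** Test helper: does the slot stay registered when both parts succeed? */
    static boolean slotKeptWhenNotifySucceeds() {
        SlotRegistry registry = new SlotRegistry();
        try {
            requestFds(registry, 7, s -> { });
        } catch (IOException unexpected) {
            return false;
        }
        return registry.isUsed(7);
    }
}
```

If the cleanup only wrapped the register call, the simulated network error would leave slot 7 marked as used on the server while the client never learned about it, which is the divergence the description refers to.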



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
