hadoop-hdfs-issues mailing list archives

From "James Clampffer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10247) libhdfs++: Datanode protocol version mismatch
Date Tue, 12 Apr 2016 22:29:25 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238125#comment-15238125 ]

James Clampffer commented on HDFS-10247:
----------------------------------------

Thanks for taking a look at this, Bob.

bq. Is there a reason we made buf_ a string rather than a vector<unsigned char>? While
the implementations are very similar, they're semantically very different things, and we're
definitely not dealing with text here.

Protobuf has an odd API there.  You have to use either an ArrayOutputStream or a StringOutputStream
as the output for the serialized message, and both rely on some other structure to hold the
allocated memory. ArrayOutputStream takes a void* and the length of a preallocated buffer, and
since that code path handles a few different message types, efficient preallocation would be
tricky.  StringOutputStream grows the underlying std::string as needed, and then you get to
use .size() to see how much data was serialized.  I'd prefer a vector<uint8_t> output
stream, but at least the StringOutputStream gives most of the same API and RAII.  It doesn't
look like vector has a string&& constructor, so getting the correct semantics requires
an extra malloc and memcpy, which doesn't seem worth it here.
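
To make the cost concrete, here is a minimal standard-library sketch of the two options once the serialized bytes are sitting in a std::string. Both helper names (CopyToByteVector, ViewAsBytes) are hypothetical and not part of libhdfs++ or protobuf; the protobuf serialization step itself is left out.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: moving the bytes into a vector<uint8_t> needs a
// fresh allocation plus a copy, because std::vector cannot adopt the
// storage owned by a std::string.
std::vector<std::uint8_t> CopyToByteVector(const std::string &buf) {
  return std::vector<std::uint8_t>(buf.begin(), buf.end());
}

// Hypothetical helper: read the string's storage as raw bytes in place.
// No allocation or copy; the pointer is only valid while `buf` is alive
// and unmodified.
const std::uint8_t *ViewAsBytes(const std::string &buf) {
  return reinterpret_cast<const std::uint8_t *>(buf.data());
}
```

The copying version gives the byte-oriented type at the price of a second buffer; the view keeps the std::string as the owner and only changes how the bytes are read.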


> libhdfs++: Datanode protocol version mismatch
> ---------------------------------------------
>
>                 Key: HDFS-10247
>                 URL: https://issues.apache.org/jira/browse/HDFS-10247
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-10247.HDFS-8707.000.patch, HDFS-10247.HDFS-8707.001.patch,
HDFS-10247.HDFS-8707.002.patch
>
>
> Occasionally "Version Mismatch (Expected: 28, Received: 22794 )" shows up in the logs.
> This rarely happens with fewer than 500 concurrent reads and starts happening
> often enough to be an issue at 1000 concurrent reads.
> I've seen 3 distinct numbers: 23050 (most common), 22538, and 22794.  If you break these
shorts into bytes you get
> {code}
> 23050 -> [90,10]
> 22794 -> [89,10]
> 22538 -> [88,10]
> {code}
> Interestingly enough if we dump buffers holding protobuf messages just before they hit
the wire we see things like the following with the first two bytes as 90,10
> {code}
> buffer ={90,10,82,10,64,10,52,10,37,66,80,45,49,51,56,49,48,51,51,57,57,49,45,49,50,55,46,48,46,48,46,49,45,49,52,53,57,53,50,53,54,49,53,55,50,53,16,-127,-128,-128,-128,4,24,-23,7,32,-128,-128,64,18,8,10,0,18,0,26,0,34,0,18,14,108,105,98,104,100,102,115,43,43,95,75,67,43,49,16,0,24,23,32,1}
> {code}
> The first 3 bytes the DN is expecting for an unsecured read block request = 
> {code}
> {0,28,81} //[0, 28]->a short for protocol, 81 is read block opcode
> {code}
> This seems like either connections are getting swapped between readers or
> the header isn't being sent for some reason but the protobuf message is.
> I've ruled out memory stomps on the header data (see HDFS-10241) by sticking the 3 byte
> header in its own static buffer that all requests use.
> Some notes:
> -The mismatched number will stay the same for the duration of a stress test.
> -The mismatch is distributed fairly evenly throughout the logs
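
The byte breakdown in the description can be reproduced with a short sketch. It assumes the datanode reads the protocol version as a big-endian 16-bit value, which is what the [0, 28] -> 28 mapping above implies. ReadBigEndianShort is a hypothetical helper for illustration, not a function from libhdfs++ or the datanode.

```cpp
#include <cstdint>

// Hypothetical helper: interpret the first two bytes of a buffer as a
// big-endian (network order) unsigned 16-bit value.
// The caller must supply at least two readable bytes.
std::uint16_t ReadBigEndianShort(const std::uint8_t *data) {
  return static_cast<std::uint16_t>((data[0] << 8) | data[1]);
}
```

Applied to the expected header {0,28,81} this yields 28. Applied to the start of the dumped protobuf buffer {90,10,...} it yields 23050, and {89,10} and {88,10} yield 22794 and 22538. That matches the three mismatched versions seen in the logs, which is consistent with the protobuf message arriving where the 3 byte header was expected.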



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
