hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3132) DFS writes stuck occasionally
Date Fri, 18 Apr 2008 20:50:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590611#action_12590611 ]

Raghu Angadi commented on HADOOP-3132:

tcpdump on the sender (the second datanode in the pipeline) shows the TCP connection was stuck
because of a missing packet. Retransmissions of the missing packet do not seem to be accepted
by the receiver (possibly because of a wrong checksum; I did not capture traffic on the receiver
and will try that next time).

I captured the last 3-4 minutes of traffic on the sender before the connection was broken. This
explains all the observations:
# the sender has a lot of data in its 'sendbuf'
# the receiver has a lot of data in its 'recvbuf', but the DataNode is blocked in a read on this socket
# after 16 minutes or so, the sender's write fails with a 'connect timeout' exception
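The first two symptoms above can be reproduced in miniature with a plain local socket pair. This is only an illustration of the generic TCP behavior (a peer that stops reading causes the sender's buffer to fill), not HDFS code; the buffer size is arbitrary:

```python
import socket

# Local stand-in for the pipeline connection: 'a' is the sender, 'b' the
# receiver that never reads (like the blocked DataNode).
a, b = socket.socketpair()
a.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
a.setblocking(False)

sent = 0
try:
    while True:
        # Keep writing until the sender's sendbuf (plus the peer's
        # recvbuf) is full; a blocking write here would hang instead.
        sent += a.send(b"x" * 4096)
except BlockingIOError:
    pass

print(sent > 0)  # prints True: some data was buffered before the stall
a.close()
b.close()
```

With blocking sockets and no reader, the final `send` would simply block, which matches the observed stuck write until the OS-level timeout fires.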

The missing packet is also confirmed by the fact that every packet from the remote side carries
SACK data (a TCP option) of "1448-31332 (relative values)". This implies the receiver is missing
the first 1448 bytes above the acked seqno.
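The SACK arithmetic above can be sketched as follows; the function name and structure are mine, not from the capture, but the numbers are the ones reported:

```python
def missing_ranges(ack_rel, sack_blocks):
    """Given the cumulative ACK point (relative seqno) and a list of SACK
    blocks as (left, right) relative edges, return the byte ranges the
    receiver is still missing below the highest SACKed edge."""
    gaps = []
    prev = ack_rel
    for left, right in sorted(sack_blocks):
        if left > prev:
            gaps.append((prev, left))
        prev = max(prev, right)
    return gaps

# Every ACK in the capture carried SACK "1448-31332 (relative)":
print(missing_ranges(0, [(1448, 31332)]))  # -> [(0, 1448)]
```

That is, the receiver holds bytes 1448-31332 out of order but is still waiting for the first 1448 bytes, i.e. exactly one lost MSS-sized segment.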

There are two retransmissions of this missing packet in the capture (2 minutes apart). ethereal
says their checksum is incorrect (not sure how dependable that is, since we don't know whether
checksum offloading is in effect). But both retransmissions carry the same wrong checksum value,
even though the values should differ because the TCP headers differ. Traffic captured on the
receiver would make this clearer.
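To see why identical checksums on two different retransmissions are suspicious, here is the standard Internet checksum (RFC 1071, as used by TCP) on two byte strings differing in a single byte; the headers themselves are stand-ins, not from the capture:

```python
def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: 16-bit one's-complement sum,
    complemented. TCP computes this over header, payload and a
    pseudo-header."""
    if len(data) % 2:
        data += b"\x00"
    s = 0
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
        s = (s & 0xFFFF) + (s >> 16)  # fold the carry back in
    return ~s & 0xFFFF

hdr1 = bytes(20)              # stand-in "TCP header"
hdr2 = bytes(19) + b"\x01"    # same header with one byte changed
print(inet_checksum(hdr1) != inet_checksum(hdr2))  # prints True
```

Any change to the header (sequence numbers, timestamps) changes the correct checksum, so two retransmissions showing the same wrong value points at corruption or mis-capture rather than a normal offload artifact.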

In any case, this is not an application bug.

> DFS writes stuck occasionally
> -----------------------------
>                 Key: HADOOP-3132
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3132
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Runping Qi
>            Assignee: Raghu Angadi
>             Fix For: 0.18.0
> This problem happens in 0.17 trunk.
> As reported in HADOOP-3124,
> I saw reducers wait 10 minutes while writing data to DFS and then get a timeout.
> The client retried and timed out again after another 19 minutes.
> During the period the write was stuck, all the nodes in the datanode pipeline were functioning
> and the system load was normal.
> I don't believe this was due to slow network cards/disk drives or overloaded machines.
> I believe this and HADOOP-3033 are related somehow.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
