hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2883) Extensive write failures
Date Tue, 04 Mar 2008 08:51:52 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-2883:
-------------------------------------

    Attachment: packetResponse.patch

This patch fixes three problems:

1. The datanode used to ack a packet before its content was flushed from the buffered output
stream for the block file. If that flush then failed, the data could get corrupted.
This patch flushes the block file and the metadata file before sending a positive ack to the
client. I have verified that this does not degrade performance of dfsiotest and randomwriter.
2. The original timeout value of 1 minute * length-of-pipeline has been restored. This change
reduces the number of socket timeouts when a datanode is heavily loaded.
3. The datanode now verifies that a packet replay does not create holes in the block file
(sparse files): the offset-in-block of every packet must be less than or equal to the size
of the current block file.
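
For readers without the patch at hand, the three changes can be illustrated with a minimal Java sketch. This is not the code in packetResponse.patch: the class PacketReceiverSketch, its fields, and its method names are hypothetical, the streams stand in for the block file and the metadata file, and checksum handling for partially replayed packets is not modeled.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical sketch of the three fixes; not the actual datanode code. */
class PacketReceiverSketch {
    /** Base timeout of 1 minute, multiplied by the pipeline length (item 2). */
    static final int BASE_TIMEOUT_MS = 60_000;

    private final OutputStream blockOut; // buffered stream for the block file
    private final OutputStream metaOut;  // buffered stream for the metadata file
    private long bytesInBlockFile = 0;   // current size of the block file

    PacketReceiverSketch(OutputStream blockFile, OutputStream metaFile) {
        this.blockOut = new BufferedOutputStream(blockFile);
        this.metaOut = new BufferedOutputStream(metaFile);
    }

    /** Item 2: socket timeout grows with the number of datanodes in the pipeline. */
    static int pipelineTimeoutMs(int pipelineLength) {
        return BASE_TIMEOUT_MS * pipelineLength;
    }

    /**
     * Items 1 and 3: returns true (a positive ack) only after the block data
     * and the metadata have both been flushed. Throws if the packet would
     * start beyond the current end of the block file.
     */
    boolean receivePacket(long offsetInBlock, byte[] data, byte[] checksums)
            throws IOException {
        // Item 3: offset-in-block must be <= size of the current block file,
        // otherwise writing the packet would leave a hole (sparse file).
        if (offsetInBlock > bytesInBlockFile) {
            throw new IOException("Packet at offset " + offsetInBlock
                + " would create a hole; block file has only "
                + bytesInBlockFile + " bytes");
        }

        // A replayed packet may overlap bytes that are already stored.
        // Assumption for this sketch: overlapping bytes are identical, so
        // only the part beyond the current end of the file is appended.
        int alreadyStored =
            (int) Math.min(data.length, bytesInBlockFile - offsetInBlock);
        if (alreadyStored < data.length) {
            blockOut.write(data, alreadyStored, data.length - alreadyStored);
            metaOut.write(checksums); // simplification: checksums appended as-is
            bytesInBlockFile = offsetInBlock + data.length;
        }

        // Item 1: flush both files BEFORE acknowledging. If either flush
        // throws, no positive ack is ever returned for this packet.
        blockOut.flush();
        metaOut.flush();
        return true;
    }
}
```

With this sketch, a packet at offset 0 followed by a replay of the same packet leaves the block size unchanged, a packet whose offset lies past the end of the file is rejected with an IOException, and pipelineTimeoutMs(3) yields a 3 minute timeout for a three-node pipeline.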

> Extensive write failures
> ------------------------
>
>                 Key: HADOOP-2883
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2883
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.0
>            Reporter: Christian Kunz
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.16.1
>
>         Attachments: packetResponse.patch
>
>
> With the new release 0.16.0 we experience extensive write failures under heavy load.
> The job shuffles 300TB on 1400 nodes and runs 3 waves of 2500 reducers. Each reducer uses libhdfs to write to around 70 dfs files simultaneously. We did not experience particular write problems up to nightly build #835 on hadoopqa (Jan 28),
> but now with released 0.16.0 (candidate 2) we see a lot of exceptions related to 'all datanodes are bad':
> typical exception(s):
> 08/02/22 10:34:47 WARN fs.DFSClient: Error Recovery for block blk_434406883423887779
in pipeline xxx.xxx.xxx.146:50010, xxx.xxx.xxx.224:50010: bad datanode xxx.xxx.xxx.146:50010
> 08/02/22 10:34:51 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:34:51 WARN fs.DFSClient: Error Recovery for block blk_-1957866292089920792
in pipeline xxx.xxx.xxx.147:50010, xxx.xxx.xxx.10:50010: bad datanode xxx.xxx.xxx.147:50010
> 08/02/22 10:34:54 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:34:54 WARN fs.DFSClient: Error Recovery for block blk_-5265240773298481019
in pipeline xxx.xxx.xxx.152:50010, xxx.xxx.xxx.71:50010: bad datanode xxx.xxx.xxx.152:50010
> 08/02/22 10:34:54 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:34:54 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed outxxx.xxx.xxx.166:50010
> 08/02/22 10:34:55 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block blk_8456718220685890569
in pipeline xxx.xxx.xxx.158:50010, xxx.xxx.xxx.225:50010: bad datanode xxx.xxx.xxx.158:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block blk_1420965154382429572
in pipeline xxx.xxx.xxx.169:50010, xxx.xxx.xxx.221:50010: bad datanode xxx.xxx.xxx.169:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block blk_-519424763987472708
in pipeline xxx.xxx.xxx.154:50010, xxx.xxx.xxx.37:50010: bad datanode xxx.xxx.xxx.154:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block blk_-8376556524788296783
in pipeline xxx.xxx.xxx.154:50010, xxx.xxx.xxx.212:50010: bad datanode xxx.xxx.xxx.154:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block blk_-2429564741658530079
in pipeline xxx.xxx.xxx.160:50010, xxx.xxx.xxx.105:50010: bad datanode xxx.xxx.xxx.160:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block blk_-6653210787685458124
in pipeline xxx.xxx.xxx.143:50010, xxx.xxx.xxx.37:50010: bad datanode xxx.xxx.xxx.143:50010
> 08/02/22 10:35:01 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:01 WARN fs.DFSClient: Error Recovery for block blk_7515160028005424426
in pipeline xxx.xxx.xxx.167:50010, xxx.xxx.xxx.152:50010: bad datanode xxx.xxx.xxx.167:50010
> 08/02/22 10:35:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:03 WARN fs.DFSClient: Error Recovery for block blk_-7191475898558388503
in pipeline xxx.xxx.xxx.139:50010, xxx.xxx.xxx.6:50010: bad datanode xxx.xxx.xxx.139:50010
> 08/02/22 10:35:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:03 WARN fs.DFSClient: Error Recovery for block blk_-340745015348833165
in pipeline xxx.xxx.xxx.141:50010, xxx.xxx.xxx.153:50010: bad datanode xxx.xxx.xxx.141:50010
> 08/02/22 10:35:04 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:04 WARN fs.DFSClient: Error Recovery for block blk_-6861254790596076341
in pipeline xxx.xxx.xxx.157:50010, xxx.xxx.xxx.224:50010: bad datanode xxx.xxx.xxx.157:50010
> 08/02/22 10:35:14 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:14 INFO fs.DFSClient: Abandoning block blk_6188945400680100475
> 08/02/22 10:35:14 INFO fs.DFSClient: Waiting to find target node: xxx.xxx.xxx.161:50010
> 08/02/22 10:35:43 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:47 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:48 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:49 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:49 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:50 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:50 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:50 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:53 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:54 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:57 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:35:57 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:04 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:06 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:06 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> 08/02/22 10:36:07 INFO fs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
> Exception in thread "main" java.io.IOException: All datanodes xxx.xxx.xxx.83:50010 are
bad. Aborting...
> 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1839)
> 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
> 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
> Call to org.apache.hadoop.fs.FSDataOutputStream::write failed!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

