hadoop-common-user mailing list archives

From akshaymb <akshaybhara...@gmail.com>
Subject Help with DFSClient Exception.
Date Mon, 28 May 2012 10:27:27 GMT

Hi,

We are frequently observing the following exception on our cluster:

java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could
not complete file
/output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002.
Giving up.

The exception occurs while writing a file.  We are using Hadoop 0.20.2.  It
is a ~250-node cluster, and on average one box goes down every 3 days.

Detailed stack trace:
12/05/27 23:26:54 INFO mapred.JobClient: Task Id :
attempt_201205232329_28133_r_000002_0, Status : FAILED
java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could
not complete file
/output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002. 
Giving up.
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
        at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
        at
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Our investigation:
We have the min replication factor set to 2.  As mentioned here
(http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html), "A call to
complete() will not return true until all the file's blocks have been
replicated the minimum number of times.  Thus, DataNode failures may cause a
client to call complete() several times before succeeding", so the client
should retry complete() several times.
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() does call
complete() and retries it up to 20 times, but even so the file's blocks are
not replicated the minimum number of times within those attempts.  The retry
count is not configurable.  Lowering the min replication factor to 1 is not a
good option either, since jobs are running on our cluster continuously.
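For reference, below is a rough sketch (our paraphrase, not the actual
DFSClient source) of the retry we are describing.  The NameNodeStub
interface, the completeFile() helper name, and the sleep interval are
illustrative; the hard-coded bound of 20 attempts is the part we would like
to be configurable.

import java.io.IOException;

// Sketch of the bounded complete() retry performed while closing a file.
class CompleteRetrySketch {

  interface NameNodeStub {
    // stands in for the ClientProtocol complete(src, clientName) RPC
    boolean complete(String src, String clientName) throws IOException;
  }

  static void completeFile(NameNodeStub namenode, String src, String clientName)
      throws IOException {
    int retriesLeft = 20;              // hard-coded, not configurable in 0.20.2
    boolean fileComplete = false;
    while (!fileComplete) {
      fileComplete = namenode.complete(src, clientName);
      if (!fileComplete) {
        if (--retriesLeft == 0) {
          throw new IOException("Could not complete file " + src + ". Giving up.");
        }
        try {
          Thread.sleep(400);           // back-off before asking again (assumed interval)
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
      }
    }
  }
}

Once those attempts are exhausted, close() throws and the task attempt is
marked FAILED, which matches the JobClient output above.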

Is there any solution or workaround for this problem?

What min replication factor is generally used in industry?
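(To be precise, by "min replication factor" we mean the dfs.replication.min
property in hdfs-site.xml on the NameNode, which we have set to 2.  A small
hypothetical helper like the one below is how we double-check what value a
node's configuration resolves to; the class name is just illustrative.)

import org.apache.hadoop.conf.Configuration;

// Hypothetical check: print the min replication factor that the local
// Hadoop configuration resolves to.
public class MinReplicationCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("hdfs-site.xml");  // make sure the HDFS config is consulted
    // dfs.replication.min defaults to 1 in 0.20.x; ours is 2.
    int minReplication = conf.getInt("dfs.replication.min", 1);
    System.out.println("dfs.replication.min = " + minReplication);
  }
}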

Let me know if any further inputs are required.

Thanks,
-Akshay



-- 
View this message in context: http://old.nabble.com/Help-with-DFSClient-Exception.-tp33918949p33918949.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

