hadoop-mapreduce-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject BlockMissingException reading HDFS file, but the block exists and fsck shows OK
Date Mon, 27 Jan 2014 15:06:37 GMT
I am getting this perplexing error.  Our YARN application launches tasks that attempt to simultaneously
open a large number of HDFS files for a merge.  There seems to be a load threshold on our four-node
cluster related to how many files the tasks open at once: the threshold is hit at 32 tasks, each opening
450 files, but not at 16 tasks, each opening 250 files.
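
For illustration, each task does roughly the following with the HDFS Java API (this is only a sketch;
the class name and glob pattern are made up to match the paths below, not our actual code):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MergeTaskSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Each task opens all of its merge inputs up front and holds them open.
        FileStatus[] parts =
            fs.globStatus(new Path("/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data*.dld"));
        List<FSDataInputStream> inputs = new ArrayList<>();
        try {
          for (FileStatus stat : parts) {
            inputs.add(fs.open(stat.getPath()));   // ~450 open streams per task
          }
          // ... k-way merge over 'inputs' ...
        } finally {
          for (FSDataInputStream in : inputs) {
            in.close();
          }
        }
      }
    }

So at 32 such tasks we have on the order of 32 x 450 streams open against the cluster at once.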

The files are stored in HDFS with replication=1.  I know that low replication leaves me open
to node-failure issues, but bear with me, nothing is actually failing.

I get this exception when attempting to open a file:
org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
    org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
    org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
    org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
    org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)
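
For reference, the read that hits this is just a positional read through FSDataInputStream, roughly the
following (with 'fs' as in the sketch above and an arbitrary buffer size):

    FSDataInputStream in =
        fs.open(new Path("/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld"));
    byte[] buf = new byte[64 * 1024];
    int n = in.read(0L, buf, 0, buf.length);   // BlockMissingException surfaces here

Nothing unusual about the call itself.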

However, the block is definitely *not* missing.  I can run the following command continuously
while all of this is going on:
hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
From well before the tasks start, it shows good files all around, including:
/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s):  OK
0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]

My application logs also show that *some* tasks are able to open the files for which a missing
block is reported.
In case you suspect it: the files are not being deleted.  fsck continues to show good status
for these files well after the error is reported.
I've also checked to ensure that the files are not still being held open by their creators.

This leads me to believe that I've hit an HDFS open-file limit of some kind.  We can compensate
fairly easily by doing a two-phase merge that opens far fewer files simultaneously (sketched below),
keeping a limited pool of open files, and so on.  However, I would still like to know what limit is
being hit, and how best to predict that limit on various cluster configurations.
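
To make that concrete, the multi-pass version would look roughly like this, using the same imports as
the sketch above (MAX_OPEN and mergeGroup() are placeholders, not actual code):

    // Merge 'inputs' down to one file without ever holding more than
    // MAX_OPEN streams open at a time.  mergeGroup() stands in for a
    // k-way merge of at most MAX_OPEN inputs into 'out'.
    static final int MAX_OPEN = 64;   // assumed cap; would be tuned per cluster

    static Path mergeAll(FileSystem fs, List<Path> inputs, Path workDir)
        throws IOException {
      List<Path> current = new ArrayList<>(inputs);
      int pass = 0;
      while (current.size() > 1) {
        List<Path> next = new ArrayList<>();
        for (int i = 0; i < current.size(); i += MAX_OPEN) {
          List<Path> group =
              current.subList(i, Math.min(i + MAX_OPEN, current.size()));
          Path out = new Path(workDir, "pass" + pass + "_" + (i / MAX_OPEN));
          mergeGroup(fs, group, out);          // placeholder k-way merge
          next.add(out);
        }
        current = next;
        pass++;
      }
      return current.get(0);
    }

With a cap of 64, 450 inputs collapse in two passes instead of one, at the cost of rewriting the
intermediate data once.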

Thanks,
john
