hadoop-user mailing list archives

From Peyman Mohajerian <mohaj...@gmail.com>
Subject Re: BlockMissingException reading HDFS file, but the block exists and fsck shows OK
Date Wed, 29 Jan 2014 02:20:11 GMT
Maybe it's inode exhaustion:
The 'df -i' command can tell you more.
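
For example, run it against whatever filesystem holds the datanode data
directories (the path below is only an example; use your dfs.datanode.data.dir):

df -i /data/1/dfs/dn    # look at the IUsed / IFree columns

If IFree is at or near zero on any volume, operations can start failing even
though 'df -h' still shows plenty of free space.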


On Mon, Jan 27, 2014 at 12:00 PM, John Lilley <john.lilley@redpoint.net> wrote:

>  I've found that the error occurs right around a threshold where 20 tasks
> attempt to open 220 files each.  This is ... slightly over 4k total files
> open.
>
> But that's the total number of open files across the 4-node cluster, and
> since the blocks are evenly distributed, that amounts to 1k connections per
> node, which should not be a problem.
>
> I've run tests wherein a single process on a single node can open over 8k
> files without issue.
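>
> (For reference, quick ways to check the relevant OS-level limits on a node;
> <PID> below is a placeholder for an actual task process:)
>
> ulimit -n                   # per-process open-file limit
> cat /proc/sys/fs/file-max   # system-wide file handle limit
> ls /proc/<PID>/fd | wc -l   # descriptors currently open by that process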
>
> I think that there is some other factor at work, perhaps one of:
>
> 1)      Timing (because the files were just written).
>
> 2)      Multi-node, multi-process access to the same set of files.
>
> 3)      Replication=1 having an influence.
>
>
>
> Any ideas?  I am not seeing any errors in the datanode logs.
>
>
>
> I will run some other tests with replication=3 to see what happens.
>
>
>
> John
>
>
>
>
>
> *From:* John Lilley [mailto:john.lilley@redpoint.net]
> *Sent:* Monday, January 27, 2014 8:41 AM
> *To:* user@hadoop.apache.org
> *Subject:* RE: BlockMissingException reading HDFS file, but the block
> exists and fsck shows OK
>
>
>
> None of the datanode logs have error messages.
>
>
>
> *From:* Harsh J [mailto:harsh@cloudera.com]
> *Sent:* Monday, January 27, 2014 8:15 AM
> *To:* <user@hadoop.apache.org>
> *Subject:* Re: BlockMissingException reading HDFS file, but the block
> exists and fsck shows OK
>
>
>
> Can you check the log of the DN that is holding the specific block for any
> errors?
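>
> Something along these lines, using the block ID from the exception (the log
> path below is a common default; adjust it for your install):
>
> grep blk_1073964234 /var/log/hadoop-hdfs/*datanode*.log*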
>
> On Jan 27, 2014 8:37 PM, "John Lilley" <john.lilley@redpoint.net> wrote:
>
> I am getting this perplexing error.  Our YARN application launches tasks
> that attempt to simultaneously open a large number of files for merge.
> There seems to be a load threshold in terms of the number of simultaneous tasks
> attempting to open a set of HDFS files on a four-node cluster.  The
> threshold is hit at 32 tasks, each opening 450 files.  The threshold is not
> hit at 16 tasks, each opening 250 files.
>
>
>
> The files are stored in HDFS with replication=1.  I know that low
> replication leaves me open to node-failure issues, but bear with me,
> nothing is actually failing.
>
>
>
> I get this exception when attempting to open a file:
>
> org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
> Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
> file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
>     org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
>     org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
>     org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
>     org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)
>
>
>
> However, the block is definitely **not** missing.  I can be running the
> following command continuously while all of this is going on:
>
> hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
>
> Well before the tasks start, it is showing good files all around, including:
>
> /rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614
> bytes, 2 block(s):  OK
>
> 0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385
> len=134217728 repl=1 [192.168.57.110:50010]
>
> 1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
> len=10620886 repl=1 [192.168.57.110:50010]
>
>
>
> My application logs also show that **some** tasks are able to open the
> files for which a missing block is reported.
>
> In case you suspect it, the files are not being deleted.  The fsck continues
> to show good status for these files well after the error report.
>
> I've also checked to ensure that the files are not being held open by the
> creators of the files.
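>
> (One way to confirm that is fsck's -openforwrite option, which reports any
> files under the directory still open for write, e.g.:)
>
> hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -openforwrite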
>
>
>
> This leads me to believe that I've hit an HDFS open-file limit of some
> kind.  We can compensate pretty easily, by doing a two-phase merge that
> opens far fewer files simultaneously, keeping a limited pool of open files,
> etc.  However, I would still like to know what limit is being hit, and how
> to best predict that limit on various cluster configurations.
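>
> (Two limits that seem worth ruling in or out -- the config key below is just
> a guess on my part at a likely suspect, not a confirmed cause:)
>
> ulimit -n
>     # OS open-file limit for the user running the DataNode and the tasks
> hdfs getconf -confKey dfs.datanode.max.transfer.threads
>     # per-DataNode cap on concurrent block transfer threads
>     # (known as dfs.datanode.max.xcievers in older releases)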
>
>
>
> Thanks,
>
> john
>
