From: "Tsz Wo (Nicholas), SZE (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Fri, 29 Jan 2010 21:41:34 +0000 (UTC)
Subject: [jira] Commented: (HDFS-127) DFSClient block read failures cause open DFSInputStream to become unusable

    [ https://issues.apache.org/jira/browse/HDFS-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806517#action_12806517 ]

Tsz Wo (Nicholas), SZE commented on HDFS-127:
---------------------------------------------

I ran unit tests on the patch. The following tests failed:
{noformat}
    [junit] Running org.apache.hadoop.io.TestUTF8
    [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 0.161 sec
    [junit] Running org.apache.hadoop.hdfsproxy.TestHdfsProxy
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 4.219 sec
{noformat}
BTW, since the new patch is quite different from the patch already committed, I suggest fixing the problem in a new issue so that the new patch can be committed to 0.20, 0.21, and trunk. Otherwise, I am not sure how to commit the new patch.

> DFSClient block read failures cause open DFSInputStream to become unusable
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-127
>                 URL: https://issues.apache.org/jira/browse/HDFS-127
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>            Reporter: Igor Bolotin
>            Assignee: Igor Bolotin
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: 4681.patch, h127_20091016.patch, h127_20091019.patch, h127_20091019b.patch, hdfs-127-branch20-redone-v2.txt, hdfs-127-branch20-redone.txt, hdfs-127-regression-test.txt
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long time we were using Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
> at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
> at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
> at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
> at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
> at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
> at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
> at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
> ...
> The investigation showed that the root of this issue is that we exceeded the number of xcievers on the datanodes; that was fixed by raising the configured limit to 2k.
> However, one thing that bothered me was that even after the datanodes recovered from the overload and most of the client servers had been shut down, we still observed errors in the logs of the running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another problem: a DFSInputStream instance can become unusable once the number of failures over the lifetime of the instance exceeds the configured threshold.
> The fix for this specific issue seems to be trivial: just reset the failure counter before reading the next block (patch will be attached shortly; see the sketch after this message for the general idea).
> This also seems to be related to HADOOP-3185, but I'm not sure I really understand the necessity of keeping track of failed block accesses in the DFS client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
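The quoted description says the fix is simply to reset the failure counter before reading the next block. Below is a minimal, hypothetical Java sketch of that idea only; it is not the actual DFSClient/DFSInputStream code or any of the attached patches, and the names (BlockReaderSketch, maxBlockAcquireFailures, readFromSomeDatanode) are invented for illustration. The point is that the failure counter is scoped to the current block rather than to the lifetime of the stream:

{noformat}
import java.io.IOException;

/**
 * Illustrative sketch only: a reader that retries a block read,
 * counting failures per block instead of per stream lifetime.
 */
public class BlockReaderSketch {

    // Configured retry threshold (name assumed for the sketch).
    private final int maxBlockAcquireFailures;
    // Failures seen while trying to fetch the *current* block only.
    private int failures;

    public BlockReaderSketch(int maxBlockAcquireFailures) {
        this.maxBlockAcquireFailures = maxBlockAcquireFailures;
    }

    /** Called whenever the stream moves on to a new block. */
    public void seekToNewBlock(long blockId) throws IOException {
        // The essence of the fix described above: start each block with a
        // clean slate instead of accumulating failures forever.
        failures = 0;
        fetchBlock(blockId);
    }

    private void fetchBlock(long blockId) throws IOException {
        while (true) {
            try {
                readFromSomeDatanode(blockId);  // placeholder for the real datanode read
                return;
            } catch (IOException e) {
                failures++;
                if (failures >= maxBlockAcquireFailures) {
                    throw new IOException("Could not obtain block: blk_" + blockId, e);
                }
                // Otherwise retry; the real client would try a different datanode.
            }
        }
    }

    private void readFromSomeDatanode(long blockId) throws IOException {
        // Placeholder: the real client contacts a datanode holding the block.
    }
}
{noformat}

With a counter scoped to the stream's lifetime, enough transient failures (for example, during the temporary xciever exhaustion described above) push the stream over the threshold permanently, even after the datanodes recover; resetting the counter per block avoids exactly that.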