hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4681) DFSClient block read failures cause open DFSInputStream to become unusable
Date Tue, 25 Nov 2008 23:57:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650802#action_12650802 ]

Chris Douglas commented on HADOOP-4681:
---------------------------------------

With the current patch, an unrecoverable block is retried about 2x times (where
x = dfs.client.max.block.acquire.failures). More precisely, it retries N\*x times, where N is
the initial value of {{retries}} in DFSClient.DFSInputStream::read. To be fair, it looks like
the current code exercises the same retry logic, but because it never resets {{failures}}, the
first exhaustion of sources causes it to bail out. It's not clear what the semantics are supposed
to be in this case, but it's worth noting that this patch would change them.
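
The retry arithmetic above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the DFSClient code: the class {{RetryModel}}, its fields, and its simplified methods are hypothetical stand-ins named after the methods in the stack trace, {{maxFailures}} plays the role of dfs.client.max.block.acquire.failures (x), and the block is assumed to have no reachable replica.

```java
import java.io.IOException;

// Hypothetical, simplified model of the retry accounting; not the real DFSClient.
class RetryModel {
    private final int maxFailures;          // stands in for x
    private final boolean resetBeforeBlock; // true models the patched behavior
    private int failures = 0;               // failure counter kept on the stream
    private int attempts = 0;               // failed datanode lookups observed

    RetryModel(int maxFailures, boolean resetBeforeBlock) {
        this.maxFailures = maxFailures;
        this.resetBeforeBlock = resetBeforeBlock;
    }

    // Models looking for a datanode when every replica is unavailable.
    private void chooseDataNode() throws IOException {
        while (true) {
            if (failures >= maxFailures) {
                throw new IOException("Could not obtain block");
            }
            failures++;
            attempts++;
        }
    }

    // Models seeking to a block; the patched variant resets the counter first.
    private void blockSeekTo() throws IOException {
        if (resetBeforeBlock) {
            failures = 0;
        }
        chooseDataNode();
    }

    // Models the outer read loop; retries is N. Returns the total failed attempts.
    int read(int retries) {
        while (retries > 0) {
            try {
                blockSeekTo();
            } catch (IOException e) {
                retries--;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        int n = 2, x = 3;
        System.out.println("with reset:    " + new RetryModel(x, true).read(n));  // 6 = N*x
        System.out.println("without reset: " + new RetryModel(x, false).read(n)); // 3 = x
    }
}
```

In this model, resetting the counter makes each of the N outer retries exhaust all x attempts again, while the unpatched variant fails immediately on every retry after the first exhaustion.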

bq. This seems to be also related to HADOOP-3185, but I'm not sure I really understand necessity
of keeping track of failed block accesses in the DFS client.
IIRC, the intent of HADOOP-3185 was to avoid filling the deadNodes list with good nodes hosting
earlier, bad blocks. It also improves on the quick fix in HADOOP-1911.

I haven't been able to find a path where applying the patch reintroduces an infinite loop.

> DFSClient block read failures cause open DFSInputStream to become unusable
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-4681
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4681
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.18.2, 0.19.0, 0.19.1, 0.20.0
>            Reporter: Igor Bolotin
>             Fix For: 0.19.1, 0.20.0
>
>         Attachments: 4681.patch
>
>
> We are using some Lucene indexes directly from HDFS, and for quite a long time we were using Hadoop version 0.15.3.
> When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
> at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
> at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
> at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
> at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
> at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
> at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
> at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54) 
> ...
> The investigation showed that the root cause of this issue was that we exceeded the number of xcievers in the data nodes, and that was fixed by changing the configuration setting to 2k.
> However, one thing that bothered me was that even after the datanodes recovered from the overload and most of the client servers had been shut down, we still observed errors in the logs of the running servers.
> Further investigation showed that the fix for HADOOP-1911 introduced another problem: the DFSInputStream instance might become unusable once the number of failures over the lifetime of this instance exceeds the configured threshold.
> The fix for this specific issue seems to be trivial: just reset the failure counter before reading the next block (patch will be attached shortly).
> This seems to be also related to HADOOP-3185, but I'm not sure I really understand necessity
of keeping track of failed block accesses in the DFS client.
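
The lifetime-counter problem described in the issue can be sketched with a toy model. This is an illustration under assumptions, not the DFSInputStream implementation: the class {{LifetimeCounterModel}} and its methods are hypothetical, every block is assumed to hit exactly one transient failure before a datanode answers, and {{maxFailures}} stands in for the configured threshold.

```java
import java.io.IOException;

// Hypothetical model of a stream whose failure counter spans its whole lifetime.
class LifetimeCounterModel {
    private final int maxFailures;       // stands in for the configured threshold
    private final boolean resetPerBlock; // true models the proposed fix
    private int failures = 0;

    LifetimeCounterModel(int maxFailures, boolean resetPerBlock) {
        this.maxFailures = maxFailures;
        this.resetPerBlock = resetPerBlock;
    }

    // Reads one block that fails transientFailures times before succeeding.
    void readBlock(int transientFailures) throws IOException {
        if (resetPerBlock) {
            failures = 0;
        }
        for (int i = 0; i < transientFailures; i++) {
            if (failures >= maxFailures) {
                throw new IOException("Could not obtain block");
            }
            failures++;
        }
    }

    // Returns how many of the given blocks were read successfully.
    int readBlocks(int blocks) {
        int ok = 0;
        for (int b = 0; b < blocks; b++) {
            try {
                readBlock(1);
                ok++;
            } catch (IOException e) {
                // In this model the block read is simply counted as failed.
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println("lifetime counter:  " + new LifetimeCounterModel(3, false).readBlocks(10)); // 3
        System.out.println("reset per block:   " + new LifetimeCounterModel(3, true).readBlocks(10));  // 10
    }
}
```

Without the reset, the model stops recovering from failures once the threshold has been reached, even though each individual failure is transient; with the reset, every block gets a fresh budget.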

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

