Message-ID: <1729251995.142011264805374856.JavaMail.jira@brutus.apache.org>
Date: Fri, 29 Jan 2010 22:49:34 +0000 (UTC)
From: "Tsz Wo (Nicholas), SZE (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Subject: [jira] Resolved: (HDFS-127) DFSClient block read failures cause open DFSInputStream to become unusable
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/HDFS-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE resolved HDFS-127.
-----------------------------------------

    Resolution: Fixed

> I actually did open a new patch for this issue on trunk - HDFS-927 which is linked here. Sorry for the confusion.

Great!  Let's close this and work on HDFS-927.

> DFSClient block read failures cause open DFSInputStream to become unusable
> --------------------------------------------------------------------------
>
>                 Key: HDFS-127
>                 URL: https://issues.apache.org/jira/browse/HDFS-127
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>            Reporter: Igor Bolotin
>            Assignee: Igor Bolotin
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: 4681.patch, h127_20091016.patch, h127_20091019.patch, h127_20091019b.patch, hdfs-127-branch20-redone-v2.txt, hdfs-127-branch20-redone.txt, hdfs-127-regression-test.txt
>
>
> We are using some Lucene indexes directly from HDFS and for quite long time we were using Hadoop version 0.15.3.
> When tried to upgrade to Hadoop 0.19 - index searches started to fail with exceptions like:
> 2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>     at java.io.DataInputStream.read(DataInputStream.java:132)
>     at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>     at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>     at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
>     at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
>     at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
>     at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
>     at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
>     at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
>     ...
> The investigation showed that the root of this issue is that we exceeded # of xcievers in the data nodes and that was fixed by changing configuration settings to 2k.
> However - one thing that bothered me was that even after datanodes recovered from overload and most of client servers had been shut down - we still observed errors in the logs of running servers.
> Further investigation showed that fix for HADOOP-1911 introduced another problem - the DFSInputStream instance might become unusable once number of failures over lifetime of this instance exceeds configured threshold.
> The fix for this specific issue seems to be trivial - just reset failure counter before reading next block (patch will be attached shortly).
> This seems to be also related to HADOOP-3185, but I'm not sure I really understand necessity of keeping track of failed block accesses in the DFS client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
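
The failure-counter behaviour described in the quoted issue can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the actual DFSClient code or the attached patch: the class FailureCounterSketch, its methods, and the threshold value of 3 are all hypothetical names and numbers chosen for illustration. It compares a stream whose counter accumulates over its lifetime with one that resets the counter before each block.

```java
import java.io.IOException;

/** Hypothetical, simplified model of a block-reading stream; not real Hadoop code. */
final class FailureCounterSketch {
    /** Assumed threshold, standing in for the configured failure limit. */
    static final int MAX_FAILURES = 3;

    private final boolean resetPerBlock;
    private int failures = 0;

    FailureCounterSketch(boolean resetPerBlock) {
        this.resetPerBlock = resetPerBlock;
    }

    /**
     * Simulates reading one block. The first transientFailures attempts fail,
     * and each failure increments the counter. The read is abandoned as soon
     * as the counter reaches MAX_FAILURES.
     */
    void readBlock(int transientFailures) throws IOException {
        if (resetPerBlock) {
            failures = 0; // the proposed fix: start every block with a clean counter
        }
        for (int i = 0; i < transientFailures; i++) {
            failures++;
            if (failures >= MAX_FAILURES) {
                throw new IOException("Could not obtain block");
            }
        }
        // Reaching this point means an attempt finally succeeded.
    }

    /** Counts how many blocks are read successfully on a single stream instance. */
    static int blocksRead(boolean resetPerBlock, int[] failuresPerBlock) {
        FailureCounterSketch stream = new FailureCounterSketch(resetPerBlock);
        int ok = 0;
        for (int f : failuresPerBlock) {
            try {
                stream.readBlock(f);
                ok++;
            } catch (IOException e) {
                // The same stream instance is reused for the next block.
            }
        }
        return ok;
    }
}
```

In this model, a sequence of blocks that each suffer a single transient failure eventually exhausts the lifetime counter, after which every later block that hits any failure is rejected. With the per-block reset, the same sequence is read completely.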