Date: Mon, 30 May 2011 17:10:47 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Message-ID: <1953857013.53783.1306775447664.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <2086595669.53727.1306773107650.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041200#comment-13041200 ]

Todd Lipcon commented on HDFS-2011:
-----------------------------------

Any chance of unit tests for these?

> Removal and restoration of storage directories on checkpointing failure doesn't work properly
> ----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2011
>                 URL: https://issues.apache.org/jira/browse/HDFS-2011
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-2011.patch
>
>
> I had been automating tests to verify the removal and restoration of storage directories. I was testing by setting up a loopback file system, using it as one of the storage directories, and filling it up so that the namenode's checkpoint writes would fail.
> Mostly the functionality worked. However, very often I would see this exception in the logs:
> 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed.
> java.io.IOException: No space left on device
> 	at java.io.FileOutputStream.writeBytes(Native Method)
> 	at java.io.FileOutputStream.write(FileOutputStream.java:297)
> 	at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> 	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
> 	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
> 	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
> 	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> 	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> 	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> 	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
> 	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> 	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> 	at org.mortbay.jetty.Server.handle(Server.java:324)
> 	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
> 	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
> 	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
> 	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
> 	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
> 	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
> 	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> In this case the storage directory wasn't taken offline; it was not removed from the list. John George figured out that this was because the IOException was thrown in a code path from which the function to remove the corresponding storage directory was never called.
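
A minimal Java sketch of the pattern the description points at: every path that can hit an IOException while writing to a storage directory has to report the failure so the directory is removed from the active list, rather than only some paths doing so. The class, field, and method names below are hypothetical; this is not the actual HDFS code.

// Illustrative sketch only: CheckpointWriter, activeDirs, and writeImageToAllDirs are
// hypothetical names standing in for the real namenode storage machinery.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CheckpointWriter {
  private final List<File> activeDirs = new ArrayList<File>();

  void writeImageToAllDirs(byte[] image) {
    List<File> failedDirs = new ArrayList<File>();
    for (File dir : activeDirs) {
      FileOutputStream out = null;
      try {
        out = new FileOutputStream(new File(dir, "fsimage.ckpt"));
        out.write(image);                     // can fail with "No space left on device"
      } catch (IOException e) {
        failedDirs.add(dir);                  // remember which directory failed
      } finally {
        if (out != null) {
          try { out.close(); } catch (IOException e) { failedDirs.add(dir); }
        }
      }
    }
    activeDirs.removeAll(failedDirs);         // take failed directories offline on every path
  }
}
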
> Also, very rarely, I would see this exception:
> 2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
> 	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)
> After this, the Secondary Namenode and the Namenode would go into an infinite loop of these NullPointerExceptions. John George figured out that this was because close() was being called on the editStream twice (it was trying to close an edit stream that had already been closed).
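
A minimal sketch of the kind of idempotent-close guard that avoids this failure mode: a second call to close() becomes a no-op instead of dereferencing a stream that was already torn down. The wrapper class below is hypothetical, not the real EditLogFileOutputStream.

// Illustrative sketch only: GuardedEditLogStream is a hypothetical wrapper.
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;

class GuardedEditLogStream implements Closeable {
  private OutputStream out;                    // null once the stream has been closed

  GuardedEditLogStream(OutputStream out) {
    this.out = out;
  }

  @Override
  public synchronized void close() throws IOException {
    if (out == null) {
      return;                                  // already closed: second call is harmless
    }
    try {
      out.flush();
      out.close();
    } finally {
      out = null;                              // mark closed exactly once
    }
  }
}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira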