Message-ID: <22385155.1188960165245.JavaMail.jira@brutus>
Date: Tue, 4 Sep 2007 19:42:45 -0700 (PDT)
From: "Konstantin Shvachko (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1076) Periodic checkpointing cannot resume if the secondary name-node fails.
In-Reply-To: <10755861.1173240564117.JavaMail.root@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524955 ]

Konstantin Shvachko commented on HADOOP-1076:
---------------------------------------------

I managed to corrupt the current name-node image using your patch.
Actually the image was set to an empty file, so the name-node would not even restart after the checkpoint.

I started 2 secondary nodes. The first of them was in the middle of getFSImage() when the second called rollFSImage() and received the following exception:

java.lang.IllegalStateException: Committed
	at org.mortbay.jetty.servlet.ServletHttpResponse.resetBuffer(ServletHttpResponse.java:212)
	at org.mortbay.jetty.servlet.ServletHttpResponse.sendError(ServletHttpResponse.java:375)
	at org.apache.hadoop.dfs.SecondaryNameNode$GetImageServlet.doGet(SecondaryNameNode.java:455)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
	at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
	at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
	at org.mortbay.http.HttpServer.service(HttpServer.java:954)
	at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
	at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
	at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
	at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
	at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
	at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

This is likely to be related to the patch, since otherwise the second secondary node would just get an exception trying to rollEditsLog().
My guess is that your patch prohibits rollFSImage() when the edits log has not been rolled, instead of prohibiting 2 simultaneous rollFSImage() calls.

> Periodic checkpointing cannot resume if the secondary name-node fails.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-1076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Konstantin Shvachko
>            Assignee: dhruba borthakur
>             Fix For: 0.15.0
>
>         Attachments: secondaryRestart.patch
>
>
> If the secondary name-node fails during checkpointing then the primary node will have 2 edits files.
> "edits" - is the one which the current checkpoint is to be based upon.
> "edits.new" - is where new name space edits are currently logged.
> The problem is that the primary node cannot do checkpointing as long as the "edits.new" file is in place.
> That is, even if the secondary name-node is restarted, periodic checkpointing is not going to be resumed.
> In fact the primary node will be throwing an exception complaining about the existing "edits.new".
> There is only one way to get rid of the edits.new file - to restart the primary name-node.
> So in a way if the secondary name-node fails then you have to restart the whole cluster.
> Here is a rather simple modification to the current approach, which we discussed with Dhruba.
> When the secondary node requests rollEditLog() the primary node should roll the edit log only if
> it has not already been rolled. Otherwise the existing "edits" file will be used for checkpointing
> and the primary node will keep accumulating new edits in the "edits.new".
> In order to make it work the primary node should also ignore any rollFSImage() requests when it
> has already started to perform one. Otherwise the new image can become corrupted if two secondary
> nodes request rollFSImage() at the same time.
> 2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of unused code.
> I noticed one data member SecondaryNameNode.localName and at least 4 methods in FSEditLog
> that are not used anywhere. We should remove them and others alike if found.
> Supporting unused code is such a waste of time.
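
For illustration, the two rules proposed in the issue description (roll the edit log only if it has not already been rolled, and ignore rollFSImage() requests while one is in progress) could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not the attached patch and not the real FSImage/FSEditLog code: the class CheckpointGuard, its fields, and the Runnable parameter are hypothetical names invented here, and the reset of the "rolled" flag after a successful image roll assumes that "edits.new" becomes "edits" at that point.

```java
/**
 * Hypothetical sketch of the guard logic from the issue description.
 * None of these names exist in Hadoop; they are illustrative only.
 */
class CheckpointGuard {
    // Assumption: true while an "edits.new" file exists on the primary node.
    private boolean editsRolled = false;
    // True while some rollFSImage() request is being carried out.
    private boolean imageRollInProgress = false;

    /**
     * Rolls the edit log only if it has not been rolled already.
     * Returns true if this call rolled it, false if the existing
     * "edits" file should simply be reused for the checkpoint.
     */
    synchronized boolean rollEditLog() {
        if (editsRolled) {
            return false; // new edits keep accumulating in "edits.new"
        }
        editsRolled = true;
        return true;
    }

    /**
     * Installs a new image unless another request is already doing so.
     * Returns true if the image was installed, false if the request
     * was ignored because a roll was already in progress.
     */
    boolean rollFSImage(Runnable installNewImage) {
        synchronized (this) {
            if (imageRollInProgress) {
                return false; // ignore a simultaneous request
            }
            imageRollInProgress = true;
        }
        try {
            installNewImage.run();
            synchronized (this) {
                // Assumption: "edits.new" has now replaced "edits".
                editsRolled = false;
            }
            return true;
        } finally {
            synchronized (this) {
                imageRollInProgress = false;
            }
        }
    }
}
```

In this sketch a second secondary node calling rollFSImage() while the first request is still running gets false back and the image is left untouched. The sketch deliberately does not decide what should happen when rollFSImage() is requested without a preceding rollEditLog(); that question is left open here.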