hadoop-hdfs-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
Date Mon, 30 May 2011 16:43:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041195#comment-13041195
] 

Hadoop QA commented on HDFS-2011:
---------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12480858/HDFS-2011.patch
  against trunk revision 1128987.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/659//console

This message is automatically generated.

> Removal and restoration of storage directories on checkpointing failure doesn't work properly
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2011
>                 URL: https://issues.apache.org/jira/browse/HDFS-2011
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-2011.patch
>
>
> I had been automating tests to verify the removal and restoration of storage directories. I was testing by setting up a loopback file system, using it as one of the storage directories, and filling it up to make the Hadoop namenode's writes during checkpointing fail.
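
For illustration, here is a minimal sketch (not the reporter's actual harness) of filling a mounted loopback directory from Java until writes fail with "No space left on device". The mount point /mnt/loopfs/dfs/name and the filler file name are assumptions made for this example.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class FillStorageDir {
        // Assumed mount point of the loopback file system used as a storage
        // directory; the real path from the test is not given in the report.
        private static final File STORAGE_DIR = new File("/mnt/loopfs/dfs/name");

        public static void main(String[] args) {
            File filler = new File(STORAGE_DIR, "filler.bin");
            byte[] chunk = new byte[1 << 20]; // write 1 MB of zeros at a time
            try (FileOutputStream out = new FileOutputStream(filler)) {
                while (true) {
                    out.write(chunk); // eventually fails with "No space left on device"
                }
            } catch (IOException e) {
                System.out.println("Device filled: " + e.getMessage());
            }
        }
    }
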
> Mostly I would see the functionality work. However, very often I would see this exception in the logs:
> 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:297)
>         at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
>         at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>         at org.mortbay.jetty.Server.handle(Server.java:324)
>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
>         at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> In this case the storage directory wasn't taken offline; it would not be removed from the list. John George figured out this was because the IOException was happening in a code path from which the function to remove the corresponding storage directory wasn't being called.
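
A rough sketch of the kind of fix this points at: catch the IOException on the image-transfer path and report the failed directory so it can be dropped from the storage list. The class, helper, and callback names below are hypothetical stand-ins, not the actual TransferFsImage/GetImageServlet code and not the HDFS-2011 patch.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    class ImageTransfer {
        // Hypothetical callback that removes a failed directory from the active
        // storage list, analogous to what processIOError() does on other paths.
        interface StorageErrorReporter {
            void reportError(File failedDir, IOException cause);
        }

        // Write the checkpoint image into one storage directory. The reported bug
        // was that on this path the IOException escaped without the failed
        // directory ever being reported, so it stayed in the storage list.
        static void writeImage(byte[] data, File dir, StorageErrorReporter reporter)
                throws IOException {
            File image = new File(dir, "fsimage.ckpt");
            try (OutputStream out = new FileOutputStream(image)) {
                out.write(data); // "No space left on device" surfaces here
            } catch (IOException e) {
                reporter.reportError(dir, e); // the missing step: take the directory offline
                throw e;
            }
        }
    }
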
> Also, very rarely, I would see this exception
> 2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)
> After this, the Secondary Namenode and the Namenode would go into an infinite loop of these NullPointerExceptions. John George figured out this was because close was being called on the editStream twice, i.e., it was trying to close an editStream which was already closed.
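
A sketch of that double-close failure mode and the usual guard against it (an idempotent close). The class name, field, and method below are simplified stand-ins for EditLogFileOutputStream, not the code as shipped.

    import java.io.FileOutputStream;
    import java.io.IOException;

    class EditStream {
        private FileOutputStream fp; // set to null once the stream is closed

        EditStream(FileOutputStream fp) {
            this.fp = fp;
        }

        void close() throws IOException {
            if (fp == null) {
                // Without this guard, a second close() dereferences the already
                // nulled stream and throws the NullPointerException seen above,
                // which the error-handling path then re-triggers on every call.
                return; // make close() idempotent
            }
            try {
                fp.getChannel().force(true); // flush buffered edits to disk
                fp.close();
            } finally {
                fp = null;
            }
        }
    }
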

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
