hadoop-hdfs-issues mailing list archives

From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
Date Wed, 06 Jul 2011 19:16:17 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060775#comment-13060775
] 

Ravi Prakash commented on HDFS-2011:
------------------------------------

I had noticed close() being called twice while testing this functionality. The second call was
causing a NullPointerException. The stack trace is given in comment https://issues.apache.org/jira/browse/HDFS-2011?focusedCommentId=13041858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13041858

{quote}
2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)
{quote}

The bug itself is quite hard to reproduce. I had to run my tests in an infinite loop, and the
NullPointerException appeared only after 3-4 hours (each test run takes roughly 2 minutes).
After the NullPointerException, the namenode was essentially unusable: even hdfs dfs -ls
would throw a NullPointerException.

I am not sure which philosophy is better: silently ignoring a second close, or treating it
as an error. FileOutputStream itself ignores a second close. I checked this with the following program:

{noformat}
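import java.io.FileOutputStream;
import java.io.IOException;

// Checks whether FileOutputStream tolerates close() being called twice.
public class TestJAVA
{
	public static void main(String[] args)
	{
		System.out.println("Hello World");
		try {
			FileOutputStream fos = new FileOutputStream("/tmp/ravi.txt");
			// Write a few bytes (50 is the character '2') so the file is not empty.
			for (int i = 0; i < 6; i++) {
				fos.write(50);
			}
			fos.close();
			fos.close(); // second close: no exception is thrown
		} catch (IOException ioe) {
			// Reached only if opening, writing, or closing the file fails.
			System.out.println("Hello California");
			System.out.println(ioe);
		}
		System.out.println("Hello Champaign");
	}
}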
{noformat}
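
To make the two options concrete, here is a minimal sketch of both behaviours. The classes UnguardedStream, GuardedStream and DoubleCloseDemo are hypothetical and exist only for illustration; they are not the actual EditLogFileOutputStream code, and the assumption that the real stream nulls out a field on close is mine. The guarded variant follows the java.io.Closeable contract, which says that closing an already-closed stream has no effect.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper (NOT the real EditLogFileOutputStream).
// close() drops its delegate, so a second call dereferences null.
class UnguardedStream implements Closeable {
    private OutputStream out;

    UnguardedStream(OutputStream out) {
        this.out = out;
    }

    @Override
    public void close() throws IOException {
        out.close(); // NullPointerException on the second call
        out = null;
    }
}

// Hypothetical wrapper whose close() is idempotent, like FileOutputStream.
class GuardedStream implements Closeable {
    private OutputStream out;

    GuardedStream(OutputStream out) {
        this.out = out;
    }

    @Override
    public void close() throws IOException {
        if (out == null) {
            return; // already closed: ignore the repeated call
        }
        try {
            out.close();
        } finally {
            out = null; // mark as closed even if the delegate's close failed
        }
    }

    boolean isClosed() {
        return out == null;
    }
}

class DoubleCloseDemo {
    public static void main(String[] args) throws IOException {
        GuardedStream guarded = new GuardedStream(new ByteArrayOutputStream());
        guarded.close();
        guarded.close(); // silently ignored
        System.out.println("guarded: second close ignored");

        UnguardedStream unguarded = new UnguardedStream(new ByteArrayOutputStream());
        unguarded.close();
        try {
            unguarded.close();
        } catch (NullPointerException npe) {
            System.out.println("unguarded: second close threw NullPointerException");
        }
    }
}
```

The guarded version hides a redundant call from the caller, while the unguarded one surfaces it, though only as an unhelpful NullPointerException. A stricter alternative would be to throw an explicit IOException naming the double close.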

> Removal and restoration of storage directories on checkpointing failure doesn't work properly
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2011
>                 URL: https://issues.apache.org/jira/browse/HDFS-2011
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>             Fix For: 0.23.0
>
>         Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch
>
>
> Removal and restoration of storage directories on checkpointing failure doesn't work properly. Sometimes it throws a NullPointerException and sometimes it doesn't take off a failed storage directory.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
