hadoop-hdfs-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
Date Thu, 02 Jun 2011 23:55:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043144#comment-13043144 ]

Matt Foley commented on HDFS-2011:
----------------------------------

Hi Ravi, the logic of your changes is fine.  The following comments are almost all about
common usage in the Hadoop code base and its unit tests.

TestCheckpoint.checkEditLogFileOutputStreamCloses():
* To obtain the build/test/data directory correctly, use System.getProperty("test.build.data","/tmp")
rather than hardcoding it; then create your desired file relative to that directory.
* I find "elfosFile" to be a very opaque name.  Would it be reasonable to use something like
"editLogStream" instead?
* Instead of Assert.assertTrue("msg",false), use Assert.fail("msg").
* But, there is no need to catch and message exceptions that shouldn't happen.  Both "catch{}"
clauses add no significant value compared to the stack trace that JUnit will print when an
exception escapes.  In fact, the stack trace from the catch-and-Assert is LESS informative than
the original exception's stack trace would have been, because it points into the catch clause
instead of into where the exception actually occurred.
* It's good that you bracket both the beginning and end with printlns that clearly state what
is being tested; if an exception occurs the developer will immediately see what went wrong
(with the help of the stack trace).
* Within the "finally{}" clause, it might be a good idea to put the delete() call in its own
try/catch context (see the sketch after this list).  If delete() itself failed, you wouldn't
want it to mask the original exception, which carries the info you created the testcase to expose.
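
For instance, a minimal sketch combining several of these suggestions -- the (File, int)
constructor and the simple create-and-close scenario are my assumptions, not necessarily
what your patch does:

{code}
public void checkEditLogFileOutputStreamCloses() throws IOException {
  System.out.println("Testing that EditLogFileOutputStream closes cleanly");
  // Resolve the test data directory instead of hardcoding it.
  File testDir = new File(System.getProperty("test.build.data", "/tmp"));
  File editLogFile = new File(testDir, "editLogStream.dat");
  EditLogFileOutputStream editLogStream =
      new EditLogFileOutputStream(editLogFile, 0);  // assumed constructor
  try {
    // No catch-and-Assert here: if close() throws, JUnit prints the
    // original stack trace, which points at the real failure site.
    editLogStream.close();
    System.out.println("Successfully tested that EditLogFileOutputStream"
        + " closes cleanly");
  } finally {
    // Cleanup gets its own context so a problem here can't mask the
    // exception this testcase was written to expose.
    try {
      editLogFile.delete();
    } catch (SecurityException se) {
      System.out.println("Could not delete " + editLogFile + ": " + se);
    }
  }
}
{code}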

checkSetCheckpointTimeInStorageHandlesIOException():
* alFS and alES are also very opaque names.  Hadoop doesn't subscribe to Hungarian naming,
so the "al" prefix isn't needed.  "FS" usually means FileSystem, which isn't the same as a
StorageDirectory.  So consider renaming these, perhaps to fsImageDirs and editsDirs.
* First try/catch context:  Again, there's no need to catch-and-Assert unexpected failures.
 If they occur, JUnit will duly report them.
* As before, the directories you create should be placed relative to System.getProperty("test.build.data","/tmp").
* And to create the URIs, it is probably best to do something equivalent to "new Path(System.getProperty("test.build.data","/tmp"),
"storageDirToCheck").toUri()".  This will work around any filesystem path naming oddities.
* In the assert, use listRsd.get(listRsd.size()-1) instead of listRsd.get(0), because the
new element would be added to the end of the list -- I think :-)
* It might be good to use nnStorage.getEditsDirectories() and/or nnStorage.getImageDirectories()
before deleting the dir, to verify that setStorageDirectories() had the expected result, and
to call nnStorage.getRemovedStorageDirs() beforehand to verify that the list does not initially
contain "storageDirToCheck" (a sketch combining these suggestions follows this list).

NNStorage.setCheckpointTimeInStorage():
* In the comment "//Since writeCheckpointTime may also encounter an IOException in case underlying
storage fails" substitute "reportErrorsOnDirectories()" for "writeCheckpointTime".
* There is a singular reportErrorsOnDirectory() method.  Could you use it instead of reportErrorsOnDirectories()?
 Then you wouldn't need to construct the ArrayList.
* In the LOG.error, if the second IOE happens, suggest
LOG.error("Failed to report and remove NN storage directory " + sd.getRoot().getPath(), ioe);
Besides clarifying the message, note that "+ ioe" uses ioe.toString(), which prints only a
single line about the exception, while passing ioe as a separate LOG argument causes the
entire stack trace to be printed (see the sketch below).
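
Concretely, the error path might then look like this (a sketch; I'm assuming the singular
reportErrorsOnDirectory() takes the StorageDirectory directly):

{code}
try {
  // Singular form: no need to construct a one-element ArrayList.
  reportErrorsOnDirectory(sd);
} catch (IOException ioe) {
  // Passing ioe as a separate argument logs the entire stack trace;
  // "+ ioe" would append only ioe.toString(), a single line.
  LOG.error("Failed to report and remove NN storage directory "
      + sd.getRoot().getPath(), ioe);
}
{code}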

EditLogFileOutputStream.close():
* Suggest you separate the bufCurrent and bufReady cases.  Do:
{code}
if (bufCurrent != null) {
  int bufSize = bufCurrent.size();
  if (bufSize != 0) {
    throw new IOException("FSEditStream has " + bufSize
        + " bytes still to be flushed and cannot be closed.");
  }
  bufCurrent.close();
}
if (bufReady != null) {
  bufReady.close();
}
{code}


> Removal and restoration of storage directories on checkpointing failure doesn't work properly
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2011
>                 URL: https://issues.apache.org/jira/browse/HDFS-2011
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-2011.3.patch, HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch
>
>
> Removal and restoration of storage directories on checkpointing failure doesn't work properly.
> Sometimes it throws a NullPointerException and sometimes it doesn't take a failed storage
> directory offline.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
