hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3820) Splitlog() executed while the namenode was in safemode may cause data-loss
Date Wed, 04 May 2011 19:48:03 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028918#comment-13028918 ]

stack commented on HBASE-3820:

Here is some feedback on the patch Jieshan:

You change the test in MasterFileSystem to test if the filesystem is writable, as
opposed to available, but the message you throw is "File system is not available".
Actually, you create an IOE but you do not throw it; you just pass it to abort.

Why remove the try/catch from MasterFileSystem#checkFileSystem?  You can only do that
because you discard any IOEs that come up when you do this check 'if (dfs.exists(new
Path("/")) && !checkDfsSafeMode(conf)) {'.  That does not seem wise (you are hiding
information on why the FS is not available/writable).

The method you add to FSUtils is called checkFileSystemWritable yet it does not write
to the FS; it only tests the existence of '/' and checks for safe mode.
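If the intent really is a write-based check, one option is to create and delete a probe file, so the method's name matches its behavior. The sketch below is a hypothetical illustration, not code from the patch: it uses java.nio on a local path instead of the Hadoop FileSystem API so it is self-contained, and the probe-file naming is made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a check that actually writes to the filesystem.
// On HDFS in safe mode, the create would fail fast with an IOException.
public class WriteProbe {
    // Returns true iff a probe file can be created and deleted under root.
    static boolean isWritable(Path root) {
        Path probe = root.resolve(".writable-probe-" + System.nanoTime());
        try {
            Files.createFile(probe);  // fails in a read-only filesystem
            Files.delete(probe);
            return true;
        } catch (IOException e) {
            return false;             // not writable; caller can log the cause
        }
    }
}
```

An equivalent probe against HDFS would go through the FileSystem create/delete calls rather than java.nio, but the shape of the check is the same.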

Do we want to abort the master if in safe mode?  Would it not be better for the master
to just wait on expiration of fs safe mode?  (You do it in HLogSplitter but you only
wait ten seconds, which seems short.)
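A wait-instead-of-abort loop of the kind suggested above might look like the sketch below. This is a hypothetical illustration, not the patch under review: `inSafeMode` stands in for an HDFS safe-mode query, and the backoff values are made-up defaults, not anything from HBase.

```java
import java.util.function.BooleanSupplier;

public class SafeModeWait {
    // Polls until the safe-mode check clears or maxWaitMs is exceeded.
    // The sleep doubles each round, capped at 30 seconds.
    static boolean waitForSafeModeExit(BooleanSupplier inSafeMode,
                                       long maxWaitMs,
                                       long initialSleepMs) throws InterruptedException {
        long waited = 0;
        long sleep = initialSleepMs;
        while (inSafeMode.getAsBoolean()) {
            if (waited >= maxWaitMs) {
                return false;                    // gave up; caller decides what to do
            }
            Thread.sleep(sleep);
            waited += sleep;
            sleep = Math.min(sleep * 2, 30_000); // cap the backoff at 30 seconds
        }
        return true;                             // safe mode has cleared
    }
}
```

With a generous maxWaitMs this lets the master ride out a transient safe mode instead of aborting, while a bounded wait still surfaces a namenode that never recovers.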

> Splitlog() executed while the namenode was in safemode may cause data-loss
> --------------------------------------------------------------------------
>                 Key: HBASE-3820
>                 URL: https://issues.apache.org/jira/browse/HBASE-3820
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.2
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>         Attachments: HBASE-3820-MFSFix-90-V2.patch, HBASE-3820-MFSFix-90.patch
> I found this problem when the namenode went into safemode for unclear reasons.

> There's one patch about this problem:
>    try {
>       HLogSplitter splitter = HLogSplitter.createLogSplitter(
>         conf, rootdir, logDir, oldLogDir, this.fs);
>       try {
>         splitter.splitLog();
>       } catch (OrphanHLogAfterSplitException e) {
>         LOG.warn("Retrying splitting because of:", e);
>         // An HLogSplitter instance can only be used once.  Get new instance.
>         splitter = HLogSplitter.createLogSplitter(conf, rootdir, logDir,
>           oldLogDir, this.fs);
>         splitter.splitLog();
>       }
>       splitTime = splitter.getTime();
>       splitLogSize = splitter.getSize();
>     } catch (IOException e) {
>       checkFileSystem();
>       LOG.error("Failed splitting " + logDir.toString(), e);
>       master.abort("Shutting down HBase cluster: Failed splitting hlog files...", e);
>     } finally {
>       this.splitLogLock.unlock();
>     }
> And it does give some useful help to some extent when the namenode process has exited
or been killed, but it does not consider the namenode safemode case.
>    I think the root cause is in the method checkFileSystem().
>    It is meant to check whether HDFS is working normally (both reads and writes
succeed), and that was probably the original purpose of this method. This is how the method is implemented:
>     DistributedFileSystem dfs = (DistributedFileSystem) fs;
>     try {
>       if (dfs.exists(new Path("/"))) {  
>         return;
>       }
>     } catch (IOException e) {
>       exception = RemoteExceptionHandler.checkIOException(e);
>     }
>    I have checked the hdfs code and learned that while the namenode is in safemode,
dfs.exists(new Path("/")) returns true, because the filesystem can still provide read-only service.
So this method only checks whether the dfs can be read, which I think is not reasonable.
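The reporter's point, that a read-only namenode still answers exists() with true, can be illustrated with a minimal stub. This is plain Java, not HDFS; the MiniFs interface and SafeModeFs class are invented for the illustration.

```java
import java.io.IOException;

// Minimal stub (not HDFS) showing why an existence check cannot prove
// writability: reads keep working in safe mode while writes fail.
interface MiniFs {
    boolean exists(String path);
    void create(String path) throws IOException;
}

class SafeModeFs implements MiniFs {
    public boolean exists(String path) {         // reads still succeed
        return "/".equals(path);
    }
    public void create(String path) throws IOException {
        throw new IOException("Name node is in safe mode"); // writes fail
    }
}
```

Against such a filesystem, checkFileSystem() as quoted above would report everything healthy, which is exactly the gap the issue describes.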

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
