hbase-issues mailing list archives

From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15252) Data loss when replaying wal if HDFS timeout
Date Thu, 11 Feb 2016 04:17:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142227#comment-15142227 ]

Duo Zhang commented on HBASE-15252:

Changing the exception type back to IPBE can solve the problem (which causes the openHRegion
to fail with an IOException), but I want to revisit the readNext method because I'm a little confused
about how we deal with {{EOFException}}.

      } catch (EOFException eof) {
        LOG.trace("Encountered a malformed edit, seeking back to last good position in file", eof);
        // If originalPosition is < 0, it is rubbish and we cannot use it (probably local fs)
        if (originalPosition < 0) throw eof;
        // Else restore our position to original location in hope that next time through we
        // will read successfully.
        seekOnFs(originalPosition);
        return false;
      }

Here we seek back to the last good position, but then we "return false" instead of "continue".
This causes the {{next}} method of {{ReaderBase}} to return null, which makes the upper layer think
it has reached the end of file and close the current log reader. So what is the purpose of the
seek here? And in fact, if the {{EOFException}} really means end of file, I do not think we
could read a valid wal entry successfully when retrying...
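The contract described above can be shown with a minimal, self-contained sketch. The names here ({{ReaderContractSketch}}, the string-array "WAL") are hypothetical stand-ins, not the real HBase classes; the point is only that a {{readNext}} returning false makes {{next}} return null, which the replay loop reads as a clean end of file even when entries remain after the malformed edit:

```java
// Simplified sketch (hypothetical names): readNext returning false makes
// next() return null, which the replay loop treats as end of file.
public class ReaderContractSketch {
    // Pretend WAL with 3 entries; the entry at index 1 is "malformed".
    static final String[] ENTRIES = {"e0", "BAD", "e2"};
    static int pos = 0;

    // Mimics readNext: on a malformed edit we give up instead of retrying.
    static boolean readNext(StringBuilder out) {
        if (pos >= ENTRIES.length) return false; // true end of file
        String e = ENTRIES[pos];
        if (e.equals("BAD")) {
            // ... seek back to last good position, then:
            return false; // <-- upper layer will think this is EOF
        }
        pos++;
        out.append(e);
        return true;
    }

    // Mimics ReaderBase.next: null signals end of file to the caller.
    static String next() {
        StringBuilder sb = new StringBuilder();
        return readNext(sb) ? sb.toString() : null;
    }

    public static void main(String[] args) {
        int replayed = 0;
        for (String e; (e = next()) != null; ) replayed++;
        // Only e0 is replayed; e2 after the malformed edit is silently skipped.
        System.out.println("replayed=" + replayed);
    }
}
```

In this toy run only one of the two good entries is replayed, which is exactly why "return false" after the seek looks suspicious: the seek buys nothing if the caller never retries.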


> Data loss when replaying wal if HDFS timeout
> --------------------------------------------
>                 Key: HBASE-15252
>                 URL: https://issues.apache.org/jira/browse/HBASE-15252
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>         Attachments: HBASE-15252-testcase.patch
> This is a problem introduced by HBASE-13825 where we changed the exception type in the catch
> block in the {{readNext}} method of {{ProtobufLogReader}}.
> {code:title=ProtobufLogReader.java}
>       try {
>           ......
>           ProtobufUtil.mergeFrom(builder, new LimitInputStream(this.inputStream, size),
>             (int)size);
>         } catch (IOException ipbe) { // <------ used to be InvalidProtocolBufferException
>           throw (EOFException) new EOFException("Invalid PB, EOF? Ignoring; originalPosition="
>             + originalPosition + ", currentPosition=" + this.inputStream.getPos() +
>             ", messageSize=" + size + ", currentAvailable=" + available).initCause(ipbe);
>         }
> {code}
> Here if the {{inputStream}} throws an {{IOException}} due to a timeout or something, we just
> convert it to an {{EOFException}}, and at the bottom of this method we ignore the {{EOFException}}
> and return false. This causes the upper layer to think it has reached the end of file, so when
> replaying we will treat an HDFS timeout error as a normal end of file, causing data loss.
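The widened catch described above can be sketched in a minimal, self-contained example. The names here ({{mergeFrom}}, {{CatchWideningSketch}}) are simplified stand-ins for illustration, not the real ProtobufLogReader code:

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.SocketTimeoutException;

public class CatchWideningSketch {
    // Stand-in for the PB merge call; rethrows whatever the stream raised.
    static void mergeFrom(IOException streamFailure) throws IOException {
        throw streamFailure;
    }

    // Mimics the post-HBASE-13825 catch block: ANY IOException becomes an
    // EOFException, so a transient HDFS timeout looks like end of file.
    static boolean readNext(IOException streamFailure) {
        try {
            mergeFrom(streamFailure);
            return true;
        } catch (IOException ipbe) { // used to be InvalidProtocolBufferException
            EOFException eof = new EOFException("Invalid PB, EOF?");
            eof.initCause(ipbe);
            // ... and the bottom of readNext swallows the EOFException:
            return false; // caller treats this as a clean EOF -> data loss
        }
    }

    public static void main(String[] args) {
        // A read timeout on the stream is silently reported as end of file:
        boolean sawEof = !readNext(new SocketTimeoutException("hdfs read timed out"));
        System.out.println("timeout treated as EOF: " + sawEof);
    }
}
```

Narrowing the catch back to the protobuf-specific exception (or rethrowing IOExceptions that are not PB parse failures) would let the timeout propagate to the caller instead of masquerading as end of file.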
