hbase-issues mailing list archives

From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19792) TestReplicationSmallTests.testDisableEnable fails
Date Mon, 15 Jan 2018 07:07:02 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325937#comment-16325937
] 

Duo Zhang commented on HBASE-19792:
-----------------------------------

OK, there is a critical bug which may lead to data inconsistency during replication.

Here is the log:
{noformat}
2018-01-15 03:13:16,649 INFO  [asf911:59564Replication Statistics #0] regionserver.Replication$ReplicationStatisticsThread(352):
Normal source for cluster 2: Total replicated edits: 4, current progress:
walGroup [asf911.gq1.ygridcore.net%2C59564%2C1515985899707]: currently replicating from: /home/jenkins/jenkins-slave/workspace/HBase-Flaky-Tests/hbase-mapreduce/target/test-data/0eb32823-c0e3-4ebf-af26-f6fd89e2cbdd/asf911.gq1.ygridcore.net%2C59564%2C1515985899707.1515985992379
at position: 4871


2018-01-15 03:13:16,649 INFO  [asf911:56726Replication Statistics #0] regionserver.Replication$ReplicationStatisticsThread(352):
Normal source for cluster 2: Total replicated edits: 1003, current progress:
walGroup [asf911.gq1.ygridcore.net%2C56726%2C1515985899619]: currently replicating from: /home/jenkins/jenkins-slave/workspace/HBase-Flaky-Tests/hbase-mapreduce/target/test-data/0eb32823-c0e3-4ebf-af26-f6fd89e2cbdd/asf911.gq1.ygridcore.net%2C56726%2C1515985899619.1515985992379
at position: 122068


2018-01-15 03:13:16,937 DEBUG [RS_REFRESH_PEER-asf911:59564-0.replicationSource,2.replicationSource.asf911.gq1.ygridcore.net%2C59564%2C1515985899707,2.replicationSource.wal-reader.asf911.gq1.ygridcore.net%2C59564%2C1515985899707,2]
regionserver.ReplicationSourceWALReader(153): Failed to read stream of replication entries:
java.io.EOFException: Cannot seek after EOF
2018-01-15 03:13:16,937 DEBUG [RS_REFRESH_PEER-asf911:56726-0.replicationSource,2.replicationSource.asf911.gq1.ygridcore.net%2C56726%2C1515985899619,2.replicationSource.wal-reader.asf911.gq1.ygridcore.net%2C56726%2C1515985899619,2]
regionserver.ReplicationSourceWALReader(153): Failed to read stream of replication entries:
java.io.EOFException: Cannot seek after EOF
2018-01-15 03:13:17,940 ERROR [RS_REFRESH_PEER-asf911:59564-0.replicationSource,2.replicationSource.asf911.gq1.ygridcore.net%2C59564%2C1515985899707,2.replicationSource.wal-reader.asf911.gq1.ygridcore.net%2C59564%2C1515985899707,2]
regionserver.ReplicationSourceWALReader(156): Failed to read stream of replication entries
java.io.EOFException: Cannot seek after EOF
        at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1516)
        at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initInternal(ProtobufLogReader.java:209)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initReader(ProtobufLogReader.java:171)
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:65)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:166)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:298)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:280)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:268)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:403)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:344)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:295)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:286)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:171)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:99)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:169)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:137)
2018-01-15 03:13:17,940 ERROR [RS_REFRESH_PEER-asf911:56726-0.replicationSource,2.replicationSource.asf911.gq1.ygridcore.net%2C56726%2C1515985899619,2.replicationSource.wal-reader.asf911.gq1.ygridcore.net%2C56726%2C1515985899619,2]
regionserver.ReplicationSourceWALReader(156): Failed to read stream of replication entries
java.io.EOFException: Cannot seek after EOF
        at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1516)
        at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initInternal(ProtobufLogReader.java:209)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initReader(ProtobufLogReader.java:171)
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:65)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:166)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:298)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:280)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:268)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:403)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:344)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:295)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:286)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:171)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:99)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:169)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:137)
2018-01-15 03:13:17,941 WARN  [RS_REFRESH_PEER-asf911:59564-0.replicationSource,2.replicationSource.asf911.gq1.ygridcore.net%2C59564%2C1515985899707,2.replicationSource.wal-reader.asf911.gq1.ygridcore.net%2C59564%2C1515985899707,2]
regionserver.ReplicationSourceWALReader(211): Forcing removal of 0 length log in queue: /home/jenkins/jenkins-slave/workspace/HBase-Flaky-Tests/hbase-mapreduce/target/test-data/0eb32823-c0e3-4ebf-af26-f6fd89e2cbdd/asf911.gq1.ygridcore.net%2C59564%2C1515985899707.1515985992379
2018-01-15 03:13:17,941 WARN  [RS_REFRESH_PEER-asf911:56726-0.replicationSource,2.replicationSource.asf911.gq1.ygridcore.net%2C56726%2C1515985899619,2.replicationSource.wal-reader.asf911.gq1.ygridcore.net%2C56726%2C1515985899619,2]
regionserver.ReplicationSourceWALReader(211): Forcing removal of 0 length log in queue: /home/jenkins/jenkins-slave/workspace/HBase-Flaky-Tests/hbase-mapreduce/target/test-data/0eb32823-c0e3-4ebf-af26-f6fd89e2cbdd/asf911.gq1.ygridcore.net%2C56726%2C1515985899619.1515985992379
{noformat}

You can see that the logs are not zero-length; we have actually replicated something. Then
we hit EOF, and then there is a forced removal which claims the files are both zero-length.

The code is like this:

{code}
    if (e instanceof EOFException ||
        e.getCause() instanceof EOFException && logQueue.size() > 1 &&
        this.eofAutoRecovery) {
      try {
        if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
          LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
          logQueue.remove();
          currentPosition = 0;
        }
      } catch (IOException ioe) {
        LOG.warn("Couldn't get file length information about log " + logQueue.peek());
      }
    }
{code}

The immediate cause here is that these files are all still open for writing, so getFileStatus
can return a zero length if there is only one block, and then we remove the files from the
log queue. Still not sure why this could lead to the FNFE later, but we need to fix this...

I do not think the author wanted this to happen, since there is a 'logQueue.size() > 1'
check, but operator precedence causes the problem... We missed a pair of '()'... Let me fix
this issue first and then see whether the test still fails.
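For reference, since {{&&}} binds more tightly than {{||}} in Java, the condition above parses as {{e instanceof EOFException || (e.getCause() instanceof EOFException && logQueue.size() > 1 && this.eofAutoRecovery)}}, so a direct EOFException bypasses both guards. A minimal, self-contained sketch of the two groupings, with the operands modeled as plain booleans (an illustration, not the actual patch):

{code}
public class PrecedenceDemo {

    // The condition as written: && binds tighter than ||, so the size and
    // eofAutoRecovery guards only apply to the getCause() branch.
    static boolean asWritten(boolean isEof, boolean causeIsEof,
                             boolean queueBig, boolean autoRecovery) {
        return isEof || causeIsEof && queueBig && autoRecovery;
    }

    // The presumably intended grouping: both guards apply to both branches.
    static boolean intended(boolean isEof, boolean causeIsEof,
                            boolean queueBig, boolean autoRecovery) {
        return (isEof || causeIsEof) && queueBig && autoRecovery;
    }

    public static void main(String[] args) {
        // A direct EOFException with only one log in the queue and
        // auto-recovery disabled: the written condition still fires,
        // the intended one does not.
        System.out.println(asWritten(true, false, false, false)); // true
        System.out.println(intended(true, false, false, false));  // false
    }
}
{code}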

> TestReplicationSmallTests.testDisableEnable fails
> -------------------------------------------------
>
>                 Key: HBASE-19792
>                 URL: https://issues.apache.org/jira/browse/HBASE-19792
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: org.apache.hadoop.hbase.replication.TestReplicationSmallTests-output.txt
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
