hbase-issues mailing list archives

From "Vincent Poon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-18137) Replication gets stuck for empty WALs
Date Wed, 07 Jun 2017 22:13:18 GMT

     [ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Poon updated HBASE-18137:
    Attachment: HBASE-18137.branch-1.3.v1.patch

So it seems the code was already designed to handle a case like this. I've attached a test
that reproduces the issue, as well as a one-line fix.

In ReplicationSource#openReader(), we catch IOExceptions. If it's an EOFException and there
is no other log in the queue, we do nothing, because it may simply be that no entries have
been written yet.

But if there is another log in the queue, we log a warning, increase the sleepMultiplier, and
sleep. There is a configurable maximum number of retries, and once that maximum is hit, we log
"Waited too long for this file, considering dumping", call processEndOfFile(), and move on to
the next log.

The problem is that sleepMultiplier was reset to 1 at the top of the loop, so we would never
hit the maximum number of retries.
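
To make the failure mode concrete, here is a minimal, self-contained sketch of the retry pattern
described above. This is not the actual ReplicationSource code; the class, constant, and helper
names are invented for illustration only.

{code}
import java.io.EOFException;
import java.io.IOException;

// Illustrative sketch of the retry loop described above -- NOT the actual
// ReplicationSource implementation; all names here are made up.
public class RetryLoopSketch {
  static final int MAX_RETRIES = 10;        // stand-in for the configurable max # of retries

  public static void main(String[] args) throws InterruptedException {
    boolean moreLogsInQueue = true;         // another (newer) WAL is queued behind the empty one
    int attempts = 0;

    while (attempts < 20) {                 // bounded here only so the demo terminates
      int sleepMultiplier = 1;              // BUG: reset every iteration, so the
                                            // "waited too long" branch below is unreachable
      attempts++;
      try {
        openReader();                       // an empty WAL makes this throw EOFException forever
      } catch (IOException e) {
        if (e instanceof EOFException && moreLogsInQueue) {
          if (sleepMultiplier >= MAX_RETRIES) {
            System.out.println("Waited too long for this file, considering dumping");
            break;                          // i.e. processEndOfFile(): skip to the next WAL
          }
          sleepMultiplier++;
          Thread.sleep(10L * sleepMultiplier);
        }
        // no other log in the queue: do nothing, the WAL may simply have no entries yet
      }
    }
    // The one-line fix described above amounts to initializing sleepMultiplier
    // outside the loop, so repeated EOFs eventually reach MAX_RETRIES.
  }

  static void openReader() throws IOException {
    throw new EOFException("empty WAL, no header magic");
  }
}
{code}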

I'd like to see what others think about this change.  The main assumption is that if you have
another log in the queue, you can presume the current log to be closed.
[~apurtell] [~busbey]

> Replication gets stuck for empty WALs
> -------------------------------------
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>         Attachments: HBASE-18137.branch-1.3.v1.patch
> Replication assumes that only the last WAL of a recovered queue can be empty. But intermittent
DFS issues may cause empty WALs to be created (without the PWAL magic) and a WAL roll to happen
without a regionserver crash. This leaves recovered queues with empty WALs in the middle, which
causes replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue>
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty but there were other WALs in the recovered queue
which were newer and non-empty.
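
For reference, a rough way to reproduce the EOFException above against an empty file. This is
only a sketch: it assumes a local filesystem and the static WALFactory.createReader(FileSystem,
Path, Configuration) overload that appears in the stack trace; the path and class name are
made up.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.wal.WALFactory;

// Sketch only: create a zero-length "WAL" (no header magic written) and try
// to open it the way the replication source does.
public class EmptyWalRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.getLocal(conf);
    Path emptyWal = new Path("/tmp/empty-wal");   // hypothetical location
    FSDataOutputStream out = fs.create(emptyWal, true);
    out.close();                                  // zero bytes: no WAL header/magic
    // Expected to fail with the EOFException shown above (possibly after the
    // reader's own internal retries).
    WALFactory.createReader(fs, emptyWal, conf);
  }
}
{code}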
