hbase-issues mailing list archives

From "Vincent Poon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18137) Replication gets stuck for empty WALs
Date Thu, 08 Jun 2017 05:53:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042239#comment-16042239 ]

Vincent Poon commented on HBASE-18137:

[~anoop.hbase] The master code is completely different, as the logic has been refactored and
moved to ReplicationSourceWALReaderThread.  We would basically need to do something similar
there - if sleepMultiplier hits the max due to EOFException, we can force the WALEntryStream
to move on to the next log in the queue.

Thinking on this some more, we can add an additional check of the file length, and only consider
dumping the log if the length is 0.  Is it possible for DFS to report a length of 0 when there's
actually data somewhere?
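
To make the proposal above concrete, here is a rough Java sketch of the check: skip the log at the head of the queue only when the last failure was an EOFException, the retry multiplier has reached its ceiling, and the file length is 0. None of this is HBase code. EmptyWalSkipSketch, MAX_SLEEP_MULTIPLIER, shouldSkip and advanceIfEmpty are hypothetical names, and java.nio's Files.size on a local file stands in for a DFS length lookup. Requiring that another log be waiting behind the head is also an assumption of the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Deque;

/** Hypothetical illustration only; not taken from the HBase code base. */
public class EmptyWalSkipSketch {

    /** Assumed retry ceiling, standing in for the maximum sleep multiplier. */
    static final int MAX_SLEEP_MULTIPLIER = 3;

    /**
     * True only when all three conditions hold: the last failure was an
     * EOFException, retries are exhausted, and the file length is 0.
     */
    static boolean shouldSkip(Path wal, int sleepMultiplier, boolean lastErrorWasEof)
            throws IOException {
        return lastErrorWasEof
                && sleepMultiplier >= MAX_SLEEP_MULTIPLIER
                && Files.size(wal) == 0L; // local stand-in for a DFS length check
    }

    /**
     * Drops the head of the queue if it qualifies and another log is waiting
     * behind it. Returns true if the queue advanced.
     */
    static boolean advanceIfEmpty(Deque<Path> queue, int sleepMultiplier, boolean lastErrorWasEof)
            throws IOException {
        Path head = queue.peekFirst();
        if (head != null && queue.size() > 1
                && shouldSkip(head, sleepMultiplier, lastErrorWasEof)) {
            queue.pollFirst();
            return true;
        }
        return false;
    }
}
```

The sketch does not settle the open question: if DFS can report a length of 0 for a file that actually holds data, the length check alone would not be a safe guard.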

> Replication gets stuck for empty WALs
> -------------------------------------
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>         Attachments: HBASE-18137.branch-1.3.v1.patch
> Replication assumes that only the last WAL of a recovered queue can be empty. But intermittent
DFS issues may cause empty WALs to be created (without the PWAL magic), and a roll of the WAL
to happen without a regionserver crash. This will cause recovered queues to have empty WALs
in the middle. This causes replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue>
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty but there were other WALs in the recovered queue
which were newer and non-empty.
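
The trace in the quoted report bottoms out in DataInputStream.readFully, which throws EOFException when the stream ends before the requested bytes arrive. A small standalone Java sketch of that behaviour follows. EmptyHeaderReadSketch and headerReadHitsEof are hypothetical names, an in-memory byte array stands in for the WAL file, and the 4-byte header length used in the usage note is assumed for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

/** Hypothetical illustration only; not taken from the HBase or Hadoop code. */
public class EmptyHeaderReadSketch {

    /**
     * Tries to read headerLength bytes from the given contents.
     * Returns true if the stream ended first, i.e. readFully threw EOFException.
     */
    static boolean headerReadHitsEof(byte[] fileContents, int headerLength) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileContents))) {
            in.readFully(new byte[headerLength]);
            return false; // header bytes were all present
        } catch (EOFException e) {
            return true; // file ended before the header was complete
        }
    }
}
```

Calling headerReadHitsEof(new byte[0], 4) returns true, which is the empty-file case; any input of at least 4 bytes returns false.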

This message was sent by Atlassian JIRA