hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-18137) Replication gets stuck for empty WALs
Date Mon, 05 Jun 2017 17:06:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037218#comment-16037218 ]

Andrew Purtell edited comment on HBASE-18137 at 6/5/17 5:05 PM:
----------------------------------------------------------------

bq. the workaround was to get a WAL file with just the header and then manually replace the
0 length files with header-but-no-edits file.

Maybe we want to handle this with a small enhancement to hbck that does this as a new fix
strategy. Start by ensuring we emit warnings when a replication queue stalls. Take over or
close HBASE-12125 in lieu of this issue. Implement. Then operators can fix replication queues
that have stalled due to HDFS level issues easily, but only by taking explicit action. There
can also be an option for less conservative folks who prefer automatic handling of this condition.
(There's the off chance the file's missing block actually exists and will be found once an
offline datanode is brought back online.)
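
As a rough illustration of the fix strategy described above, here is a minimal Java sketch. It is not hbck code and uses no HBase or HDFS APIs: it works on local files through java.nio as a stand-in, and the class name EmptyWalCheck, the Mode enum, and both methods are hypothetical names chosen for this example. REPORT_ONLY corresponds to warning and leaving the fix to the operator; FIX corresponds to the opt-in automatic handling, applying the quoted workaround of replacing each zero-length file with a header-but-no-edits file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class EmptyWalCheck {
    /** Hypothetical switch: only warn, or also apply the workaround. */
    enum Mode { REPORT_ONLY, FIX }

    /** Returns the zero-length files in the given queue, in queue order. */
    static List<Path> findEmptyWals(List<Path> queue) throws IOException {
        List<Path> empty = new ArrayList<>();
        for (Path wal : queue) {
            if (Files.size(wal) == 0) {
                empty.add(wal);
            }
        }
        return empty;
    }

    /**
     * Warns about every zero-length file in the queue. In FIX mode each one is
     * also overwritten with a copy of headerOnly, a file assumed to contain a
     * valid WAL header and no edits. Returns the files that were found empty.
     */
    static List<Path> check(List<Path> queue, Path headerOnly, Mode mode) throws IOException {
        List<Path> empty = findEmptyWals(queue);
        for (Path wal : empty) {
            System.err.println("WARN zero-length WAL in replication queue: " + wal);
            if (mode == Mode.FIX) {
                Files.copy(headerOnly, wal, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return empty;
    }
}
```

Defaulting to REPORT_ONLY would match the conservative stance above: a zero length can also mean a block is temporarily unavailable, and the sketch cannot tell the two cases apart.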


was (Author: apurtell):
bq. the workaround was to get a WAL file with just the header and then manually replace the
0 length files with header-but-no-edits file.

Maybe we want to handle this with a small enhancement to hbck that does this as a new fix
strategy. Start by ensuring we emit warnings when a replication queue stalls. Take over or
close HBASE-12125 in lieu of this issue. Implement. Then operators can fix replication queues
that have stalled due to HDFS level issues easily, but only by taking explicit action. There
can also be an option for less conservative folks who prefer automatic handling of this condition.
(There's the off chance the file's missing block will be found once an offline datanode is
brought back online.)

> Replication gets stuck for empty WALs
> -------------------------------------
>
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty. However, intermittent
> DFS issues may cause empty WALs to be created (without the PWAL magic) and a WAL roll
> to happen without a regionserver crash. This leaves empty WALs in the middle of recovered
> queues, which causes replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got:
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty, but there were other WALs in the recovered queue
> which were newer and non-empty.
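
To make the failure mode concrete, here is a small Java sketch that models a recovered queue as a list of WAL file lengths. It is not the ReplicationSource code and not the fix for this issue; the class RecoveredQueueSketch and both methods are hypothetical. firstBlockingIndex finds the position where replication would stall under the "only the last WAL can be empty" assumption, and nextReadable shows one possible relaxation that skips any zero-length WAL regardless of position.

```java
import java.util.List;

public class RecoveredQueueSketch {
    /**
     * Returns the index of the first zero-length WAL that is NOT the last
     * entry of the queue, i.e. the case the description above reports as
     * stalling replication. Returns -1 if there is no such WAL.
     */
    static int firstBlockingIndex(List<Long> walLengths) {
        for (int i = 0; i < walLengths.size() - 1; i++) {
            if (walLengths.get(i) == 0) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Relaxed handling (an assumption of this sketch): starting at 'start',
     * skip zero-length WALs wherever they appear and return the index of the
     * next non-empty one, or -1 if the rest of the queue is empty.
     */
    static int nextReadable(List<Long> walLengths, int start) {
        for (int i = start; i < walLengths.size(); i++) {
            if (walLengths.get(i) > 0) {
                return i;
            }
        }
        return -1;
    }
}
```

For a queue with lengths 120, 0, 340, firstBlockingIndex returns 1 and nextReadable from index 1 returns 2, so the newer non-empty WAL would still be shipped. The sketch treats length as the only signal and does not try to tell a truly empty WAL from one with a missing block.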



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)