hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Anoop Sam John (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18137) Replication gets stuck for empty WALs
Date Tue, 13 Jun 2017 01:15:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047307#comment-16047307
] 

Anoop Sam John commented on HBASE-18137:
----------------------------------------

bq. A new config "replication.source.eof.autorecovery"
Normally our configs will be prefixed with 'hbase.'  right?  Any reason its diff here? Or
just missed?  Current replication area configs are this way.  Did not check. Just asking as
noticed in the Release Notes.

> Replication gets stuck for empty WALs
> -------------------------------------
>
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7
>
>         Attachments: HBASE-18137.branch-1.3.v1.patch, HBASE-18137.branch-1.3.v2.patch,
HBASE-18137.branch-1.3.v3.patch, HBASE-18137.branch-1.v1.patch, HBASE-18137.branch-1.v2.patch,
HBASE-18137.master.v1.patch
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty. But, intermittent
DFS issues may cause empty WALs being created (without the PWAL magic), and a roll of WAL
to happen without a regionserver crash. This will cause recovered queues to have empty WALs
in the middle. This cause replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue>
Got: 
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty but there were other WALs in the recovered queue
which were newer and non-empty.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message