Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 8 Jun 2017 19:01:18 +0000 (UTC)
From: "Vincent Poon (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.13076058.1496194244000.31448.1496948478425@Atlassian.JIRA>
In-Reply-To: <JIRA.13076058.1496194244000@Atlassian.JIRA>
References: <JIRA.13076058.1496194244000@Atlassian.JIRA> <JIRA.13076058.1496194244782@jira-lw-us.apache.org>
Subject: [jira] [Updated] (HBASE-18137) Replication gets stuck for empty
 WALs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 08 Jun 2017 19:01:24 -0000


     [ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Poon updated HBASE-18137:
---------------------------------
    Attachment: HBASE-18137.branch-1.3.v2.patch

Added a check for zero length.

With this patch we dump the current file and move on only if all three conditions hold: we get an EOFException, the file length is 0, and there are WALs in the queue behind this one. We take the last condition to mean that the current WAL is closed, so there really is no data in it.
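
For readers who do not want to open the patch, the three-part condition can be sketched roughly as follows. This is an illustration only, not the code in HBASE-18137.branch-1.3.v2.patch: the class name, the method name, and the `walsQueuedBehind` parameter are hypothetical stand-ins for whatever the replication source actually tracks.

```java
import java.io.EOFException;
import java.io.IOException;

final class EmptyWalSkipSketch {

    /**
     * Hypothetical helper: decides whether the current WAL can be dropped
     * from the queue. All three conditions from the comment must hold.
     */
    static boolean shouldSkipCurrentWal(IOException failure,
                                        long currentWalLength,
                                        int walsQueuedBehind) {
        // 1. Opening the WAL failed with an EOFException.
        boolean hitEof = failure instanceof EOFException;
        // 2. The file is zero bytes long.
        boolean isEmpty = currentWalLength == 0;
        // 3. Newer WALs are queued behind it, which is taken to mean
        //    the current WAL is closed and will never receive data.
        boolean hasNewerWals = walsQueuedBehind > 0;
        return hitEof && isEmpty && hasNewerWals;
    }

    public static void main(String[] args) {
        // Empty WAL in the middle of a queue: skip it.
        System.out.println(shouldSkipCurrentWal(new EOFException(), 0L, 2));
        // Empty WAL at the tail: it may still be open, so keep waiting.
        System.out.println(shouldSkipCurrentWal(new EOFException(), 0L, 0));
    }
}
```

Requiring all three conditions is the conservative choice: an EOFException on a non-empty file, or on the last file in the queue, may just mean the writer has not flushed yet.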

> Replication gets stuck for empty WALs
> -------------------------------------
>
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>
>         Attachments: HBASE-18137.branch-1.3.v1.patch, HBASE-18137.branch-1.3.v2.patch
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty. However, intermittent DFS issues may cause empty WALs to be created (without the PWAL magic) and a WAL roll to happen without a regionserver crash. This leaves recovered queues with empty WALs in the middle, which causes replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got: 
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty, but there were other WALs in the recovered queue that were newer and non-empty.
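
The top frames of the trace above (`DataInputStream.readFully`) can be reproduced without HBase or HDFS. The following is a minimal sketch, assuming a reader that starts by reading a fixed-size header; the class name, method name, and the 4-byte header length are illustrative assumptions, not taken from the HBase or Hadoop sources.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

final class EmptyWalEofSketch {

    /**
     * Returns true if reading a header of headerLength bytes from the
     * given file contents runs into end-of-file.
     */
    static boolean headerReadHitsEof(byte[] fileContents, int headerLength) {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(fileContents));
        try {
            // readFully throws EOFException if the stream ends before
            // the whole buffer has been filled.
            in.readFully(new byte[headerLength]);
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A zero-length "file" cannot supply even the header.
        System.out.println(headerReadHitsEof(new byte[0], 4));
        // Four bytes are enough to satisfy a 4-byte header read.
        System.out.println(headerReadHitsEof(new byte[] {'S', 'E', 'Q', 6}, 4));
    }
}
```

Because the exception is raised while the reader is being constructed, retrying the same file cannot succeed once the file is closed and still empty, which is consistent with the stuck behaviour described above.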

