Date: Thu, 8 Jun 2017 19:01:18 +0000 (UTC)
From: "Vincent Poon (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-18137) Replication gets stuck for empty WALs


     [ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Poon updated HBASE-18137:
---------------------------------
    Attachment: HBASE-18137.branch-1.3.v2.patch

Added a check for 0 length.
We now dump the current file and move on only if we get an EOFException, the length is 0, and there are WALs in the queue behind this one (we assume that means the current WAL is closed and therefore really contains no data).

> Replication gets stuck for empty WALs
> -------------------------------------
>
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>
>         Attachments: HBASE-18137.branch-1.3.v1.patch, HBASE-18137.branch-1.3.v2.patch
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty. However, intermittent DFS issues may cause empty WALs to be created (without the PWAL magic) and a WAL roll to happen without a regionserver crash.
> This leaves recovered queues with empty WALs in the middle, which causes replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log
> WARN regionserver.ReplicationSource: - Got:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>         at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty, but there were other WALs in the recovered queue which were newer and non-empty.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
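
The skip condition described in the update comment (an EOFException on open, a file length of 0, and at least one more WAL queued behind the current one) can be sketched roughly as below. This is only an illustration of the stated rule, not the code in the attached patch: the class name EmptyWalSkipSketch, the method shouldSkipWal, and its parameters are all hypothetical, and the real check presumably lives somewhere in ReplicationSource.

```java
import java.io.EOFException;
import java.io.IOException;

public class EmptyWalSkipSketch {

    /**
     * Hypothetical helper (not from the patch): decides whether a WAL that
     * failed to open can be dropped from the replication queue.
     *
     * @param openFailure      exception raised while opening the WAL reader
     * @param walLength        length of the WAL file in bytes
     * @param walsQueuedBehind number of WALs queued after the current one
     */
    static boolean shouldSkipWal(IOException openFailure, long walLength, int walsQueuedBehind) {
        // Condition 1: the reader hit end-of-file while opening the WAL.
        boolean hitEof = openFailure instanceof EOFException;
        // Condition 2: the file is completely empty.
        boolean isEmpty = walLength == 0L;
        // Condition 3: newer WALs exist, so this one is assumed to be closed.
        boolean hasNewerWals = walsQueuedBehind > 0;
        return hitEof && isEmpty && hasNewerWals;
    }

    public static void main(String[] args) {
        // Empty WAL in the middle of a recovered queue: safe to skip.
        System.out.println(shouldSkipWal(new EOFException(), 0L, 2));        // true
        // Empty WAL that is last in the queue: may still be written to.
        System.out.println(shouldSkipWal(new EOFException(), 0L, 0));        // false
        // Non-empty WAL: an EOF here may indicate real data loss, do not skip.
        System.out.println(shouldSkipWal(new EOFException(), 128L, 2));      // false
        // Some other I/O failure: not covered by this rule.
        System.out.println(shouldSkipWal(new IOException("other"), 0L, 2));  // false
    }
}
```

Requiring all three conditions together keeps the rule conservative: an empty WAL that is still last in the queue may simply not have been written to yet, so it is left in place.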