Date: Wed, 3 Sep 2014 04:17:51 +0000 (UTC)
From: "Andrew Purtell (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Resolved] (HBASE-11868) Data loss in hlog when the hdfs is unavailable

     [ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-11868.
------------------------------------
       Resolution: Fixed
    Fix Version/s:     (was: 0.98.7)
                   0.98.6

Applied v2 patch and pushed to 0.98. Tests pass locally.

> Data loss in hlog when the hdfs is unavailable
> ----------------------------------------------
>
>                 Key: HBASE-11868
>                 URL: https://issues.apache.org/jira/browse/HBASE-11868
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.5
>            Reporter: Liu Shaohui
>            Assignee: Liu Shaohui
>            Priority: Blocker
>             Fix For: 0.98.6
>
>         Attachments: HBASE-11868-0.98-v1.diff, HBASE-11868-0.98-v2.diff
>
>
> When using the new thread model in HBase 0.98, we found a bug that may cause data loss when HDFS is unavailable.
> When writing WAL edits to the hlog in doMiniBatchMutation of HRegion, the code first calls appendNoSync to write the edits to the hlog and then calls sync with the txid.
> Assume that the txid of the current write is 10, syncedTillHere in the hlog is 9, and failedTxid is 0. When HDFS is unavailable, the AsyncWriter or AsyncSyncer fails to append the edits or to sync, and then updates syncedTillHere to 10 and failedTxid to 10.
> When sync is then called with txid 10, failedTxid is never checked: txid already equals syncedTillHere, so the body of the while loop is skipped. The client thinks the write succeeded, but the data has only been written to the memstore, not to the hlog. If the regionserver goes down before the memstore is flushed, the data is lost.
> See: FSHLog.java #1348
> {code}
>   // sync all transactions upto the specified txid
>   private void syncer(long txid) throws IOException {
>     synchronized (this.syncedTillHere) {
>       while (this.syncedTillHere.get() < txid) {
>         try {
>           this.syncedTillHere.wait();
>           if (txid <= this.failedTxid.get()) {
>             assert asyncIOE != null :
>               "current txid is among(under) failed txids, but asyncIOE is null!";
>             throw asyncIOE;
>           }
>         } catch (InterruptedException e) {
>           LOG.debug("interrupted while waiting for notification from AsyncNotifier");
>         }
>       }
>     }
>   }
> {code}
> We can fix this issue by moving the comparison of txid and failedTxid outside the while block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
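
For illustration only, the fix proposed in the quoted description (checking failedTxid after the wait loop instead of inside it) might look roughly like the sketch below. SyncerSketch and markFailed are hypothetical stand-ins for the relevant FSHLog state and for the AsyncWriter/AsyncSyncer failure path. The interrupt handling is also an assumption of this sketch. The actual v2 patch may differ.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the FSHLog fields named in the report; not the HBase class.
class SyncerSketch {
    // Values taken from the scenario in the report: synced up to 9, nothing failed yet.
    private final AtomicLong syncedTillHere = new AtomicLong(9);
    private final AtomicLong failedTxid = new AtomicLong(0);
    private volatile IOException asyncIOE;

    // Simulates a failed append/sync: both counters advance to txid and waiters are woken.
    void markFailed(long txid, IOException cause) {
        synchronized (syncedTillHere) {
            asyncIOE = cause;
            failedTxid.set(txid);
            syncedTillHere.set(txid);
            syncedTillHere.notifyAll();
        }
    }

    // Sync all transactions up to the specified txid.
    void syncer(long txid) throws IOException {
        synchronized (syncedTillHere) {
            while (syncedTillHere.get() < txid) {
                try {
                    syncedTillHere.wait();
                } catch (InterruptedException e) {
                    // Assumption of this sketch: surface the interrupt to the caller.
                    Thread.currentThread().interrupt();
                    throw new InterruptedIOException("interrupted while waiting for sync");
                }
            }
            // The check now runs even when the loop body was never entered,
            // i.e. when syncedTillHere had already reached txid because of a failure.
            if (txid <= failedTxid.get()) {
                IOException cause = asyncIOE;
                throw cause != null ? cause : new IOException("sync failed for txid " + txid);
            }
        }
    }
}
```

With this arrangement, calling markFailed(10, ...) followed by syncer(10) throws the stored exception rather than returning normally, so the failure reaches the caller instead of being silently acknowledged.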