Date: Thu, 4 Dec 2014 22:28:13 +0000 (UTC)
From: "Hadoop QA (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-12565) Race condition in HRegion.batchMutate() causes partial data to be written when region closes

[ https://issues.apache.org/jira/browse/HBASE-12565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234722#comment-14234722 ]

Hadoop QA commented on HBASE-12565:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12685147/hbase-12565.patch
against master branch at commit 04444299ab8bf69618d7e07a6ec7071ce9234d9d.
ATTACHMENT ID: 12685147

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:red}-1 checkstyle{color}. The applied patch generated 2073 checkstyle errors (more than the master's current 2072 errors).

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100:
+   * A version of getRowLock(byte[], boolean) to use when a region operation has already been started
+  private void waitForCounter(MetricsWALSource source, String metricName, long expectedCount) throws InterruptedException {
+      fail(String.format("Timed out waiting for '%s' >= '%s', currentCount=%s", metricName, expectedCount, currentCount));

{color:green}+1 site{color}. The mvn site goal succeeds with this patch.

{color:red}-1 core tests{color}. The patch failed these unit tests:

{color:red}-1 core zombie tests{color}.
There are 1 zombie test(s):
	at org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testPutAndCheckAndPutInParallel(TestAtomicOperation.java:556)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11935//artifact/patchprocess/checkstyle-aggregate.html
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/11935//console

This message is automatically generated.

> Race condition in HRegion.batchMutate() causes partial data to be written when region closes
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-12565
>                 URL: https://issues.apache.org/jira/browse/HBASE-12565
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance, regionserver
>    Affects Versions: 2.0.0, 0.98.6
>            Reporter: Scott Fines
>         Attachments: hbase-12565.patch
>
>
> The following sequence of events can occur in HRegion's batchMutate() call:
> 1. The caller invokes HRegion.batchMutate() with a batch of N>1 records.
> 2. batchMutate() acquires the region lock in startRegionOperation(), then calls doMiniBatchMutation().
> 3. doMiniBatchMutation() acquires one row lock.
> 4. The region closes.
> 5. doMiniBatchMutation() attempts to acquire a second row lock.
> When this happens, the row-lock acquisition also attempts to acquire the region lock, which fails because the region is closing. At this stage doMiniBatchMutation() stops processing further rows, BUT it WILL write data for the rows whose locks have already been acquired, and it advances the index in MiniBatchOperationInProgress. Then, after doMiniBatchMutation() returns successfully, batchMutate() loops around a second time and attempts AGAIN to acquire the region lock, which now fails because the region is closing. A NotServingRegionException is thrown back to the caller.
> Thus, there is a race condition in which partial data can be written while a region is closing: the caller receives an exception even though part of the batch has already been written.
> The main problem stems from where startRegionOperation() is called in batchMutate() and doMiniBatchMutation():
> 1. batchMutate() re-acquires the region lock on each iteration of its loop, so some writes can succeed while later ones fail.
> 2. getRowLock() attempts to acquire the region lock once for each row, which allows doMiniBatchMutation() to terminate early; this forces batchMutate() to run multiple iterations, which triggers condition 1.
> There appear to be two parts to the solution:
> 1. Open an internal path so that doMiniBatchMutation() can acquire row locks without checking for region closure. This also brings a significant performance improvement for large batch mutations, because the region lock is no longer acquired once per row.
> 2. Move the startRegionOperation() call out of the loop in batchMutate() so that multiple iterations of doMiniBatchMutation() cannot cause the operation to fail partway through.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
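The race in steps 1 to 5 can be reproduced with a toy, single-threaded Java model. Everything in it is an illustrative assumption: `ToyRegion`, the `afterRowLock` hook and the simplified method bodies are hypothetical and only borrow the method names from the description (startRegionOperation(), getRowLock(), doMiniBatchMutation(), batchMutate()); this is not HBase's HRegion code. The hook marks the region as closing after the first row lock, standing in for step 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model of the HBASE-12565 race (pre-fix shape). Hypothetical code that
// only mirrors the method names from the JIRA description.
public class BatchMutateRaceSketch {

    static class NotServingRegionException extends RuntimeException {
        NotServingRegionException(String msg) { super(msg); }
    }

    static class ToyRegion {
        final ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();
        volatile boolean closing = false;
        final List<String> written = new ArrayList<>();
        // Hypothetical test hook: runs after each successful row lock so the
        // region can "close" between steps 3 and 5.
        Runnable afterRowLock = () -> { };

        void startRegionOperation() {
            if (closing) {
                throw new NotServingRegionException("region is closing");
            }
            regionLock.readLock().lock();
        }

        void closeRegionOperation() {
            regionLock.readLock().unlock();
        }

        // Pre-fix shape: every row lock re-enters startRegionOperation().
        void getRowLock(String row) {
            startRegionOperation();
            try {
                // (a real implementation would lock `row` here)
            } finally {
                closeRegionOperation();
            }
        }

        // Locks rows from `start` until one lock fails, writes every row it
        // did lock, and returns the index of the first unwritten row.
        int doMiniBatchMutation(List<String> rows, int start) {
            int end = start;
            while (end < rows.size()) {
                try {
                    getRowLock(rows.get(end));
                } catch (NotServingRegionException e) {
                    break; // stop early, but keep the rows already locked
                }
                end++;
                afterRowLock.run();
            }
            written.addAll(rows.subList(start, end)); // the partial write
            return end;
        }

        // Pre-fix shape: the region check is repeated on every iteration.
        void batchMutate(List<String> rows) {
            int index = 0;
            while (index < rows.size()) {
                startRegionOperation();
                try {
                    index = doMiniBatchMutation(rows, index);
                } finally {
                    closeRegionOperation();
                }
            }
        }
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        region.afterRowLock = () -> region.closing = true; // step 4: region closes
        try {
            region.batchMutate(List.of("row1", "row2", "row3"));
        } catch (NotServingRegionException e) {
            System.out.println("caller saw: " + e.getMessage());
        }
        System.out.println("rows written: " + region.written);
    }
}
```

Running main prints `caller saw: region is closing` followed by `rows written: [row1]`: the second iteration of batchMutate() rejects the call, yet row1 was already written by the first mini-batch.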
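The two proposed changes can be sketched in the same kind of toy Java model. Assumptions: `getRowLockInternal()` is a hypothetical name for the internal row-lock path of part 1 (the patch's real method name and signature are not shown in this thread), and the hypothetical `MAX_ROWS_PER_MINI_BATCH` limit forces batchMutate() to loop more than once, so part 2 is actually exercised. As before, this is illustrative code, not the HBase patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model of the two proposed fixes for HBASE-12565. Hypothetical code,
// not the committed HBase change.
public class BatchMutateFixSketch {

    static class NotServingRegionException extends RuntimeException {
        NotServingRegionException(String msg) { super(msg); }
    }

    static class ToyRegion {
        static final int MAX_ROWS_PER_MINI_BATCH = 2; // hypothetical limit
        final ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();
        volatile boolean closing = false;
        final List<String> written = new ArrayList<>();
        // Hypothetical test hook: runs after each row lock so the region can
        // be marked as closing in the middle of a batch.
        Runnable afterRowLock = () -> { };

        void startRegionOperation() {
            if (closing) {
                throw new NotServingRegionException("region is closing");
            }
            regionLock.readLock().lock();
        }

        void closeRegionOperation() {
            regionLock.readLock().unlock();
        }

        // Part 1: row-lock path for callers that have already started a
        // region operation, so it does not re-check for region closure.
        void getRowLockInternal(String row) {
            // (a real implementation would lock `row` here)
        }

        // Processes at most MAX_ROWS_PER_MINI_BATCH rows per call.
        int doMiniBatchMutation(List<String> rows, int start) {
            int end = start;
            while (end < rows.size() && end - start < MAX_ROWS_PER_MINI_BATCH) {
                getRowLockInternal(rows.get(end));
                end++;
                afterRowLock.run();
            }
            written.addAll(rows.subList(start, end));
            return end;
        }

        // Part 2: the region state is checked once, before the loop, and the
        // read lock is held for the whole batch. In this model a closer that
        // needs the write lock would have to wait until it is released.
        void batchMutate(List<String> rows) {
            startRegionOperation();
            try {
                int index = 0;
                while (index < rows.size()) {
                    index = doMiniBatchMutation(rows, index);
                }
            } finally {
                closeRegionOperation();
            }
        }
    }

    public static void main(String[] args) {
        // Region starts closing after the first row lock: the batch completes.
        ToyRegion region = new ToyRegion();
        region.afterRowLock = () -> region.closing = true;
        region.batchMutate(List.of("row1", "row2", "row3"));
        System.out.println("rows written: " + region.written);

        // Region already closing before the call: nothing is written.
        ToyRegion closed = new ToyRegion();
        closed.closing = true;
        try {
            closed.batchMutate(List.of("row1", "row2", "row3"));
        } catch (NotServingRegionException e) {
            System.out.println("caller saw: " + e.getMessage());
        }
        System.out.println("rows written: " + closed.written);
    }
}
```

In this model the closing check happens exactly once per batch, so a batch is either rejected up front with NotServingRegionException (nothing written) or completed across all of its mini-batches (`rows written: [row1, row2, row3]`), even though the region was marked as closing after the first row lock.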