From: "Gary Helmling (JIRA)"
To: issues@hbase.apache.org
Date: Mon, 10 Oct 2016 21:25:20 +0000 (UTC)
Subject: [jira] [Commented] (HBASE-16788) Race in compacted file deletion between HStore close() and closeAndArchiveCompactedFiles()

    [ https://issues.apache.org/jira/browse/HBASE-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563563#comment-15563563 ]

Gary Helmling commented on HBASE-16788:
---------------------------------------

{quote}
If I understand the code correctly, it would take longer for
the close() to complete when concurrent CompactedHFilesDischargeHandler operation gets the archiveLock first. If this is not a concern, I am fine with your patch.
{quote}

It's true that close() may be blocked by the discharge chore thread if it is holding the archiveLock. But whether the work for archiving compacted HFiles is being done by the discharge thread or by close(), the same work needs to be done before close() can complete. So I don't expect this to appreciably change the time taken by close(). It just means that if close() is blocked by the discharger, it should be able to skip over the archive step once it gets to run.

> Race in compacted file deletion between HStore close() and closeAndArchiveCompactedFiles()
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-16788
>                 URL: https://issues.apache.org/jira/browse/HBASE-16788
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.3.0
>            Reporter: Gary Helmling
>            Assignee: Gary Helmling
>            Priority: Blocker
>         Attachments: 16788-suggest.v2, HBASE-16788.001.patch, HBASE-16788.002.patch, HBASE-16788_1.patch
>
>
> HBASE-13082 changed the way that compacted files are archived from being done inline on compaction completion to an async cleanup by the CompactedHFilesDischarger chore. It looks like the changes to HStore to support this introduced a race condition in the compacted HFile archiving.
> In the following sequence, we can wind up with two separate threads trying to archive the same HFiles, causing a regionserver abort:
> # compaction completes normally and the compacted files are added to {{compactedfiles}} in HStore's DefaultStoreFileManager
> # *threadA*: CompactedHFilesDischargeHandler runs in a RS executor service, calling closeAndArchiveCompactedFiles()
> ## obtains HStore readlock
> ## gets a copy of compactedfiles
> ## releases readlock
> # *threadB*: calls HStore.close() as part of region close
> ## obtains HStore writelock
> ## calls DefaultStoreFileManager.clearCompactedfiles(), getting a copy of same compactedfiles
> # *threadA*: calls HStore.removeCompactedfiles(compactedfiles)
> ## archives files in {{compactedfiles}} in HRegionFileSystem.removeStoreFiles()
> ## calls HStore.clearCompactedFiles()
> ## waits on write lock
> # *threadB*: continues with close()
> ## calls removeCompactedfiles(compactedfiles)
> ## calls HRegionFileSystem.removeStoreFiles() -> HFileArchiver.archiveStoreFiles()
> ## receives FileNotFoundException because the files have already been archived by threadA
> ## throws IOException
> # RS aborts
> I think the combination of fetching the compactedfiles list and removing the files needs to be covered by locking. Options I see are:
> * Modify HStore.closeAndArchiveCompactedFiles(): use writelock instead of readlock and move the call to removeCompactedfiles() inside the lock. This means the read operations will be blocked while the files are being archived, which is bad.
> * Synchronize closeAndArchiveCompactedFiles() and modify close() to call it instead of calling removeCompactedfiles() directly
> * Add a separate lock for compacted files removal and use in closeAndArchiveCompactedFiles() and close()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
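For illustration only, the third option in the description (a separate lock that covers both fetching the compactedfiles list and removing the files) might look roughly like the sketch below. This is not the attached patch and not HBase code: StoreSketch, archive(), raceOnce() and the in-memory "archived" set are hypothetical stand-ins, and only the archiveLock name is borrowed from the comment above. The sketch assumes both paths take archiveLock before the store lock, so the lock order is the same on both sides.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for HStore; names and structure are assumptions.
class StoreSketch {
    private final ReentrantReadWriteLock storeLock = new ReentrantReadWriteLock();
    // Separate lock covering "fetch list + archive + clear" as one unit.
    private final ReentrantLock archiveLock = new ReentrantLock();
    private final List<String> compactedFiles = new ArrayList<>();
    // Simulated archive: archiving the same file twice fails, mimicking the
    // FileNotFoundException from the sequence in the description.
    private final Set<String> archived = ConcurrentHashMap.newKeySet();

    StoreSketch(List<String> files) {
        compactedFiles.addAll(files);
    }

    // Path taken by the discharger thread (threadA in the description).
    void closeAndArchiveCompactedFiles() throws IOException {
        archiveLock.lock();
        try {
            List<String> copy;
            storeLock.readLock().lock();
            try {
                copy = new ArrayList<>(compactedFiles);
            } finally {
                storeLock.readLock().unlock();
            }
            archive(copy);
            storeLock.writeLock().lock();
            try {
                compactedFiles.removeAll(copy);
            } finally {
                storeLock.writeLock().unlock();
            }
        } finally {
            archiveLock.unlock();
        }
    }

    // Path taken by region close (threadB in the description).
    void close() throws IOException {
        archiveLock.lock();
        try {
            List<String> copy;
            storeLock.writeLock().lock();
            try {
                copy = new ArrayList<>(compactedFiles);
                compactedFiles.clear();
            } finally {
                storeLock.writeLock().unlock();
            }
            // Empty if the discharger already ran, so this step is skipped.
            archive(copy);
        } finally {
            archiveLock.unlock();
        }
    }

    private void archive(List<String> files) throws IOException {
        for (String f : files) {
            if (!archived.add(f)) {
                throw new FileNotFoundException(f + " was already archived");
            }
        }
    }

    int archivedCount() {
        return archived.size();
    }

    int pendingCount() {
        storeLock.readLock().lock();
        try {
            return compactedFiles.size();
        } finally {
            storeLock.readLock().unlock();
        }
    }

    // Runs both paths concurrently; returns the first failure, or null.
    static Throwable raceOnce(StoreSketch store) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread discharger = new Thread(() -> {
            try { store.closeAndArchiveCompactedFiles(); }
            catch (Throwable t) { failure.compareAndSet(null, t); }
        });
        Thread closer = new Thread(() -> {
            try { store.close(); }
            catch (Throwable t) { failure.compareAndSet(null, t); }
        });
        discharger.start();
        closer.start();
        try {
            discharger.join();
            closer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            failure.compareAndSet(null, e);
        }
        return failure.get();
    }

    public static void main(String[] args) {
        StoreSketch store = new StoreSketch(Arrays.asList("f1", "f2", "f3"));
        Throwable failure = raceOnce(store);
        System.out.println("archived=" + store.archivedCount() + " failure=" + failure);
    }
}
```

Whichever thread takes archiveLock first archives every file; the other then finds an empty list and has nothing left to archive, which matches the point in the comment that close() can skip the archive step once it gets to run. Reads are only blocked for the brief list copy and list update, not for the archiving itself.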