Date: Sun, 6 Mar 2016 08:04:40 +0000 (UTC)
From: "Clara Xiong (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-15400) Using Multiple Output for Date Tiered Compaction

    [ https://issues.apache.org/jira/browse/HBASE-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182042#comment-15182042 ]

Clara Xiong commented on HBASE-15400:
-------------------------------------

It is possible that a major compaction includes n files but has to write out more than n files. This can happen especially when we switch from the default exploring compaction policy with a very large file. As for minor compaction, it should work for now. But if people change the behavior later, I am concerned this dependency may be lost.
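To illustrate how n input files can turn into more than n output files, here is a minimal Java sketch. It is not code from the patch: the class name `WindowSplitSketch`, the method `splitByWindow`, and the use of plain timestamps in place of cells are all assumptions made for illustration. A single large file whose data spans three windows ends up as three outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; none of these names exist in HBase.
class WindowSplitSketch {

    /**
     * Groups timestamps by the window they fall into.
     * Each entry of lowerBoundaries is the inclusive lower bound of one window;
     * including Long.MIN_VALUE guarantees that every timestamp has a window.
     * The returned map is keyed by window lower bound, one entry per output file.
     */
    public static SortedMap<Long, List<Long>> splitByWindow(
            List<Long> timestamps, List<Long> lowerBoundaries) {
        TreeSet<Long> bounds = new TreeSet<>(lowerBoundaries);
        TreeMap<Long, List<Long>> outputs = new TreeMap<>();
        for (long ts : timestamps) {
            // Largest boundary that is <= ts, i.e. the window containing ts.
            Long lower = bounds.floor(ts);
            if (lower == null) {
                throw new IllegalArgumentException(
                    "timestamp " + ts + " is below the lowest boundary");
            }
            outputs.computeIfAbsent(lower, k -> new ArrayList<>()).add(ts);
        }
        return outputs;
    }

    public static void main(String[] args) {
        // One input "file" whose data spans three windows.
        List<Long> oneLargeFile = Arrays.asList(5L, 15L, 25L, 26L);
        List<Long> boundaries = Arrays.asList(Long.MIN_VALUE, 10L, 20L);
        SortedMap<Long, List<Long>> outputs = splitByWindow(oneLargeFile, boundaries);
        System.out.println(outputs.size()); // prints 3
    }
}
```

Windows that receive no data produce no entry in the map, so the number of outputs is bounded by the number of windows, not by the number of input files.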
Yes, I have something to say about passing the boundaries and files together. I'd like to store the window info for minor compactions so we don't have to recompute it. Please review my patch and let me know what you think.

I didn't change anything related to the compactor except the DateTieredStoreEngine. Almost all my work is on determining the files and boundaries to pass to the compactor. If you think it is OK to update DateTieredCompactor to take DateTieredCompactionRequest, we can remove the DateTieredCompactionContext and just use DefaultCompactionContext.

I updated and added test cases to make sure the logic for boundaries works. I did find bugs and fixed them. I left the testing of the compactor to you, since I am not sure whether you are done there yet. Please let me know.

> Using Multiple Output for Date Tiered Compaction
> ------------------------------------------------
>
>                 Key: HBASE-15400
>                 URL: https://issues.apache.org/jira/browse/HBASE-15400
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>             Fix For: 2.0.0
>
>         Attachments: HBASE-15400.patch
>
>
> When we compact, we can output multiple files along the current window boundaries. There are two use cases:
> 1. Major compaction: We want to output date tiered store files.
> 2. Bulk load files and the old file generated by major compaction before upgrading to DTCP.
> Pros:
> 1. Restore locality, process versioning, updates and deletes while maintaining the tiered layout.
> 2. The best way to fix a skewed layout.
>
> I am starting from a prototype of the date tiered file writer from HBASE-15389 and will upload a patch soon. I have to call out a few design decisions:
> 1. We only want to output the files along all windows for major compaction.
> 2. For minor compaction, we don't want to output too many files, which will remain around because of the current restriction of contiguous compaction by seq id.
> I will only output two files if all the files in the window are being combined: one for the data within the window and the other for the out-of-window tail. If any file in the window is excluded from compaction, only one file will be output from compaction. When the windows are promoted, the situation of out-of-order data will gradually improve.
> 3. We have to pass the boundaries with the list of store files as a complete time snapshot instead of in two separate calls, because the window layout is determined by the time the computation is called. So we will need a new type of compaction request.
> 4. Since we will assign the same seq id to all output files, we need to sort by maxTimestamp subsequently. Right now all compaction policies get the files sorted from StoreFileManager, which sorts by seq id and other criteria. I will use this order for DTCP only, to avoid impacting other compaction policies.
> 5. We need some cleanup of the current design of StoreEngine and CompactionPolicy.
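
The ordering in design decision 4 can be sketched roughly as below. This is an illustration under stated assumptions, not the patch itself: `FileInfo` is a hypothetical stand-in rather than the HBase StoreFile class, and the "other criteria" used by StoreFileManager are left out. Files are compared by seq id first, and files sharing a seq id (as the outputs of one compaction would) fall back to maxTimestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; FileInfo is not an HBase class.
class SeqIdOrderSketch {

    /** Minimal stand-in for the metadata of one store file. */
    static final class FileInfo {
        final String name;
        final long seqId;
        final long maxTimestamp;

        FileInfo(String name, long seqId, long maxTimestamp) {
            this.name = name;
            this.seqId = seqId;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /** Seq id first; maxTimestamp breaks ties between files with the same seq id. */
    static final Comparator<FileInfo> SEQ_ID_THEN_MAX_TS =
        Comparator.comparingLong((FileInfo f) -> f.seqId)
                  .thenComparingLong(f -> f.maxTimestamp);

    /** Returns the file names in the sketched order, leaving the input list untouched. */
    public static List<String> sortedNames(List<FileInfo> files) {
        List<FileInfo> copy = new ArrayList<>(files);
        copy.sort(SEQ_ID_THEN_MAX_TS);
        List<String> names = new ArrayList<>();
        for (FileInfo f : copy) {
            names.add(f.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // "a", "b" and "c" share seq id 7, as outputs of one compaction would.
        List<FileInfo> files = Arrays.asList(
            new FileInfo("c", 7L, 300L),
            new FileInfo("a", 7L, 100L),
            new FileInfo("old", 3L, 900L),
            new FileInfo("b", 7L, 200L));
        System.out.println(sortedNames(files)); // prints [old, a, b, c]
    }
}
```

Note that "old" sorts first despite having the largest maxTimestamp, because seq id still takes precedence; maxTimestamp only matters among files with equal seq ids.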