Date: Sat, 1 Sep 2012 08:58:07 +1100 (NCT)
From: "Ted Yu (JIRA)"
To: issues@hbase.apache.org
Message-ID: <402018741.25196.1346450287814.JavaMail.jiratomcat@arcas>
In-Reply-To: <590170163.37549.1345586918326.JavaMail.jiratomcat@arcas>
Subject: [jira] [Updated] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

     [ https://issues.apache.org/jira/browse/HBASE-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6630:
--------------------------

    Attachment: 6590-seq-id-bulk-load.txt

Amit's patch.

> Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6630
>                 URL: https://issues.apache.org/jira/browse/HBASE-6630
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Amitanand Aiyer
>            Assignee: Amitanand Aiyer
>            Priority: Minor
>         Attachments: 6590-seq-id-bulk-load.txt
>
>
> Currently, bulk loaded files are not assigned a sequence number. Thus, they can only be used to import historical data. There are cases where we want to bulk load "current data", but the bulk load mechanism does not support this, as the bulk loaded files are always sorted behind the non-bulk-loaded HFiles. Assigning a sequence ID to bulk loaded files should solve this issue.
> StoreFiles within a store are sorted by sequenceId. SequenceId is a monotonically increasing number that accompanies every edit written to the WAL. For entries that update the same cell, we would like the later edit to win. This comparison is accomplished using memstoreTS at the KV level, and sequenceId at the StoreFile level (to order scanners in the KeyValueHeap).
> Bulk loaded files are generated outside of HBase/RegionServer, so they do not have a sequenceId written in the file. This causes HBase to lose track of the point in time when the bulk loaded file was imported into HBase, resulting in behavior that *only* supports viewing bulk loaded files as files back-filling data from the beginning of time.
> By assigning a sequence number to the file, we can allow the bulk loaded file to fit in where we want: either at the "current time" or at the "beginning of time". The latter is the default, to maintain backward compatibility.
> Design approach:
> Store files keep track of the sequence ID in the trailer. Since we do not wish to edit/rewrite the bulk loaded file upon import, we will encode the assigned sequenceId into the file name.
> The filename regex is updated accordingly. If a sequenceId is encoded in the file name, it will be used as the sequenceId for the file. If none is found, the sequenceId will be considered 0 (as per the default, backward-compatible behavior).
> To enable clients to request the pre-existing behavior, the command line utility allows for 2 ways to import bulk loaded files: to assign or not assign a sequence number.
>     If a sequence number is assigned, the file will be imported with the "current sequence ID".
>     If a sequence number is not assigned, the file will be treated as if it were back-filling old data, from the beginning of time.
> Compaction behavior:
>     With the current compaction algorithm, bulk loaded files that back-fill data to the beginning of time can cause a compaction storm, converting every minor compaction to a major compaction. To address this, these files are excluded from minor compaction, based on a config param (enabled for the messages use case).
>     Since bulk loaded files that are not back-filling data do not cause this issue, they are not excluded from minor compactions by the config parameter. This is also required to ensure that there are no holes in the set of files selected for compaction, which is necessary to preserve the order of KV comparison before and after compaction.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
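
The design approach and compaction behavior described in the issue can be sketched roughly as follows. This is an illustration only, not the code in the attached patch: the `_SeqId_<n>_` suffix, the `StoreFile` class, and the `exclude_backfill_bulk_loads` flag are hypothetical names chosen for the example, and the issue text does not specify the real file name format or config key.

```python
import re
from dataclasses import dataclass

# Hypothetical naming convention (an assumption, not taken from the issue):
# a bulk loaded file may end in "_SeqId_<digits>_" to carry its assigned ID.
SEQ_ID_SUFFIX = re.compile(r"_SeqId_(\d+)_$")


@dataclass(frozen=True)
class StoreFile:
    name: str
    bulk_loaded: bool = False
    # Files written by the region server record their sequence ID in the trailer.
    trailer_sequence_id: int = 0

    @property
    def sequence_id(self) -> int:
        if not self.bulk_loaded:
            return self.trailer_sequence_id
        # Bulk loaded files are never rewritten, so the ID comes from the name.
        match = SEQ_ID_SUFFIX.search(self.name)
        # No encoded ID means "beginning of time" (backward-compatible default).
        return int(match.group(1)) if match else 0


def order_store_files(files):
    """Sort store files so that files with later sequence IDs come last."""
    return sorted(files, key=lambda f: f.sequence_id)


def minor_compaction_candidates(files, exclude_backfill_bulk_loads):
    """Optionally skip bulk loaded files that back-fill data (sequence ID 0).

    Bulk loaded files with an assigned sequence ID are always kept, so the
    selected set has no holes in sequence-ID order.
    """
    ordered = order_store_files(files)
    if not exclude_backfill_bulk_loads:
        return ordered
    return [f for f in ordered if not (f.bulk_loaded and f.sequence_id == 0)]
```

For example, a bulk loaded file named `new_SeqId_12_` would sort after a flushed file whose trailer sequence ID is 10, while a bulk loaded file named `old` would sort first and could be skipped by minor compaction when the flag is set.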