Return-Path:
X-Original-To: apmail-hbase-issues-archive@www.apache.org
Delivered-To: apmail-hbase-issues-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by minotaur.apache.org (Postfix) with SMTP id 3C80511180
	for ; Fri, 5 Sep 2014 04:25:45 +0000 (UTC)
Received: (qmail 44863 invoked by uid 500); 5 Sep 2014 04:25:45 -0000
Delivered-To: apmail-hbase-issues-archive@hbase.apache.org
Received: (qmail 44821 invoked by uid 500); 5 Sep 2014 04:25:45 -0000
Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Delivered-To: mailing list issues@hbase.apache.org
Received: (qmail 44809 invoked by uid 99); 5 Sep 2014 04:25:45 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
	by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 05 Sep 2014 04:25:44 +0000
Date: Fri, 5 Sep 2014 04:25:44 +0000 (UTC)
From: "Anoop Sam John (JIRA)"
To: issues@hbase.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Issue Comment Deleted] (HBASE-11772) Bulk load mvcc and seqId issues with native hfiles
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/HBASE-11772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anoop Sam John updated HBASE-11772:
-----------------------------------
    Comment: was deleted

(was: {quote}The above change implies that there may be more than one seqId in the filename. Can you give an example ?
bq.e.g a hfile named 'abc_SeqId_10_' can exist in HBase and be DistCp out, then later bulk loaded into another HBase instance.
{quote}
If such a file is there, it could give a wrong judgement at below place and say the file is bulk loaded!!!! A matter of worry?
{code}
+      String fileName = this.getPath().getName();
+      int startPos = fileName.indexOf("SeqId_");
+      if (startPos != -1) {
+        bulkLoadedHFile = true;
+      }
{code}
)

> Bulk load mvcc and seqId issues with native hfiles
> --------------------------------------------------
>
>                 Key: HBASE-11772
>                 URL: https://issues.apache.org/jira/browse/HBASE-11772
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.5
>            Reporter: Jerry He
>            Assignee: Jerry He
>            Priority: Critical
>             Fix For: 0.99.0, 1.0.0, 2.0.0, 0.98.7
>
>         Attachments: HBASE-11772-0.98.patch, HBASE-11772-master-v1.patch
>
>
> There are mvcc and seqId issues when bulk loading native hfiles -- meaning hfiles that are a direct file copy-out from hbase, not produced by an HFileOutputFormat job.
> There are differences between these two types of hfiles.
> Native hfiles can have a non-zero MAX_MEMSTORE_TS_KEY value and non-zero mvcc values in cells.
> Native hfiles also have MAX_SEQ_ID_KEY.
> Native hfiles do not have BULKLOAD_TIME_KEY.
> Here are a couple of problems I observed when bulk loading native hfiles.
> 1.  Cells in newly bulk loaded hfiles can be invisible to scan.
> It is easy to re-create.
> Bulk load a native hfile that has a larger mvcc value in cells, e.g. 10.
> If the current readpoint when initiating a scan is less than 10, the cells in the new hfile are skipped and thus become invisible.
> We don't reset the readpoint of a region after bulk load.
> 2. The current StoreFile.isBulkLoadResult() is implemented as:
> {code}
> return metadataMap.containsKey(BULKLOAD_TIME_KEY)
> {code}
> which does not detect bulk loaded native hfiles.
> 3. Another observed problem is possible data loss during log recovery.
> It is similar to HBASE-10958 reported by [~jdcryans]. The re-create steps are borrowed from HBASE-10958.
> 1) Create an empty table
> 2) Put one row in it (let's say it gets seqid 1)
> 3) Bulk load one native hfile with a large seqId (e.g. 100). The native hfile can be obtained by copying it out from an existing table.
> 4) Kill the region server that holds the table's region.
> Scan the table once the region is made available again. The first row, at seqid 1, will be missing since the HFile with seqid 100 makes us believe that everything that came before it was flushed.
> Problem 3 is probably related to problem 2. We will be ok if we get the appended seqId during bulk load instead of the 100 from inside the file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
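
The concern raised in the deleted comment (a plain indexOf("SeqId_") check treats any file whose name contains that substring as bulk loaded) can be illustrated with a small standalone sketch. This is not HBase code: the class BulkLoadNameCheck, the helper parseTrailingSeqId, the "_SeqId_<digits>_" suffix pattern, and the double-suffix name 'abc_SeqId_10__SeqId_25_' are all assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BulkLoadNameCheck {
    // Hypothetical stricter rule: only a "_SeqId_<digits>_" suffix at the very
    // end of the file name counts. This pattern is an assumption, not HBase's.
    private static final Pattern TRAILING_SEQ_ID = Pattern.compile("_SeqId_(\\d+)_$");

    // Mirrors the check quoted in the patch: any occurrence of "SeqId_" matches.
    static boolean naiveIsBulkLoaded(String fileName) {
        return fileName.indexOf("SeqId_") != -1;
    }

    // Returns the seqId from the last "_SeqId_<digits>_" suffix, or -1 if the
    // name does not end with such a suffix.
    static long parseTrailingSeqId(String fileName) {
        Matcher m = TRAILING_SEQ_ID.matcher(fileName);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        // Illustrative names only; the double-suffix form is a guess at what a
        // re-loaded file could look like.
        String[] names = {"abc", "abc_SeqId_10_", "abc_SeqId_10__SeqId_25_", "SeqId_notes"};
        for (String name : names) {
            System.out.println(name
                + " naive=" + naiveIsBulkLoaded(name)
                + " trailingSeqId=" + parseTrailingSeqId(name));
        }
    }
}
```

A name such as 'SeqId_notes' passes the naive check but has no trailing suffix, and for a name carrying two suffixes the anchored pattern picks up the last one. Whether either behavior is what the patch should do is exactly the open question in the comment.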