Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 544C619575 for ; Thu, 14 Apr 2016 04:00:29 +0000 (UTC) Received: (qmail 3212 invoked by uid 500); 14 Apr 2016 04:00:29 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 3161 invoked by uid 500); 14 Apr 2016 04:00:29 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 3136 invoked by uid 99); 14 Apr 2016 04:00:29 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 14 Apr 2016 04:00:29 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id F1EF22C1F56 for ; Thu, 14 Apr 2016 04:00:28 +0000 (UTC) Date: Thu, 14 Apr 2016 04:00:28 +0000 (UTC) From: "Hadoop QA (JIRA)" To: issues@hbase.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HBASE-15236) Inconsistent cell reads over multiple bulk-loaded HFiles MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240549#comment-15240549 ] Hadoop QA commented on HBASE-15236: ----------------------------------- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 29s {color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s {color} | {color:green} master passed with JDK v1.8.0 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s {color} | {color:green} master passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 5m 21s {color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 5s {color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s {color} | {color:green} master passed with JDK v1.8.0 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s {color} | {color:green} master passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 52s {color} | {color:green} the patch passed with JDK v1.8.0 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 52s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 5m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 29m 51s {color} | {color:green} Patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s {color} | {color:green} the patch passed with JDK v1.8.0 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 104m 6s {color} | {color:red} hbase-server in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 159m 50s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | Timed out junit tests | org.apache.hadoop.hbase.snapshot.TestMobFlushSnapshotFromClient | \\ \\ || Subsystem || Report/Notes || | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12798623/HBASE-15236-master-v8.patch | | JIRA Issue | HBASE-15236 | | Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile | | uname | Linux pietas.apache.org 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh | | git revision | master / 5a7c8dc | | Default Java | 1.7.0_79 | | Multi-JDK versions | /home/jenkins/tools/java/jdk1.8.0:1.8.0 /usr/local/jenkins/java/jdk1.7.0_79:1.7.0_79 | | findbugs | v3.0.0 | | unit | https://builds.apache.org/job/PreCommit-HBASE-Build/1406/artifact/patchprocess/patch-unit-hbase-server.txt | | unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/1406/artifact/patchprocess/patch-unit-hbase-server.txt | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/1406/testReport/ | | modules | C: hbase-server U: hbase-server | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/1406/console | | Powered by | Apache Yetus 0.2.1 http://yetus.apache.org | This message was automatically generated. > Inconsistent cell reads over multiple bulk-loaded HFiles > -------------------------------------------------------- > > Key: HBASE-15236 > URL: https://issues.apache.org/jira/browse/HBASE-15236 > Project: HBase > Issue Type: Bug > Reporter: Appy > Assignee: Appy > Attachments: HBASE-15236-master-v2.patch, HBASE-15236-master-v3.patch, HBASE-15236-master-v4.patch, HBASE-15236-master-v5.patch, HBASE-15236-master-v6.patch, HBASE-15236-master-v7.patch, HBASE-15236-master-v8.patch, HBASE-15236.patch, TestWithSingleHRegion.java, tables_data.zip > > > If there are two bulkloaded hfiles in a region with same seqID, same timestamps and duplicate keys*, get and scan may return different values for a key. Not sure how this would happen, but one of our customer uploaded a dataset with 2 files in a single region and both having same bulk load timestamp. These files are small ~50M (I couldn't find any setting for max file size that could lead to 2 files). The range of keys in two hfiles are overlapping to some extent, but not fully (so the two files are because of region merge). > In such a case, depending on file sizes (used in StoreFile.Comparators.SEQ_ID), we may get different values for the same cell (say "r", "cf:50") depending on what we call: get "r" "cf:50" or get "r" "cf:". > To replicate this, i ran ImportTsv twice with 2 different csv files but same timestamp (-Dimporttsv.timestamp=1). Collected two resulting hfiles in a single directory and loaded with LoadIncrementalHFiles. Here are the commands. > {noformat} > //CSV files should be in hdfs:///user/hbase/tmp > sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,c:1,c:2,c:3,c:4,c:5,c:6 -Dimporttsv.separator='|' -Dimporttsv.timestamp=1 -Dimporttsv.bulk.output=hdfs:///user/hbase/1 t tmp > {noformat} > tables_data.zip contains three tables: t, t2, and t3. > To load these tables and test with them, use the attached TestWithSingleHRegion.java. (Change ROOT_DIR and TABLE_NAME variables appropriately) > Hfiles for table 't': > 1) 07922ac90208445883b2ddc89c424c89 : length = 1094: KVs = (c:5, 55) (c:6, 66) > 2) 143137fa4fa84625b4e8d1d84a494f08 : length = 1192: KVs = (c:1, 1) (c:2, 2) (c:3, 3) (c:4, 4) (c:5, 5) (c:6, 6) > Get returns 5 where as Scan returns 55. > On the other hand if I make the first hfile, the one with values 55 and 66 bigger in size than second hfile, then: > Get returns 55 where as Scan returns 55. > This can be seen in table 't2'. > Weird things: > 1. Get consistently returns values from larger hfile. > 2. Scan consistently returns 55. > Reason: > Explanation 1: We add StoreFileScanners to heap by iterating over list of store files which is kept sorted in DefaultStoreFileManger using StoreFile.Comparators.SEQ_ID. It sorts the file first by increasing seq id, then by decreasing filesize. So our larger files sizes are always in the start. > Explanation 2: In KeyValueHeap.next(), if both current and this.heap.peek() scanners (type is StoreFileScanner in this case) point to KVs which compare as equal, we put back the current scanner into heap, and call pollRealKV() to get new one. While inserting, KVScannerComparator is used which compares the two (one already in heap and the one being inserted) as equal and inserts it after the existing one. \[1\] > Say s1 is scanner over first hfile and s2 over second. > 1. We scan with s2 to get values 1, 2, 3, 4 > 2. see that both scanners have c:5 as next key, put current scanner s2 back in heap, take s1 out > 4. return 55, advance s1 to key c:6 > 5. compare c:6 to c:5, put back s2 in heap since it has larger key, take out s1 > 6. ignore it's value (there is code for that too), advance s2 to c:6 > Go back to step 2 with both scanners having c:6 as next key > This is wrong because if we add a key between c:4 and c:5 to first hfile, say (c:44, foo) which makes s1 as the 'current' scanner when we hit step 2, then we'll get the values - 1, 2, 3, 4, foo, 5, 6. > (try table t3) > Fix: > Assign priority ids to StoreFileScanners when inserting them into heap, latest hfiles get higher priority, and use them in KVComparator instead of just seq id. > \[1\] [PriorityQueue.siftUp()|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/PriorityQueue.java#586] -- This message was sent by Atlassian JIRA (v6.3.4#6332)