Date: Fri, 22 Jun 2012 12:28:42 +0000 (UTC)
From: "Lars Francke (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Message-ID: <746825210.43808.1340368122445.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1321655161.43807.1340368002639.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Updated] (HIVE-3179) HBase Handler doesn't handle NULLs properly
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


     [ https://issues.apache.org/jira/browse/HIVE-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Francke updated HIVE-3179:
-------------------------------

    Description: 
We found a quite severe issue in the HBase Handler which actually means that Hive potentially returns incorrect data if a column has NULL values in HBase (which means the cell doesn't even exist)

In HBase Shell:
{noformat}
create 'hive_hbase_test', 'test'
put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
{noformat}

In Hive:
{noformat}
DROP TABLE IF EXISTS hive_hbase_test;
CREATE EXTERNAL TABLE hive_hbase_test (
  id int,
  c1 string,
  c2 string,
  c3 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key#s,test:c1#s,test:c2#s,test:c3#s")
TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");

hive> select * from hive_hbase_test;
OK
1	c1-1	c2-1	c3-1
2	c1-2	NULL	NULL

hive> select c1 from hive_hbase_test;
c1-1
c1-2

hive> select c1, c2 from hive_hbase_test;
c1-1	c2-1
c1-2	NULL
{noformat}

So far everything is correct but now:
{noformat}
hive> select c1, c2, c2 from hive_hbase_test;
c1-1	c2-1	c2-1
c1-2	NULL	c2-1
{noformat}

Selecting c2 twice works the first time but the second time we actually get the value from the previous row.
{noformat}
hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2
{noformat}

We've narrowed this down to an early initialization of {{fieldsInited\[fieldID] = true}} in {{LazyHBaseRow#uncheckedGetField}} and we'll try to provide a patch which surely needs review.

  was:
We found a quite severe issue in the HBase Handler which actually means that Hive potentially returns incorrect data if a column has NULL values in HBase (which means the cell doesn't even exist)

In HBase Shell:
{noformat}
create 'hive_hbase_test', 'test'
put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
{noformat}

In Hive:
{noformat}
DROP TABLE IF EXISTS hive_hbase_test;
CREATE EXTERNAL TABLE hive_hbase_test (
  id int,
  c1 string,
  c2 string,
  c3 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key#s,test:c1#s,test:c2#s,test:c3#s")
TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");

hive> select * from hive_hbase_test;
OK
1	c1-1	c2-1	c3-1
2	c1-2	NULL	NULL

hive> select c1 from hive_hbase_test;
c1-1
c1-2

hive> select c1, c2 from hive_hbase_test;
c1-1	c2-1
c1-2	NULL
{noformat}

So far everything is correct but now:
{noformat}
hive> select c1, c2, c2 from hive_hbase_test;
c1-1	c2-1	c2-1
c1-2	NULL	c2-1
{noformat}

Selecting c2 twice works the first time but the second time we actually get the value from the previous row.
{noformat}
hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2
{noformat}

We've narrowed this down to an early initialization of {{fieldsInited[fieldID] = true;}} in {{LazyHBaseRow#uncheckedGetField}} and we'll try to provide a patch which surely needs review.

> HBase Handler doesn't handle NULLs properly
> -------------------------------------------
>
>                 Key: HIVE-3179
>                 URL: https://issues.apache.org/jira/browse/HIVE-3179
>             Project: Hive
>          Issue Type: Bug
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>            Reporter: Lars Francke
>            Priority: Critical
>
> We found a quite severe issue in the HBase Handler which actually means that Hive potentially returns incorrect data if a column has NULL values in HBase (which means the cell doesn't even exist)
> In HBase Shell:
> {noformat}
> create 'hive_hbase_test', 'test'
> put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
> put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
> put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
> put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
> {noformat}
> In Hive:
> {noformat}
> DROP TABLE IF EXISTS hive_hbase_test;
> CREATE EXTERNAL TABLE hive_hbase_test (
>   id int,
>   c1 string,
>   c2 string,
>   c3 string
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" =
> ":key#s,test:c1#s,test:c2#s,test:c3#s")
> TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");
> hive> select * from hive_hbase_test;
> OK
> 1	c1-1	c2-1	c3-1
> 2	c1-2	NULL	NULL
> hive> select c1 from hive_hbase_test;
> c1-1
> c1-2
> hive> select c1, c2 from hive_hbase_test;
> c1-1	c2-1
> c1-2	NULL
> {noformat}
> So far everything is correct but now:
> {noformat}
> hive> select c1, c2, c2 from hive_hbase_test;
> c1-1	c2-1	c2-1
> c1-2	NULL	c2-1
> {noformat}
> Selecting c2 twice works the first time but the second time we
> actually get the value from the previous row.
> {noformat}
> hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
> c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
> c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2
> {noformat}
> We've narrowed this down to an early initialization of {{fieldsInited\[fieldID] = true}} in {{LazyHBaseRow#uncheckedGetField}} and we'll try to provide a patch which surely needs review.

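To make the suspected mechanism easier to follow, here is a minimal, self-contained sketch of how setting an "inited" flag too early can leak a value from the previous row. This is not the actual Hive source: the class {{LazyRowSketch}}, its fields, and the helper {{row}} are hypothetical names invented for illustration, and the real {{LazyHBaseRow}} may differ in detail. The sketch only assumes what the report describes: per-field objects are reused across rows, and the flag is set before it is known whether the cell exists.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a lazily deserialized row; NOT the real LazyHBaseRow.
class LazyRowSketch {
    private final String[] columns;
    private final String[] cached;        // per-field cache, reused across rows
    private final boolean[] fieldsInited; // reset for every new row
    private final boolean buggy;
    private Map<String, String> cells = new HashMap<>();

    LazyRowSketch(boolean buggy, String... columns) {
        this.buggy = buggy;
        this.columns = columns;
        this.cached = new String[columns.length];
        this.fieldsInited = new boolean[columns.length];
    }

    // Point the object at a new row. Only the flags are cleared, not the cache.
    void init(Map<String, String> newCells) {
        this.cells = newCells;
        Arrays.fill(fieldsInited, false);
    }

    String getField(int fieldID) {
        return buggy ? getFieldBuggy(fieldID) : getFieldFixed(fieldID);
    }

    private String getFieldBuggy(int fieldID) {
        if (!fieldsInited[fieldID]) {
            fieldsInited[fieldID] = true; // set before we know the cell exists
            String value = cells.get(columns[fieldID]);
            if (value == null) {
                return null; // early return: cached[fieldID] still holds the old row's value
            }
            cached[fieldID] = value;
        }
        return cached[fieldID]; // a second access to a missing cell lands here
    }

    private String getFieldFixed(int fieldID) {
        if (!fieldsInited[fieldID]) {
            cached[fieldID] = cells.get(columns[fieldID]); // overwrite, even with null
            fieldsInited[fieldID] = true;
        }
        return cached[fieldID];
    }

    // Hypothetical helper: build a row from alternating column/value arguments.
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        LazyRowSketch r = new LazyRowSketch(true, "c1", "c2");
        r.init(row("c1", "c1-1", "c2", "c2-1"));
        System.out.println(r.getField(0) + "\t" + r.getField(1) + "\t" + r.getField(1));
        r.init(row("c1", "c1-2")); // c2 cell does not exist in this row
        System.out.println(r.getField(0) + "\t" + r.getField(1) + "\t" + r.getField(1));
    }
}
```

In this model the buggy variant reproduces the shape of the {{select c1, c2, c2}} result above: the first read of the missing c2 yields null, the second read returns the value cached from row 1. The fixed variant is just one possible way to avoid that in the sketch and is not meant to anticipate the actual patch.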
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira