hive-dev mailing list archives

From "Lars Francke (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-3179) HBase Handler doesn't handle NULLs properly
Date Fri, 22 Jun 2012 13:55:42 GMT

     [ https://issues.apache.org/jira/browse/HIVE-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Francke updated HIVE-3179:
-------------------------------

    Attachment: HIVE-3179.1.patch

The attached patch fixes the problem and also changes a unit test so that it actually tests this
behavior. The unit test fails if our fix to {{LazyHBaseRow}} is not applied.

We're not sure if this is the best way to fix this problem, as it circumvents the optimization
provided by the {{fieldsInited}} field. Ideally, instead of returning null on an empty HBase
cell, this would insert some kind of marker, but an empty ByteArrayRef is interpreted not
as NULL but as an empty value (which makes sense).

In short: This fixes the bug at the cost of some performance for NULL (non-existing) fields
in HBase.
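
To illustrate the mechanism, here is a minimal sketch of how a per-row "inited" flag that is
set before the lookup can leak the previous row's value. This is a simplified model, not Hive's
code: {{LazyRowSketch}}, {{LazyRow}}, {{markBeforeLookup}} and {{readC2TwiceOnSecondRow}} are
hypothetical names, and plain strings stand in for the lazy field objects. It assumes, as the
behavior in the report suggests, that cached field values are reused from row to row.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a lazily evaluated row; not Hive's actual code.
class LazyRowSketch {

    static class LazyRow {
        private final String[] columns;
        private final String[] fields;          // cached values, reused from row to row
        private final boolean[] fieldsInited;   // per-row "already looked up" flags
        private final boolean markBeforeLookup; // true models the buggy ordering
        private Map<String, String> cells = new HashMap<>();

        LazyRow(String[] columns, boolean markBeforeLookup) {
            this.columns = columns;
            this.fields = new String[columns.length];
            this.fieldsInited = new boolean[columns.length];
            this.markBeforeLookup = markBeforeLookup;
        }

        // Point the row at new cells; the flags are reset, the cached values are not.
        void init(Map<String, String> newCells) {
            this.cells = newCells;
            Arrays.fill(fieldsInited, false);
        }

        String getField(int id) {
            if (!fieldsInited[id]) {
                if (markBeforeLookup) {
                    fieldsInited[id] = true; // set too early: a missing cell now counts as inited
                }
                String value = cells.get(columns[id]);
                if (value == null) {
                    return null; // missing cell; fields[id] still holds the previous row's value
                }
                fields[id] = value;
                fieldsInited[id] = true;
            }
            return fields[id];
        }
    }

    // Reads c2 twice on row 1, then twice on row 2, and returns the two row 2 reads.
    static String[] readC2TwiceOnSecondRow(boolean markBeforeLookup) {
        LazyRow row = new LazyRow(new String[] {"c1", "c2", "c3"}, markBeforeLookup);

        Map<String, String> row1 = new HashMap<>();
        row1.put("c1", "c1-1");
        row1.put("c2", "c2-1");
        row1.put("c3", "c3-1");
        row.init(row1);
        row.getField(1);
        row.getField(1);

        Map<String, String> row2 = new HashMap<>();
        row2.put("c1", "c1-2"); // c2 and c3 do not exist for this row
        row.init(row2);
        return new String[] {row.getField(1), row.getField(1)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(readC2TwiceOnSecondRow(true)));  // [null, c2-1]
        System.out.println(Arrays.toString(readC2TwiceOnSecondRow(false))); // [null, null]
    }
}
```

In the second variant every read of a missing cell repeats the lookup, which corresponds to the
performance cost for NULL fields mentioned above.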
                
> HBase Handler doesn't handle NULLs properly
> -------------------------------------------
>
>                 Key: HIVE-3179
>                 URL: https://issues.apache.org/jira/browse/HIVE-3179
>             Project: Hive
>          Issue Type: Bug
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>            Reporter: Lars Francke
>            Priority: Critical
>         Attachments: HIVE-3179.1.patch
>
>
> We found quite a severe issue in the HBase Handler: Hive potentially returns incorrect
> data if a column has NULL values in HBase (which means the cell doesn't even exist).
> In HBase Shell:
> {noformat}
> create 'hive_hbase_test', 'test'
> put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
> put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
> put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
> put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
> {noformat}
> In Hive:
> {noformat}
> DROP TABLE IF EXISTS hive_hbase_test;
> CREATE EXTERNAL TABLE hive_hbase_test (
>   id int,
>   c1 string,
>   c2 string,
>   c3 string
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" =
> ":key#s,test:c1#s,test:c2#s,test:c3#s")
> TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");
> hive> select * from hive_hbase_test;
> OK
> 1	c1-1	c2-1	c3-1
> 2	c1-2	NULL	NULL
> hive> select c1 from hive_hbase_test;
> c1-1
> c1-2
> hive> select c1, c2 from hive_hbase_test;
> c1-1	c2-1
> c1-2	NULL
> {noformat}
> So far everything is correct but now:
> {noformat}
> hive> select c1, c2, c2 from hive_hbase_test;
> c1-1	c2-1	c2-1
> c1-2	NULL	c2-1
> {noformat}
> Selecting c2 twice works the first time but the second time we
> actually get the value from the previous row.
> {noformat}
> hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
> c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
> c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2
> {noformat}
> We've narrowed this down to an early initialization of {{fieldsInited\[fieldID] = true}}
> in {{LazyHBaseRow#uncheckedGetField}} and we'll try to provide a patch, which surely needs
> review.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
