hive-user mailing list archives

From Lars Francke
Subject Problem with NULLs in HBase "leaking" into following rows
Date Thu, 21 Jun 2012 15:42:55 GMT

We're using the HBase integration in Hive 0.9 and are running into
problems when there are rows with NULL values (which map to a
nonexistent cell in HBase).

We're using a UDF[1] but see the same behavior without it.

As an example, we have a table with just two rows.

In HBase Shell:

create 'hive_hbase_test', 'test'
put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
put 'hive_hbase_test', '2', 'test:c1', 'c1-2'

In Hive:

DROP TABLE IF EXISTS hive_hbase_test;
CREATE EXTERNAL TABLE hive_hbase_test (
  id int,
  c1 string,
  c2 string,
  c3 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
TBLPROPERTIES("" = "hive_hbase_test");

hive> select * from hive_hbase_test;
1	c1-1	c2-1	c3-1
2	c1-2	NULL	NULL

hive> select c1 from hive_hbase_test;

hive> select c1, c2 from hive_hbase_test;
c1-1	c2-1
c1-2	NULL

So far everything is correct but now:

hive> select c1, c2, c2 from hive_hbase_test;
c1-1	c2-1	c2-1
c1-2	NULL	c2-1

Selecting c2 twice returns the correct NULL for the first occurrence,
but for the second occurrence we actually get the value from the
previous row.

hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2

The same queries return correct results with a "native" HDFS-backed table.
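
To make the symptom easier to reason about, here is a small standalone
Java sketch of one way such output could arise. It is purely a
hypothetical model and is not taken from the HBase handler source: it
assumes the reader reuses one lazily parsed field object per column
across rows, and that a missing cell returns NULL on the first read
without clearing the cached value. All names in it (LeakyRowDemo,
LazyField, select, table) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the symptom; NOT the actual Hive HBase handler code.
class LeakyRowDemo {

    /** One lazily parsed field object that is reused for every row. */
    static class LazyField {
        private String cached;        // value parsed most recently
        private boolean initialized;  // true once read for the current row
        private final boolean buggy;  // true: stale cache survives a missing cell

        LazyField(boolean buggy) { this.buggy = buggy; }

        /** Called when the reader advances to the next row. */
        void startNewRow() { initialized = false; }

        String get(Map<String, String> cells, String column) {
            if (!initialized) {
                initialized = true;
                String raw = cells.get(column);
                if (raw == null) {
                    // Assumed bug: NULL is returned, but the cache is left alone.
                    if (!buggy) cached = null;
                    return null;
                }
                cached = raw;
            }
            return cached; // later reads in the same row come from the cache
        }
    }

    /** Evaluates "select <columns>" over the rows, reusing field objects. */
    static List<List<String>> select(List<Map<String, String>> rows,
                                     List<String> columns, boolean buggy) {
        Map<String, LazyField> fields = new HashMap<>();
        for (String c : columns) fields.putIfAbsent(c, new LazyField(buggy));
        List<List<String>> out = new ArrayList<>();
        for (Map<String, String> cells : rows) {
            for (LazyField f : fields.values()) f.startNewRow();
            List<String> line = new ArrayList<>();
            for (String c : columns) line.add(fields.get(c).get(cells, c));
            out.add(line);
        }
        return out;
    }

    /** The two example rows from the HBase shell session above. */
    static List<Map<String, String>> table() {
        Map<String, String> r1 = new HashMap<>();
        r1.put("c1", "c1-1"); r1.put("c2", "c2-1"); r1.put("c3", "c3-1");
        Map<String, String> r2 = new HashMap<>();
        r2.put("c1", "c1-2");
        return Arrays.asList(r1, r2);
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("c1", "c2", "c2");
        System.out.println(select(table(), cols, true));  // stale value leaks
        System.out.println(select(table(), cols, false)); // cache cleared
    }
}
```

With the buggy flag set, the second row comes out as c1-2, NULL, c2-1,
the same shape as the Hive output above; with the cache cleared on a
missing cell, both c2 columns are NULL. Whether the handler really
behaves like this model is an open question.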

We then added logging to our UDF (it takes a year, month and day, any
of which might be null) and tested it against a simple two-row table.

hive> SELECT id, year, month, parseDate(year, month, day) FROM

First row (data in HBase: 1997-1-1):
deferred: [1997] - convertedObject: [1997]
deferred: [1] - convertedObject: [1]
deferred: [1] - convertedObject: [1]
Year: [1997], Month: [1], Day: [1]

Second row (data in HBase: 2006-null-null):
deferred: [2006] - convertedObject: [2006]
deferred: [1] - convertedObject: [1]
deferred: [1] - convertedObject: [null]
Year: [2006], Month: [1], Day: [null]

I know this looks very confusing, and I hope I haven't overdone it with
the examples, but this seems like a rather serious problem with the
HBase integration: values from previous rows are "leaking" into NULL
values in following rows. We're not 100% sure that we aren't doing
something wrong, but I don't see what we could be doing wrong here. I'll
open an issue if no one has an idea what's going on. I tried looking at
the HBase handler code but was confused by it; I will try again tomorrow.

Thanks for bearing with me.


[1] I would very much appreciate a review of our usage of
DeferredObjects etc.:
