hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pala M Muthaia (JIRA)" <>
Subject [jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
Date Wed, 09 Apr 2014 19:10:16 GMT


Pala M Muthaia commented on HIVE-6131:

Yes, one case where partition column is different than table column is the type of an existing
column is changed (e.g: from string to int) at the table level, after a partition is created.
So column1 is string on partition, but int on table.

That is the cause of failure for the unit test 'org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat11'
you see above.

So we can't alter the partition schema to match table schema, because underlying data in LazyBinaryColumnarSerde
was written to adhere to old schema. If you just change the schema, but not the data, the
data will be read incorrectly.

Yes serdes can handle additional columns by treating them as null valued. However, we are
trying to address a more complex case: e.g: Table and partition has 3 columns. We add one
more column at table level. Then at the partition, we append (not overwrite) data with 4th
column too. Now the partition has some data with 3 columns, but some data with 4 columns.
LazySimpleSerde can work in this scenario, but LazyBinaryColumnarSerde doesn't.  

> New columns after table alter result in null values despite data
> ----------------------------------------------------------------
>                 Key: HIVE-6131
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.11.0, 0.12.0, 0.13.0
>            Reporter: James Vaughan
>            Priority: Minor
>         Attachments: HIVE-6131.1.patch
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding columns to tables
with Partitions using 'REPLACE COLUMNS'.  I dug through the Jira a little bit and didn't see
anything for it so hopefully this isn't just noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to that partition,
it doesn't seem to recognize the extra data that actually exists in HDFS- as in, returns NULL
values on the new column despite having the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) partitioned by (day
> 2.  Create a simple file on the system with a couple of entries, something like "hi"
and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH '<FILEDIR>'
OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS (col1 string,
col2 string);
> 6.  Edit your file and add a second column using the default separator (ctrl+v, then
ctrl+a in Vim) and add two more entries, such as "hi3" on the first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hi	NULL	2014-01-02
> hi2	NULL	2014-01-02
> This is despite the fact that there is data in the file stored by the partition in HDFS.
> Let me know if you need any other information.  The only workaround for me currently
is to drop partitions for any I'm replacing data in and THEN reupload the new data file.
> Thanks,
> -James

This message was sent by Atlassian JIRA

View raw message