hive-dev mailing list archives

From "Travis Crawford (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2950) Hive should store the full table schema in partition storage descriptors
Date Thu, 07 Jun 2012 21:44:23 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291338#comment-13291338
] 

Travis Crawford commented on HIVE-2950:
---------------------------------------

Updated the branch against trunk and reran the tests.

https://travis.ci.cloudbees.com/job/HIVE-2950_partition_full_schema/1/

There were some test failures in CI, so I ran these locally and they passed:

{code}
ant clean package test -Dtestcase=TestCliDriver -Dqfile=create_view_partitioned.q
ant clean package test -Dtestcase=TestNegativeCliDriver -Dqfile=create_or_replace_view1.q
ant clean package test -Dtestcase=TestNegativeCliDriver -Dqfile=create_or_replace_view2.q
ant clean package test -Dtestcase=TestHiveServerSessions
{code}
                
> Hive should store the full table schema in partition storage descriptors
> ------------------------------------------------------------------------
>
>                 Key: HIVE-2950
>                 URL: https://issues.apache.org/jira/browse/HIVE-2950
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage descriptor when adding a partition. Currently, only columns stored in the table storage descriptor are copied; columns reported by the serde are not. Instead of copying only the table storage descriptor columns into the partition columns, the full table schema should be copied.
> DETAILS
> This is a little long, but it is necessary to show three things: the current behavior when explicitly listing columns, the behavior with HIVE-2941 patched in and serde-reported columns, and finally the behavior with this patch (the full table schema copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * the two manually-defined fields for the table are listed in the table storage descriptor.
> * both fields are present in the partition storage descriptor
> This works great because users who query for a partition can look at its storage descriptor
and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by (part_dt
string);
> hive> describe extended foo_test;
> OK
> name	string	
> age	int	
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:foo_test, dbName:travis_test, owner:travis,
createTime:1334256062, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt,
type:string, comment:null)], location:hdfs://foo.com/warehouse/travis_test.db/foo_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:,
secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, viewOriginalText:null, viewExpandedText:null,
tableType:EXTERNAL_TABLE)	
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name	string	
> age	int	
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test,
tableName:foo_test, createTime:1334256131, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt,
type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:,
secondaryRegions:[]), parameters:{transient_lastDdlTime=1334256131})	
> {code}
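The copy behavior shown above can be sketched as follows. This is a hypothetical simplified Python model of the metastore objects, not Hive's actual Java code: when a partition is added, the table storage descriptor's column list is duplicated into the partition storage descriptor.

```python
from dataclasses import dataclass


@dataclass
class FieldSchema:
    name: str
    type: str


@dataclass
class StorageDescriptor:
    cols: list


# Current behavior (simplified): the partition storage descriptor
# receives a copy of whatever columns the table storage descriptor
# already holds.
def make_partition_sd(table_sd: StorageDescriptor) -> StorageDescriptor:
    return StorageDescriptor(cols=list(table_sd.cols))


table_sd = StorageDescriptor(cols=[
    FieldSchema("name", "string"),
    FieldSchema("age", "int"),
    FieldSchema("part_dt", "string"),
])
partition_sd = make_partition_sd(table_sd)
print([c.name for c in partition_sd.cols])  # ['name', 'age', 'part_dt']
```

Because the manually declared columns already live in the table storage descriptor, the copy happens to produce a complete partition schema in this case.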
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table whose serde reports the schema. Notice the following:
> * The table storage descriptor contains an empty list of columns. However, the table schema is available from the serde, which reflects on the serialization class.
> * The partition storage descriptor does contain a single "part_dt" column that was copied
from the table partition keys. The actual data columns are not present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string)
row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name	struct<first_name:string,last_name:string>	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis,
createTime:1334256942, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt,
type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location
'hdfs://foo.com/foo/2012/03/31/00'; 
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test,
tableName:person_test, createTime:1334257029, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt,
type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257029})
> Time taken: 0.106 seconds
> hive> 
> {code}
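The failure mode above can be sketched in a simplified model (hypothetical Python, assuming the partition keys are appended separately from the storage-descriptor column copy): with a schema-reporting serde, the table storage descriptor holds no columns, so the partition ends up with only the partition key.

```python
# Simplified model of the HIVE-2941 case. The serde reports the data
# columns at query time, so they are never written into the table
# storage descriptor -- and therefore never copied to the partition.
serde_reported = [
    ("name", "struct<first_name:string,last_name:string>"),
    ("id", "int"),
    ("email", "string"),
    ("phones", "array<struct<number:string,type:struct<value:int>>>"),
]
table_sd_cols = []                       # empty: schema comes from the serde
partition_keys = [("part_dt", "string")]

# Assumed copy logic: table SD columns plus partition keys.
partition_sd_cols = table_sd_cols + partition_keys
print([name for name, _ in partition_sd_cols])  # ['part_dt'] -- data columns lost
```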
> PROPOSED BEHAVIOR
> I believe the correct thing to do is copy the full table schema (serde-reported columns
+ partition keys) into the partition storage descriptor. Notice the following:
> * Table storage descriptor does not contain any columns, because they are reported by
the serde.
> * Partition storage descriptor now contains the full table schema: the serde-reported columns plus the partition keys.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string)
row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;
> OK
> name	struct<first_name:string,last_name:string>	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis,
createTime:1334257489, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt,
type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location
'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.296 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> name	struct<first_name:string,last_name:string>	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test,
tableName:person_test, createTime:1334257504, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:struct<first_name:string,last_name:string>, comment:from deserializer), FieldSchema(name:id,
type:int, comment:from deserializer), FieldSchema(name:email, type:string, comment:from deserializer),
FieldSchema(name:phones, type:array<struct<number:string,type:struct<value:int>>>,
comment:from deserializer), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257504})
> Time taken: 0.133 seconds
> hive> 
> {code}
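The proposed behavior, in the same hypothetical simplified model: build the partition column list from the full table schema (serde-reported columns plus partition keys) rather than from the table storage descriptor alone.

```python
# Proposed fix, sketched: the partition storage descriptor gets the
# full table schema instead of the (possibly empty) table SD columns.
serde_reported = [
    ("name", "struct<first_name:string,last_name:string>"),
    ("id", "int"),
    ("email", "string"),
    ("phones", "array<struct<number:string,type:struct<value:int>>>"),
]
partition_keys = [("part_dt", "string")]

partition_sd_cols = serde_reported + partition_keys
print([name for name, _ in partition_sd_cols])
# ['name', 'id', 'email', 'phones', 'part_dt'] -- matches the partition
# storage descriptor shown in the output above
```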

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
