hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2950) Hive should store the full table schema in partition storage descriptors
Date Mon, 02 Jul 2012 20:45:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405317#comment-13405317
] 

Ashutosh Chauhan commented on HIVE-2950:
----------------------------------------

Travis,
I liked your first patch better then this one. Semantics is that when you add partition, you
store *current* table's schema in Partition's storage descriptor (which is what your first
patch is doing). Your second approach will return the table schema for Partition at the read
time, by which time table's schema might have changed. I will test and commit your first patch.

                
> Hive should store the full table schema in partition storage descriptors
> ------------------------------------------------------------------------
>
>                 Key: HIVE-2950
>                 URL: https://issues.apache.org/jira/browse/HIVE-2950
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage descriptor when
adding a partition. Currently only columns stored in the table storage descriptor are copied
- columns that are reported by the serde are not copied. Instead of copying the table storage
descriptor columns into the partition columns, the full table schema should be copied.
> DETAILS
> This is a little long but is necessary to show 3 things: current behavior when explicitly
listing columns, behavior with HIVE-2941 patched in and serde reported columns, and finally
the behavior with this patch (full table schema copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * the two manually-defined fields defined for the table are listed in the table storage
descriptor.
> * both fields are present in the partition storage descriptor
> This works great because users who query for a partition can look at its storage descriptor
and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by (part_dt
string);
> hive> describe extended foo_test;
> OK
> name	string	
> age	int	
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:foo_test, dbName:travis_test, owner:travis,
createTime:1334256062, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt,
type:string, comment:null)], location:hdfs://foo.com/warehouse/travis_test.db/foo_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:,
secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, viewOriginalText:null, viewExpandedText:null,
tableType:EXTERNAL_TABLE)	
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name	string	
> age	int	
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test,
tableName:foo_test, createTime:1334256131, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt,
type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:,
secondaryRegions:[]), parameters:{transient_lastDdlTime=1334256131})	
> {code}
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table when the serde reports the schema.
Notice the following:
> * The table storage descriptor contains an empty list of columns. However, the table
schema is available from the serde reflecting on the serialization class.
> * The partition storage descriptor does contain a single "part_dt" column that was copied
from the table partition keys. The actual data columns are not present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string)
row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name	struct<first_name:string,last_name:string>	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis,
createTime:1334256942, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt,
type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location
'hdfs://foo.com/foo/2012/03/31/00'; 
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test,
tableName:person_test, createTime:1334257029, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt,
type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257029})

> Time taken: 0.106 seconds
> hive> 
> {code}
> PROPOSED BEHAVIOR
> I believe the correct thing to do is copy the full table schema (serde-reported columns
+ partition keys) into the partition storage descriptor. Notice the following:
> * Table storage descriptor does not contain any columns, because they are reported by
the serde.
> * Partition storage descriptor now contains both the serde-reported schema, and full
table schema.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string)
row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;                                             
                                                                                       OK
                                                                                         
                                                                            name    struct<first_name:string,last_name:string>
     from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis,
createTime:1334257489, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt,
type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location
'hdfs://foo.com/foo/2012/03/31/00';
> OK                                                                                  
                                                                                    Time taken:
0.296 seconds                                        
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');    
                                                                                       OK
                                                                                         
                                                                            name    struct<first_name:string,last_name:string>
     from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test,
tableName:person_test, createTime:1334257504, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:struct<first_name:string,last_name:string>, comment:from deserializer), FieldSchema(name:id,
type:int, comment:from deserializer), FieldSchema(name:email, type:string, comment:from deserializer),
FieldSchema(name:phones, type:array<struct<number:string,type:struct<value:int>>>,
comment:from deserializer), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}),
bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257504})

> Time taken: 0.133 seconds
> hive> 
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

Mime
View raw message