Date: Tue, 17 Jul 2012 17:09:34 +0000 (UTC)
From: "Travis Crawford (JIRA)"
To: hive-dev@hadoop.apache.org
Message-ID: <1663079839.64699.1342544974982.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1520141428.18752.1334257877546.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HIVE-2950) Hive should store the full table schema in partition storage descriptors

     [ https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Travis Crawford updated HIVE-2950:
----------------------------------

    Resolution: Not A Problem
        Status: Resolved  (was: Patch Available)

After looking at this further, this change is not actually needed.

The confusion arises from Hive having two sets of classes that represent the main objects (tables, partitions, ...). If you use the metastore.api classes, the fields are not available unless they are stored in the metastore. However, if you use the ql.metadata classes, Partition copies the table fields to the partition when they are empty. This works great for thrift-based tables.
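To make the difference concrete, here is a minimal sketch of that fallback. It is illustrative only, not the actual ql.metadata code: the thrift metastore.api types and getters are real, but the helper class and method are made up for this example.

{code}
// Sketch only: raw metastore.api readers see exactly what was stored in the
// partition storage descriptor; the ql.metadata wrapper effectively gives
// callers a fallback like this for free.
import java.util.List;

import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Partition;

public class PartitionSchemaFallback {

  // tableCols is the table schema however it was obtained (explicitly stored
  // columns, or columns reported by the serde). Fall back to it when the
  // partition storage descriptor was stored without any columns.
  public static List<FieldSchema> effectiveCols(List<FieldSchema> tableCols,
      Partition partition) {
    List<FieldSchema> cols = partition.getSd().getCols();
    if (cols == null || cols.isEmpty()) {
      return tableCols;
    }
    return cols;
  }
}
{code}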
> Hive should store the full table schema in partition storage descriptors
> -------------------------------------------------------------------------
>
>                 Key: HIVE-2950
>                 URL: https://issues.apache.org/jira/browse/HIVE-2950
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage descriptor when a partition is added. Currently only the columns stored in the table storage descriptor are copied; columns reported by the serde are not. Instead of copying the table storage descriptor columns into the partition columns, the full table schema should be copied.
> DETAILS
> This is a little long, but it is necessary to show three things: the current behavior when columns are listed explicitly, the behavior with HIVE-2941 patched in and serde-reported columns, and finally the behavior with this patch (the full table schema copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * The two manually defined fields are listed in the table storage descriptor.
> * Both fields are present in the partition storage descriptor.
> This works great because users who query for a partition can look at its storage descriptor and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by (part_dt string);
> hive> describe extended foo_test;
> OK
> name	string
> age	int
> part_dt	string
>
> Detailed Table Information	Table(tableName:foo_test, dbName:travis_test, owner:travis, createTime:1334256062, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/warehouse/travis_test.db/foo_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name	string
> age	int
> part_dt	string
>
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:foo_test, createTime:1334256131, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334256131})
> {code}
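> For reference, a rough programmatic equivalent of that describe for anyone reading the metastore directly. This is a sketch only: it assumes the metastore connection is configured through a hive-site.xml on the classpath, and it reuses the database, table, and partition names from the transcript above.
> {code}
> // Sketch only: read the partition schema through the thrift metastore client,
> // matching the "describe extended ... partition" output above.
> import java.util.Arrays;
>
> import org.apache.hadoop.hive.conf.HiveConf;
> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.FieldSchema;
> import org.apache.hadoop.hive.metastore.api.Partition;
>
> public class DescribePartition {
>   public static void main(String[] args) throws Exception {
>     HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>     Partition p = client.getPartition("travis_test", "foo_test",
>         Arrays.asList("20120331T000000Z"));
>     // For the explicitly defined table above this prints name, age and part_dt.
>     for (FieldSchema fs : p.getSd().getCols()) {
>       System.out.println(fs.getName() + "\t" + fs.getType());
>     }
>     client.close();
>   }
> }
> {code}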
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table whose schema is reported by the serde. Notice the following:
> * The table storage descriptor contains an empty list of columns. However, the table schema is available from the serde, which reflects on the serialization class.
> * The partition storage descriptor contains only a single "part_dt" column, copied from the table partition keys. The actual data columns are not present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name	struct	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array>>	from deserializer
> part_dt	string
>
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt	string
>
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257029, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257029})
> Time taken: 0.106 seconds
> hive>
> {code}
> PROPOSED BEHAVIOR
> I believe the correct thing to do is copy the full table schema (serde-reported columns + partition keys) into the partition storage descriptor, roughly as in the sketch below; the transcript that follows shows the resulting metadata. Notice the following:
> * The table storage descriptor still does not contain any columns, because they are reported by the serde.
> * The partition storage descriptor now contains the full table schema: the serde-reported columns plus the partition key.
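> A minimal sketch of that copy. This is illustrative only, not the attached HIVE-2950.D2769.1.patch: the helper class and method names are made up, the serde-reported columns are assumed to have been obtained elsewhere, and only the thrift metastore.api getters and setters are real.
> {code}
> // Sketch only: store the full table schema (serde-reported data columns plus
> // the partition keys) in the new partition's storage descriptor before it is
> // written to the metastore.
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.hive.metastore.api.FieldSchema;
> import org.apache.hadoop.hive.metastore.api.Partition;
> import org.apache.hadoop.hive.metastore.api.Table;
>
> public class CopyFullTableSchema {
>   public static void copyFullSchema(Table table, List<FieldSchema> serdeReportedCols,
>       Partition newPartition) {
>     List<FieldSchema> fullSchema = new ArrayList<FieldSchema>(serdeReportedCols);
>     fullSchema.addAll(table.getPartitionKeys());  // part_dt in the example above
>     newPartition.getSd().setCols(fullSchema);
>   }
> }
> {code}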
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;
> OK
> name	struct	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array>>	from deserializer
> part_dt	string
>
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.296 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> name	struct	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array>>	from deserializer
> part_dt	string
>
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257504, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:struct, comment:from deserializer), FieldSchema(name:id, type:int, comment:from deserializer), FieldSchema(name:email, type:string, comment:from deserializer), FieldSchema(name:phones, type:array>>, comment:from deserializer), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257504})
> Time taken: 0.133 seconds
> hive>
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira