Date: Tue, 17 Jul 2012 17:09:34 +0000 (UTC)
From: "Travis Crawford (JIRA)"
To: hive-dev@hadoop.apache.org
Message-ID: <1663079839.64699.1342544974982.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1520141428.18752.1334257877546.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HIVE-2950) Hive should store the full table schema in partition storage descriptors

     [ https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Travis Crawford updated HIVE-2950:
----------------------------------

    Resolution: Not A Problem
        Status: Resolved  (was: Patch Available)

After looking at this further, this change is not actually needed.

The confusion arises from Hive having two sets of classes that represent the main objects (tables, partitions, ...). If you use the metastore.api classes, the fields are not available unless they are stored in the metastore. However, if you use the ql.metadata classes, Partition copies the table fields to the partition when they are empty. This works great for thrift-based tables.
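To make the difference concrete, here is a minimal sketch of that fallback. It is illustrative only, not the actual ql.metadata code: the thrift metastore.api types and getters are real, but the helper class and method are made up for this example.

{code}
// Sketch only: raw metastore.api readers see exactly what was stored in the
// partition storage descriptor; the ql.metadata wrapper effectively gives
// callers a fallback like this for free.
import java.util.List;

import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Partition;

public class PartitionSchemaFallback {

  // tableCols is the table schema however it was obtained (explicitly stored
  // columns, or columns reported by the serde). Fall back to it when the
  // partition storage descriptor was stored without any columns.
  public static List<FieldSchema> effectiveCols(List<FieldSchema> tableCols,
      Partition partition) {
    List<FieldSchema> cols = partition.getSd().getCols();
    if (cols == null || cols.isEmpty()) {
      return tableCols;
    }
    return cols;
  }
}
{code}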
> Hive should store the full table schema in partition storage descriptors
> -------------------------------------------------------------------------
>
>                 Key: HIVE-2950
>                 URL: https://issues.apache.org/jira/browse/HIVE-2950
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage descriptor when a partition is added. Currently only the columns stored in the table storage descriptor are copied; columns reported by the serde are not. Instead of copying the table storage descriptor columns into the partition columns, the full table schema should be copied.
> DETAILS
> This is a little long, but it is necessary to show three things: the current behavior when columns are listed explicitly, the behavior with HIVE-2941 patched in and serde-reported columns, and finally the behavior with this patch (the full table schema copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * The two manually defined fields are listed in the table storage descriptor.
> * Both fields are present in the partition storage descriptor.
> This works great because users who query for a partition can look at its storage descriptor and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by (part_dt string);
> hive> describe extended foo_test;
> OK
> name	string
> age	int
> part_dt	string
>
> Detailed Table Information	Table(tableName:foo_test, dbName:travis_test, owner:travis, createTime:1334256062, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/warehouse/travis_test.db/foo_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name	string
> age	int
> part_dt	string
>
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:foo_test, createTime:1334256131, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334256131})
> {code}
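> For reference, a rough programmatic equivalent of that describe for anyone reading the metastore directly. This is a sketch only: it assumes the metastore connection is configured through a hive-site.xml on the classpath, and it reuses the database, table, and partition names from the transcript above.
> {code}
> // Sketch only: read the partition schema through the thrift metastore client,
> // matching the "describe extended ... partition" output above.
> import java.util.Arrays;
>
> import org.apache.hadoop.hive.conf.HiveConf;
> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.FieldSchema;
> import org.apache.hadoop.hive.metastore.api.Partition;
>
> public class DescribePartition {
>   public static void main(String[] args) throws Exception {
>     HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>     Partition p = client.getPartition("travis_test", "foo_test",
>         Arrays.asList("20120331T000000Z"));
>     // For the explicitly defined table above this prints name, age and part_dt.
>     for (FieldSchema fs : p.getSd().getCols()) {
>       System.out.println(fs.getName() + "\t" + fs.getType());
>     }
>     client.close();
>   }
> }
> {code}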
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table whose schema is reported by the serde. Notice the following:
> * The table storage descriptor contains an empty list of columns. However, the table schema is available from the serde, which reflects on the serialization class.
> * The partition storage descriptor contains only a single "part_dt" column, copied from the table partition keys. The actual data columns are not present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name	struct	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array>>	from deserializer
> part_dt	string
>
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt	string
>
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257029, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257029})
> Time taken: 0.106 seconds
> hive>
> {code}
> PROPOSED BEHAVIOR
> I believe the correct thing to do is copy the full table schema (serde-reported columns + partition keys) into the partition storage descriptor, roughly as in the sketch below; the transcript that follows shows the resulting metadata. Notice the following:
> * The table storage descriptor still does not contain any columns, because they are reported by the serde.
> * The partition storage descriptor now contains the full table schema: the serde-reported columns plus the partition key.
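> A minimal sketch of that copy. This is illustrative only, not the attached HIVE-2950.D2769.1.patch: the helper class and method names are made up, the serde-reported columns are assumed to have been obtained elsewhere, and only the thrift metastore.api getters and setters are real.
> {code}
> // Sketch only: store the full table schema (serde-reported data columns plus
> // the partition keys) in the new partition's storage descriptor before it is
> // written to the metastore.
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.hive.metastore.api.FieldSchema;
> import org.apache.hadoop.hive.metastore.api.Partition;
> import org.apache.hadoop.hive.metastore.api.Table;
>
> public class CopyFullTableSchema {
>   public static void copyFullSchema(Table table, List<FieldSchema> serdeReportedCols,
>       Partition newPartition) {
>     List<FieldSchema> fullSchema = new ArrayList<FieldSchema>(serdeReportedCols);
>     fullSchema.addAll(table.getPartitionKeys());  // part_dt in the example above
>     newPartition.getSd().setCols(fullSchema);
>   }
> }
> {code}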
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;
> OK
> name	struct	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array>>	from deserializer
> part_dt	string
>
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.296 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> name	struct	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array>>	from deserializer
> part_dt	string
>
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257504, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:struct, comment:from deserializer), FieldSchema(name:id, type:int, comment:from deserializer), FieldSchema(name:email, type:string, comment:from deserializer), FieldSchema(name:phones, type:array>>, comment:from deserializer), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257504})
> Time taken: 0.133 seconds
> hive>
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira