hive-dev mailing list archives

From "Sohan Jain" <sohanj...@fb.com>
Subject Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db
Date Fri, 05 Aug 2011 20:46:16 GMT


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 76
> > <https://reviews.apache.org/r/1183/diff/2/?file=26824#file26824line76>
> >
> >     is the CHARSET (latin1) the same as SDS? This will require the user's comments to be in latin1 which prevents UTF chars.

Yes, this charset matches the one used in the official Hive schema for 0.7.0.


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 206
> > <https://reviews.apache.org/r/1183/diff/2/?file=26824#file26824line206>
> >
> >     can you also add migration script for derby? we support derby as a default metastore RDBMS as well.

OK, will do.  I will add it in the diff after the next one.


> On 2011-07-25 06:46:04, Ning Zhang wrote:
> > trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java, line 1752
> > <https://reviews.apache.org/r/1183/diff/2/?file=26825#file26825line1752>
> >
> >     here do you check if the 'alter table' command changes the schema (columns definition)? If it just sets a table property, then you don't need to create a new ColumnDescriptor, right?
> >     
> >     Also if a table's schema got changed, a new CD will be created, but the old partition will still have the old CDs. When we query the old partition, do we use the old partition's CD or the table's CD?
> >     
> >     Also in the above case, when you run 'desc table partition <old_partition>', do you return the old partition's CD or the table's CD?

Good point; I should check whether the table columns have changed. I already do this when
altering partitions, and I have added the same check for tables in the next diff.
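
A minimal sketch of that check, written in Python rather than the Java of ObjectStore.java. It assumes columns are compared as ordered (name, type, comment) tuples; `columns_changed`, `alter_table`, and the dict layout are hypothetical names for illustration, not code from the patch:

```python
from typing import List, Optional, Tuple

# Hypothetical column representation: (name, type, comment).
Column = Tuple[str, str, Optional[str]]


def columns_changed(old_cols: List[Column], new_cols: List[Column]) -> bool:
    """Return True if the ordered column lists differ in any field."""
    return list(old_cols) != list(new_cols)


def alter_table(table: dict, new_cols: List[Column], new_props: dict,
                next_cd_id: int) -> int:
    """Apply an alter; allocate a new column descriptor only on a schema change.

    Returns the next unused CD_ID (a bookkeeping assumption of this sketch).
    """
    table["props"] = dict(new_props)
    if columns_changed(table["cols"], new_cols):
        table["cols"] = list(new_cols)
        table["cd_id"] = next_cd_id
        next_cd_id += 1
    return next_cd_id
```

With this shape, an alter that only sets a table property leaves the table's CD_ID untouched.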

If a table's schema changes, existing partition CDs are not updated.  If we ever grab the
partition object after the schema change, it will refer to its old CD, not the table's CD.
However, when querying tables on the CLI, we almost always use the table's set of columns.
E.g., if we did:
> create table test (a string) partitioned by (p1 string, p2 string);
> alter table test add partition(p1=1, p2=1);
> # populate the p1=1, p2=1 partition with some data now
> alter table test add columns (b string);
> select * from test where p1 = 1 and p2 = 1;

it'd use the table's latest schema; i.e., it would return the values of column 'a' and
all NULLs for column 'b'.
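
As a toy model of that behavior (not Hive's actual read path), the sketch below projects rows written under the old schema onto the table's current column list and fills columns the data lacks with None, standing in for NULL. `project_row` and the dict-per-row layout are assumptions made for illustration:

```python
from typing import Dict, List, Optional


def project_row(stored_row: Dict[str, str],
                table_columns: List[str]) -> List[Optional[str]]:
    """Read a stored row using the table's current columns.

    Columns absent from the stored data come back as None (NULL).
    """
    return [stored_row.get(col) for col in table_columns]


# Rows written to partition p1=1/p2=1 while the table only had column 'a'.
old_partition_rows = [{"a": "x"}, {"a": "y"}]

# Table schema after 'alter table test add columns (b string)'.
table_columns = ["a", "b"]

result = [project_row(row, table_columns) for row in old_partition_rows]
```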


- Sohan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1176
-----------------------------------------------------------


On 2011-07-22 05:30:29, Sohan Jain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/1183/
> -----------------------------------------------------------
> 
> (Updated 2011-07-22 05:30:29)
> 
> 
> Review request for hive, Ning Zhang and Paul Yang.
> 
> 
> Summary
> -------
> 
> This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.
> 
> The new schema can be described as follows:
> - CDS is a table corresponding to Column Descriptor objects.  Currently, it only stores a CD_ID.
> - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
> - SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.
> 
> During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor.  The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.
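
The migration step described above can be sketched roughly as follows. This is an in-memory Python model under stated assumptions, not the SQL in 008-HIVE-2246.mysql.sql: `migrate` and the dict shapes are hypothetical, and CD_IDs are simply assigned from a counter.

```python
from itertools import count
from typing import Dict, List, Tuple

# Hypothetical column representation: (name, type).
Column = Tuple[str, str]


def migrate(tables: Dict[str, List[Column]], partitions: Dict[str, str]):
    """Wrap each table's columns in one column descriptor and share it.

    tables:     table name -> column list
    partitions: partition name -> parent table name
    Returns (cds, table_cd, partition_cd), where cds maps CD_ID -> columns.
    """
    ids = count(1)
    cds: Dict[int, List[Column]] = {}
    table_cd: Dict[str, int] = {}
    for name, cols in tables.items():
        cd_id = next(ids)
        cds[cd_id] = list(cols)
        table_cd[name] = cd_id
    # Every partition reuses its parent's descriptor instead of its own copy.
    partition_cd = {part: table_cd[parent] for part, parent in partitions.items()}
    return cds, table_cd, partition_cd
```

However many partitions a table has, the model stores its column list once, which is the deduplication the patch is after.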
> 
> When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same.  Otherwise, create a new column descriptor for its columns.
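
A minimal sketch of that rule, again as a Python model rather than the patch's Java. `cd_for_partition` is a hypothetical helper, and comparing columns by plain list equality is an assumption:

```python
from typing import Dict, List, Tuple

# Hypothetical column representation: (name, type).
Column = Tuple[str, str]


def cd_for_partition(part_cols: List[Column], table_cd_id: int,
                     cds: Dict[int, List[Column]],
                     next_cd_id: int) -> Tuple[int, int]:
    """Pick the column descriptor for a partition being added or altered.

    Returns (cd_id_to_use, next_unused_cd_id).
    """
    if cds[table_cd_id] == list(part_cols):
        # Same columns as the parent table: share the table's descriptor.
        return table_cd_id, next_cd_id
    # Columns differ: store them under a fresh descriptor.
    cds[next_cd_id] = list(part_cols)
    return next_cd_id, next_cd_id + 1
```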
> 
> When adding or altering a table, create a new column descriptor every time.
> 
> Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the SDS table.  That is, check to see if any other storage descriptors point to that column descriptor.  If none do, then delete that column descriptor.  This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.
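
The reference check on drop can be sketched like this. It is an in-memory model, assuming `sds` maps each storage descriptor ID to the CD_ID it points at; `drop_storage_descriptor` is a hypothetical name, and the real patch would run the equivalent lookup against the metastore database:

```python
from typing import Dict, List, Tuple

# Hypothetical column representation: (name, type).
Column = Tuple[str, str]


def drop_storage_descriptor(sd_id: int, sds: Dict[int, int],
                            cds: Dict[int, List[Column]]) -> bool:
    """Drop one storage descriptor and clean up its column descriptor.

    sds: SD_ID -> CD_ID, cds: CD_ID -> columns.
    Returns True if the column descriptor was deleted as well.
    """
    cd_id = sds.pop(sd_id)
    if cd_id in sds.values():
        # Another storage descriptor still points at this CD: keep it.
        return False
    cds.pop(cd_id, None)
    return True
```

The scan over `sds.values()` is linear in this sketch; with a database behind it, the same question would be an indexed lookup on the CD_ID foreign key.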
> 
> 
> This addresses bug HIVE-2246.
>     https://issues.apache.org/jira/browse/HIVE-2246
> 
> 
> Diffs
> -----
> 
>   trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION 
>   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
>   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
>   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
>   trunk/metastore/src/model/package.jdo 1148945 
> 
> Diff: https://reviews.apache.org/r/1183/diff
> 
> 
> Testing
> -------
> 
> Passes Facebook's regression testing and all existing test cases.  In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB.  After migration, the overhead was ~1.5 GB.
> 
> 
> Thanks,
> 
> Sohan
> 
>

