hive-dev mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2246) Dedupe tables' column schemas from partitions in the metastore db
Date Mon, 08 Aug 2011 20:55:29 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081188#comment-13081188 ]

jiraposter@reviews.apache.org commented on HIVE-2246:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
-----------------------------------------------------------

(Updated 2011-08-08 20:55:11.546253)


Review request for hive, Ning Zhang and Paul Yang.


Changes
-------

Added a Derby upgrade script and a script to revert the upgrade.


Summary
-------

This patch tries to make minimal changes to the API while keeping migration short and somewhat
easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it only stores a
CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A Column Descriptor
holds a list of columns.  COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID
which describes its columns.
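
As a rough illustration of the three tables described above, here is a minimal sketch using
Python's sqlite3 module in place of the real metastore database.  Only the table names and
CD_ID come from this review; SD_ID, COLUMN_NAME, TYPE_NAME, INTEGER_IDX and the choice of
keys are assumptions made for the example, not the contents of the upgrade scripts.

```python
import sqlite3

# Illustrative only: column names other than CD_ID are assumptions, and
# SQLite stands in for the real metastore database.
SCHEMA = """
CREATE TABLE CDS (
    CD_ID INTEGER PRIMARY KEY
);
CREATE TABLE COLUMNS_V2 (
    CD_ID       INTEGER NOT NULL REFERENCES CDS (CD_ID),
    COLUMN_NAME TEXT    NOT NULL,
    TYPE_NAME   TEXT    NOT NULL,
    INTEGER_IDX INTEGER NOT NULL,
    PRIMARY KEY (CD_ID, COLUMN_NAME)
);
CREATE TABLE SDS (
    SD_ID INTEGER PRIMARY KEY,
    CD_ID INTEGER REFERENCES CDS (CD_ID)
);
"""


def build_demo_db():
    """Create an in-memory database with one descriptor shared by three SDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO CDS (CD_ID) VALUES (1)")
    conn.executemany(
        "INSERT INTO COLUMNS_V2 VALUES (1, ?, ?, ?)",
        [("id", "bigint", 0), ("name", "string", 1)],
    )
    # One table SD and two partition SDs all reference descriptor 1.
    conn.executemany(
        "INSERT INTO SDS (SD_ID, CD_ID) VALUES (?, 1)", [(1,), (2,), (3,)]
    )
    conn.commit()
    return conn


def columns_for_sd(conn, sd_id):
    """Resolve a storage descriptor's columns through its column descriptor."""
    return conn.execute(
        "SELECT c.COLUMN_NAME, c.TYPE_NAME FROM SDS s "
        "JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID "
        "WHERE s.SD_ID = ? ORDER BY c.INTEGER_IDX",
        (sd_id,),
    ).fetchall()


conn = build_demo_db()
column_rows = conn.execute("SELECT COUNT(*) FROM COLUMNS_V2").fetchone()[0]
sd_rows = conn.execute("SELECT COUNT(*) FROM SDS").fetchone()[0]
```

In this toy database three storage descriptors share one descriptor, so the two column rows
are stored once rather than three times.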

During migration, we create Column Descriptors for tables in a straightforward manner: their
columns are now just wrapped inside a column descriptor.  The SDS of partitions use their
parent table's column descriptor, since currently a partition and its table share the same
list of columns.
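
The migration step above could be sketched roughly as follows.  This is an in-memory
illustration, not the SQL in the upgrade scripts; the function name, the dictionary layout
and the way CD_IDs are numbered are all assumptions made for the example.

```python
def migrate(table_columns, partition_parent):
    """Wrap each table's columns in a new column descriptor and point every
    partition at its parent table's descriptor.

    table_columns:    table name -> list of (column name, type) pairs
    partition_parent: partition name -> parent table name
    Returns (descriptors, table_cd, partition_cd); ids are hypothetical.
    """
    descriptors = {}  # cd_id -> list of columns
    table_cd = {}     # table name -> cd_id
    for cd_id, (table, columns) in enumerate(sorted(table_columns.items()), start=1):
        descriptors[cd_id] = list(columns)
        table_cd[table] = cd_id
    # Partitions get no column copies of their own, only a reference.
    partition_cd = {
        part: table_cd[parent] for part, parent in partition_parent.items()
    }
    return descriptors, table_cd, partition_cd
```

For example, a table "t" with two partitions ends up with one descriptor that all three
storage descriptors reference.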

When altering or adding a partition, give it its parent table's column descriptor IF the
columns they describe are the same.  Otherwise, create a new column descriptor for its columns.

When adding or altering a table, create a new column descriptor every time.
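
The two rules above (a partition reuses its parent table's descriptor when the columns match;
a table always gets a new one) can be sketched as below.  The class and function names are
hypothetical and are not the methods in ObjectStore.java; columns are simplified to
(name, type) pairs.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple

Column = Tuple[str, str]  # (name, type); a simplified stand-in for MFieldSchema

_next_cd_id = count(1)  # hypothetical id generator for the sketch


@dataclass
class ColumnDescriptor:
    cd_id: int
    columns: List[Column]


def new_descriptor(columns):
    """Allocate a fresh descriptor holding a copy of the column list."""
    return ColumnDescriptor(next(_next_cd_id), list(columns))


def descriptor_for_table(columns):
    """Adding or altering a table creates a new descriptor every time."""
    return new_descriptor(columns)


def descriptor_for_partition(table_cd, partition_columns):
    """Reuse the parent table's descriptor only if the columns are the same."""
    if list(partition_columns) == table_cd.columns:
        return table_cd
    return new_descriptor(partition_columns)
```

The comparison here is order-sensitive list equality, which is an assumption about what
"the same" means for two column lists.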

Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check
whether any other storage descriptors still point to the related column descriptor.  If none
do, then delete that column descriptor.  This check is in place so we don't have unreferenced
column descriptors and columns hanging around after schema evolution for tables.
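
A minimal sketch of that cleanup check, using plain dictionaries instead of the metastore
tables.  The function name and data layout are assumptions made for illustration only.

```python
def drop_storage_descriptor(sd_id, sd_to_cd, cd_columns):
    """Drop one storage descriptor and garbage-collect its column descriptor
    if nothing else references it.

    sd_to_cd:   storage descriptor id -> column descriptor id
    cd_columns: column descriptor id -> list of columns
    Returns True if the column descriptor was deleted as well.
    """
    cd_id = sd_to_cd.pop(sd_id)
    # Any remaining storage descriptor pointing at cd_id keeps it alive.
    if cd_id in sd_to_cd.values():
        return False
    del cd_columns[cd_id]
    return True
```

The scan over all remaining storage descriptors is only for clarity; a database would
answer the same question with a lookup on the foreign key.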


This addresses bug HIVE-2246.
    https://issues.apache.org/jira/browse/HIVE-2246


Diffs (updated)
-----

  trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql PRE-CREATION 
  trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql PRE-CREATION 
  trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION 
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927 
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927 
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION 
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927 
  trunk/metastore/src/model/package.jdo 1153927 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927 

Diff: https://reviews.apache.org/r/1183/diff


Testing
-------

Passes Facebook's regression testing and all existing test cases.  In one instance, before
migration, the overhead involved with storage descriptors and columns was ~11 GB.  After migration,
the overhead was ~1.5 GB.


Thanks,

Sohan



> Dedupe tables' column schemas from partitions in the metastore db
> -----------------------------------------------------------------
>
>                 Key: HIVE-2246
>                 URL: https://issues.apache.org/jira/browse/HIVE-2246
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>            Reporter: Sohan Jain
>            Assignee: Sohan Jain
>         Attachments: HIVE-2246.2.patch, HIVE-2246.3.patch, HIVE-2246.4.patch
>
>
> Note: this patch proposes a schema change, and is therefore incompatible with the current
metastore.
> We can re-organize the JDO models to reduce space usage to keep the metastore scalable
for the future.  Currently, partitions are the fastest growing objects in the metastore, and
the metastore keeps a separate copy of the columns list for each partition.  We can normalize
the metastore db by decoupling Columns from Storage Descriptors and not storing duplicate
lists of the columns for each partition. 
> An idea is to create an additional level of indirection with a "Column Descriptor" that
has a list of columns.  A table has a reference to its latest Column Descriptor (note: a table
may have more than one Column Descriptor in the case of schema evolution).  Partitions and
Indexes can reference the same Column Descriptors as their parent table.
> Currently, the COLUMNS table in the metastore has roughly (number of partitions + number
of tables) * (average number of columns per table) rows.  We can reduce this to (number of
tables) * (average number of columns per table) rows, while incurring a small cost proportional
to the number of tables to store the Column Descriptors.
> Please see the latest review board for additional implementation details.
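
To make the row-count estimate in the issue description concrete, here is the arithmetic
with made-up numbers.  The counts below (1,000 tables, 1,000,000 partitions, 20 columns per
table) are purely hypothetical and are not measurements from any metastore.

```python
def column_rows_before(tables, partitions, avg_cols):
    """Every table and every partition stores its own copy of the columns."""
    return (partitions + tables) * avg_cols


def column_rows_after(tables, avg_cols):
    """Only tables store columns; partitions reference the table's descriptor."""
    return tables * avg_cols


def descriptor_rows(tables):
    """Extra cost: roughly one Column Descriptor row per table."""
    return tables


# Hypothetical sizes chosen only to illustrate the formula.
tables, partitions, avg_cols = 1_000, 1_000_000, 20
before = column_rows_before(tables, partitions, avg_cols)  # 20,020,000
after = column_rows_after(tables, avg_cols)                # 20,000
extra = descriptor_rows(tables)                            # 1,000
```

This ignores schema evolution and partitions whose columns differ from their table's, both
of which would add further descriptors.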

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
