hive-dev mailing list archives

From "Namit Jain (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2246) Dedupe tables' column schemas from partitions in the metastore db
Date Fri, 02 Dec 2011 17:49:40 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161747#comment-13161747 ]

Namit Jain commented on HIVE-2246:
----------------------------------

Note that there is a bug in the upgrade script: after it runs, the partition-level column
information is lost, and every partition inherits its columns from the table definition.
This is not a serious problem, because Hive does not really use the partition column
information. The only command whose results will change is:

describe table <T> partition <P>;
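
As a rough illustration of the effect described above, here is a minimal Python sketch. The dictionaries, the table and partition names, and the `run_upgrade` function are all hypothetical stand-ins; they are not the actual upgrade script or the metastore schema. The sketch assumes an older partition whose column list differs from its table's.

```python
# Hypothetical in-memory stand-in for the metastore; not the real schema.
table_columns = {"T": ["id", "name", "email"]}

# Before the upgrade, each partition keeps its own column list. The older
# partition here predates the addition of the "email" column.
partition_columns = {
    ("T", "ds=2011-01-01"): ["id", "name"],
    ("T", "ds=2011-12-01"): ["id", "name", "email"],
}


def run_upgrade(table_columns, partition_columns):
    """Mimic the reported behaviour: every partition inherits its table's columns."""
    return {
        (table, part): list(table_columns[table])
        for (table, part) in partition_columns
    }


after_upgrade = run_upgrade(table_columns, partition_columns)

# The partition-specific column list is gone; only a describe of the
# older partition would reveal the difference.
print(after_upgrade[("T", "ds=2011-01-01")])
```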
                
> Dedupe tables' column schemas from partitions in the metastore db
> -----------------------------------------------------------------
>
>                 Key: HIVE-2246
>                 URL: https://issues.apache.org/jira/browse/HIVE-2246
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>            Reporter: Sohan Jain
>            Assignee: Sohan Jain
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2246.2.patch, HIVE-2246.3.patch, HIVE-2246.4.patch, HIVE-2246.8.patch
>
>
> Note: this patch proposes a schema change, and is therefore incompatible with the current
> metastore.
> We can re-organize the JDO models to reduce space usage and keep the metastore scalable
> for the future.  Currently, partitions are the fastest growing objects in the metastore, and
> the metastore keeps a separate copy of the columns list for each partition.  We can normalize
> the metastore db by decoupling Columns from Storage Descriptors and not storing duplicate
> lists of the columns for each partition.
> An idea is to create an additional level of indirection with a "Column Descriptor" that
> has a list of columns.  A table has a reference to its latest Column Descriptor (note: a table
> may have more than one Column Descriptor in the case of schema evolution).  Partitions and
> Indexes can reference the same Column Descriptors as their parent table.
> Currently, the COLUMNS table in the metastore has roughly (number of partitions + number
> of tables) * (average number of columns per table) rows.  We can reduce this to (number of
> tables) * (average number of columns per table) rows, while incurring a small cost proportional
> to the number of tables to store the Column Descriptors.
> Please see the latest review board for additional implementation details.
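
The Column Descriptor indirection and the row-count estimate in the description above can be sketched roughly as follows. This is a minimal Python model, not the patch's JDO classes: the class names, fields, and helper functions are assumptions made for illustration, and the table, partition, and column counts at the end are made-up figures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ColumnDescriptor:
    # One shared list of (column name, column type) pairs.
    columns: List[Tuple[str, str]]


@dataclass
class Partition:
    name: str
    column_descriptor: ColumnDescriptor  # a reference, not a copy


@dataclass
class Table:
    name: str
    column_descriptor: ColumnDescriptor  # the table's latest descriptor
    partitions: List[Partition] = field(default_factory=list)

    def add_partition(self, name: str) -> Partition:
        # A new partition points at the table's current descriptor.
        part = Partition(name, self.column_descriptor)
        self.partitions.append(part)
        return part


def column_rows_before(num_tables: int, num_partitions: int, avg_columns: int) -> int:
    # Every table and every partition stores its own copy of the columns.
    return (num_partitions + num_tables) * avg_columns


def column_rows_after(num_tables: int, avg_columns: int) -> int:
    # Assumes one descriptor per table; schema evolution adds a few more.
    return num_tables * avg_columns


v1 = ColumnDescriptor([("id", "int"), ("name", "string")])
table = Table("T", v1)
old_part = table.add_partition("ds=2011-01-01")

# Schema evolution: the table moves to a new descriptor, while the
# existing partition keeps referencing the old one.
v2 = ColumnDescriptor(v1.columns + [("email", "string")])
table.column_descriptor = v2
new_part = table.add_partition("ds=2011-12-01")

print(old_part.column_descriptor is v1, new_part.column_descriptor is v2)
print(column_rows_before(1000, 1_000_000, 20), column_rows_after(1000, 20))
```

With these illustrative figures, the partition term dominates the first count, which is why sharing descriptors removes almost all of the rows.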

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
