hive-dev mailing list archives

From "Sohan Jain" <sohanj...@fb.com>
Subject Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db
Date Thu, 30 Jun 2011 22:24:01 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/985/
-----------------------------------------------------------

Review request for hive.


Summary
-------

We can reorganize the JDO models to reduce space usage and keep the metastore scalable.
Currently, partitions are the fastest-growing objects in the metastore, and the metastore
keeps a separate copy of the column list for each partition. We can normalize the metastore
db by decoupling Columns from Storage Descriptors, so that a duplicate column list is no
longer stored for each partition.

One idea is to add a level of indirection: a "Column Descriptor" that holds a list of
columns. A table has a reference to its latest Column Descriptor (note: a table may have
more than one Column Descriptor in the case of schema evolution). Partitions and Indexes
can reference the same Column Descriptors as their parent table.
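
The indirection described above can be sketched roughly as follows. This is only an
illustration, not the code in the patch: the classes FieldSchema, ColumnDescriptor,
Partition and Table below are simplified, hypothetical stand-ins for the metastore model
classes, and every field and method name is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-ins for the metastore model classes.
// All names here are assumptions made for illustration only.
class FieldSchema {
    final String name;
    final String type;

    FieldSchema(String name, String type) {
        this.name = name;
        this.type = type;
    }
}

// Holds one column list, stored once and shared by reference.
class ColumnDescriptor {
    final List<FieldSchema> cols;

    ColumnDescriptor(List<FieldSchema> cols) {
        this.cols = Collections.unmodifiableList(new ArrayList<>(cols));
    }
}

class Partition {
    final String location;
    final ColumnDescriptor cd; // reference to a descriptor, not a copy of the columns

    Partition(String location, ColumnDescriptor cd) {
        this.location = location;
        this.cd = cd;
    }
}

class Table {
    final String name;
    ColumnDescriptor cd; // the table's latest Column Descriptor

    Table(String name, List<FieldSchema> cols) {
        this.name = name;
        this.cd = new ColumnDescriptor(cols);
    }

    // A new partition reuses the table's current descriptor.
    Partition addPartition(String location) {
        return new Partition(location, cd);
    }

    // Schema evolution creates a new descriptor; partitions created
    // earlier keep pointing at the descriptor they were created with.
    void evolveSchema(List<FieldSchema> newCols) {
        this.cd = new ColumnDescriptor(newCols);
    }
}
```

In this sketch, two partitions added before a schema change share a single
ColumnDescriptor instance with the table, and a partition added after evolveSchema shares
the new one.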

Currently, the COLUMNS table in the metastore has roughly (number of partitions + number of
tables) * (average number of columns per table) rows. We can reduce this to (number of tables)
* (average number of columns per table) rows, while incurring a small cost proportional to
the number of tables to store the Column Descriptors.
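
To make the estimate concrete, the two formulas above can be written as a small helper.
The class and method names are hypothetical, and descriptorRows assumes exactly one Column
Descriptor per table (that is, no schema evolution), which is a simplification.

```java
// Hypothetical helper for the row-count estimates; not part of the patch.
final class ColumnsRowEstimate {
    private ColumnsRowEstimate() {}

    // Current layout: one column list per table and per partition.
    static long currentRows(long tables, long partitions, long avgColsPerTable) {
        return (partitions + tables) * avgColsPerTable;
    }

    // Proposed layout: one column list per table.
    static long proposedRows(long tables, long avgColsPerTable) {
        return tables * avgColsPerTable;
    }

    // Extra rows for the descriptors themselves, assuming one
    // Column Descriptor per table (no schema evolution).
    static long descriptorRows(long tables) {
        return tables;
    }
}
```

With made-up inputs of 1,000 tables, 1,000,000 partitions and an average of 20 columns per
table, currentRows gives 20,020,000 and proposedRows gives 20,000, plus 1,000 descriptor
rows under the assumption above.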


This addresses bug HIVE-2246.
    https://issues.apache.org/jira/browse/HIVE-2246


Diffs
-----

  trunk/metastore/if/hive_metastore.thrift 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MDatabase.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MFieldSchema.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MIndex.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartition.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MTable.java 1140399
  trunk/metastore/src/model/package.jdo 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/TableBasedIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 1140399

Diff: https://reviews.apache.org/r/985/diff


Testing
-------

I haven't run any unit tests yet; only qualitative testing so far.


Thanks,

Sohan

