hive-dev mailing list archives

From "Shreepadma Venugopalan" <shreepa...@cloudera.com>
Subject Re: Review Request: HIVE-1362: Support for column statistics in Hive
Date Mon, 22 Oct 2012 06:27:44 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6878/
-----------------------------------------------------------

(Updated Oct. 22, 2012, 6:27 a.m.)


Review request for hive and Carl Steinbach.


Changes
-------

This revision contains the changes requested in revision 5. It also adds a new config param,
HIVE_STATS_NDV_ERROR, which allows the user to specify the standard error allowance for NDV
(number of distinct values) estimates. The error percentage is used to determine the number of
bit vectors used in the computation. A lower error value means a larger number of bit vectors
and a greater compute cost. For instance, a standard error of 10% translates to 64 bit vectors.
For the table of error and bit vectors, please refer to the error table in the original paper
(Probabilistic Counting Algorithms for Data Base Applications, Flajolet, P., Martin, N.G.).
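
As a rough illustration of the error/bit-vector trade-off described above, here is a minimal
sketch. It assumes the asymptotic approximation from the Flajolet-Martin paper, standard error
~= 0.78 / sqrt(m) for m bit vectors, and it assumes the count is rounded up to a power of two.
The class and method names are hypothetical; the patch itself may simply look the value up in
the paper's table.

```java
// Hypothetical helper, not part of the patch: maps a target standard error
// to a number of bit vectors using the approximation error ~= 0.78 / sqrt(m).
class NdvErrorSketch {

    // Smallest power of two that is >= n (assumption: a power-of-two count is wanted).
    static int nextPowerOfTwo(int n) {
        if (n <= 1) {
            return 1;
        }
        return Integer.highestOneBit(n - 1) << 1;
    }

    // standardError is a fraction, e.g. 0.10 for 10%.
    static int bitVectorsFor(double standardError) {
        if (standardError <= 0.0) {
            throw new IllegalArgumentException("standard error must be positive");
        }
        // Solve 0.78 / sqrt(m) <= standardError for m.
        double ratio = 0.78 / standardError;
        int minVectors = (int) Math.ceil(ratio * ratio);
        return nextPowerOfTwo(minVectors);
    }

    public static void main(String[] args) {
        // A 10% standard error gives 64, matching the figure quoted above.
        System.out.println(bitVectorsFor(0.10));
    }
}
```

Halving the allowed error roughly quadruples the number of bit vectors under this
approximation, which is why a lower error setting costs more to compute.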


Description
-------

This patch implements version 1 of the column statistics project in Hive. It adds support
for computing and persisting a statistical summary of column values in Hive tables and partitions.
To support column statistics in Hive, this patch does the following:

* Adds a new compute_stats UDAF to compute scalar statistics for all primitive Hive data types.
In version 1 of the project, we support the following scalar statistics on primitive types:
an estimate of the number of distinct values, the number of null values, the number of trues/falses
for boolean typed columns, the max and avg length for string and binary typed columns, and the max
and min value for long and double typed columns. Note that version 1 of the column stats project
includes support for column statistics at both the table and partition level.

* Adds Metastore schema tables to persist the newly added statistics both at table and partition
level.
* Adds Metastore Thrift API to persist, retrieve and delete column statistics at both table
and partition level. 
Please refer to the following wiki link for the details of the schema and the Thrift API changes
- https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

* Extends the analyze table compute statistics statement to trigger statistics computation
and persistence for one or more columns. Please note that statistics for multiple columns
are computed through a single scan of the table data. Please refer to the following wiki link
for the syntax changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
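
To make the "single scan" point concrete, here is a minimal sketch of accumulating per-column
summaries for several long-typed columns in one pass over the rows. This is not the patch's
GenericUDAFComputeStats; the class names and the in-memory row representation are assumptions
made for illustration only, and only null count, min and max are tracked.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not a Hive class.
class SingleScanColumnStats {

    // Running summary for one long-typed column.
    static final class LongColumnSummary {
        long numNulls = 0;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;

        void update(Long value) {
            if (value == null) {
                numNulls++;
                return;
            }
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
    }

    // Visits every row exactly once and updates all column summaries as it goes.
    static LongColumnSummary[] scan(List<Long[]> rows, int numColumns) {
        LongColumnSummary[] summaries = new LongColumnSummary[numColumns];
        for (int c = 0; c < numColumns; c++) {
            summaries[c] = new LongColumnSummary();
        }
        for (Long[] row : rows) {
            for (int c = 0; c < numColumns; c++) {
                summaries[c].update(row[c]);
            }
        }
        return summaries;
    }

    public static void main(String[] args) {
        List<Long[]> rows = Arrays.asList(
            new Long[] {3L, null},
            new Long[] {7L, 10L},
            new Long[] {null, -2L});
        LongColumnSummary[] s = scan(rows, 2);
        for (int c = 0; c < s.length; c++) {
            System.out.println("col " + c + ": nulls=" + s[c].numNulls
                + " min=" + s[c].min + " max=" + s[c].max);
        }
    }
}
```

The cost of reading the data is paid once regardless of how many columns are requested; only
the per-row bookkeeping grows with the number of columns.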

One thing missing from the patch at this point is the metastore upgrade scripts for MySQL/Derby/Postgres/Oracle.
I'm waiting for the review to finalize the metastore schema changes before I go ahead and
add the upgrade scripts.

In a follow-on patch, as part of version 2 of the column statistics project, we will add support
for computing, persisting and retrieving histograms on long and double typed column values.

Generated Thrift files have been removed to make the diff easier to review. The JIRA page has
the patch with the generated Thrift files.


This addresses bug HIVE-1362.
    https://issues.apache.org/jira/browse/HIVE-1362


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java f86d6a7 
  conf/hive-default.xml.template 4a59fb6 
  data/files/UserVisits.dat PRE-CREATION 
  data/files/bool.txt PRE-CREATION 
  data/files/double.txt PRE-CREATION 
  data/files/employee.dat PRE-CREATION 
  data/files/employee2.dat PRE-CREATION 
  data/files/int.txt PRE-CREATION 
  ivy/libraries.properties 7ac6778 
  metastore/if/hive_metastore.thrift d4fad72 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 8fec13d 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 17b986c 
  metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 3883b5b 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java eff44b1 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java bf5ae3a 
  metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa 
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java PRE-CREATION 
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java PRE-CREATION 
  metastore/src/model/package.jdo 38ce6d5 
  metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java 528a100 
  metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java 925938d 
  ql/build.xml 5de3f78 
  ql/if/queryplan.thrift 05fbf58 
  ql/ivy.xml aa3b8ce 
  ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d 
  ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 4c8831f 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952 
  ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 7440889 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java 0b55ac4 
  ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java c9e356a 
  ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java 5fc6a4f 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java e75a075 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java 61bc7fd 
  ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java 6024dd4 
  ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 5884328 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 09ef969 
  ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6 
  ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java a8aef4c 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java bdae9d5 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java e77d59e 
  ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java 11db6b7 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java PRE-CREATION 
  ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION 
  ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION 
  ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/show_functions.q.out 02f6a94 
  ql/src/test/results/clientpositive/udaf_histogram.q.out PRE-CREATION 
  serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java 5430814 

Diff: https://reviews.apache.org/r/6878/diff/


Testing
-------

All the existing Hive tests pass. Additionally, this patch adds the following unit tests:

* Tests in TestHiveMetaStore.java to test the Metastore schema and Thrift API changes,
* Tests to exercise the compute_stats UDAF for all primitive types,
* End-to-end tests at both the table and partition level for computing stats on multiple columns.
Note that these tests use the extended syntax of the analyze command.


Thanks,

Shreepadma Venugopalan

