hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shreepadma Venugopalan" <shreepa...@cloudera.com>
Subject Re: Review Request: HIVE-1362: Support for column statistics in Hive
Date Wed, 03 Oct 2012 03:12:15 GMT


> On Sept. 13, 2012, 9:58 p.m., Carl Steinbach wrote:
> > Made it most of the way through the first page of the review. More comments to follow.

Since Hive doesn't interpret binary data, I was using string data for testing. However as
you suggest in the comment below, I'll be use the table that you have been using for HS2.


> On Sept. 13, 2012, 9:58 p.m., Carl Steinbach wrote:
> > data/files/binary.txt, line 1
> > <https://reviews.apache.org/r/6878/diff/2/?file=148716#file148716line1>
> >
> >     This doesn't look like binary data to me?

Since Hive doesn't interpret binary data, I was using string data for testing. However as
you suggest in the comment below, I'll be use the table that you have been using for HS2.


> On Sept. 13, 2012, 9:58 p.m., Carl Steinbach wrote:
> > metastore/if/hive_metastore.thrift, line 213
> > <https://reviews.apache.org/r/6878/diff/2/?file=148722#file148722line213>
> >
> >     isTblLevel looks redundant. If partName is not set, isn't it implicit that isTblLevel==true?

I prefer to add explicit flags to distinguish between different states rather than overload
the existing fields. For instance if the write_column_statistics is passed an object with
a null partName we would end up interpreting it as column statistics for the table. However
this can also occur in the case of an error when a bad column statistics struct is passed.
In the absence of a flag there is no way to tell the difference between the two states.


> On Sept. 13, 2012, 9:58 p.m., Carl Steinbach wrote:
> > metastore/if/hive_metastore.thrift, line 509
> > <https://reviews.apache.org/r/6878/diff/2/?file=148722#file148722line509>
> >
> >     Please add some comments explaining how each of these RPCs behaves, e.g. what
happens if I call delete_column_statistics_by_table on a partition table that contains partition-level
column statistics?

Will add comments.


> On Sept. 13, 2012, 9:58 p.m., Carl Steinbach wrote:
> > metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java, line 2614
> > <https://reviews.apache.org/r/6878/diff/2/?file=148723#file148723line2614>
> >
> >     In my opinion the problem with leaving lots of blank lines in the bodies of
methods is that it makes it harder to spot the beginning and end of functions which are typically
signalled by two lines of whitespace. The convention in this file is to not leave lots of
blank lines in method bodies, and it's kind of distracting to see code that doesn't follow
that convention.

I'll remove some of the line spaces in this file. I personally prefer a line space between
blocks of code in a method - I think that makes the code much more readable in an editor .
For instance I prefer a line space between variable declarations and a try-catch-finally block.
Talking of conventions, there seems to be no one convention in Hive. I try to remain consistent
with the file I'm modifying as much as possible but its hard to follow a convention when its
not very clear what it is.


- Shreepadma


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6878/#review11502
-----------------------------------------------------------


On Oct. 3, 2012, 3:10 a.m., Shreepadma Venugopalan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6878/
> -----------------------------------------------------------
> 
> (Updated Oct. 3, 2012, 3:10 a.m.)
> 
> 
> Review request for hive and Carl Steinbach.
> 
> 
> Description
> -------
> 
> This patch implements version 1 of the column statistics project in Hive. It adds support
for computing and persisting statistical summary of column values in Hive Tables and Partitions.
In order to support column statistics in Hive, this patch does the following,
> 
> * Adds a new compute stats UDAF to compute scalar statistics for all primitive Hive data
types. In version 1 of the project, we support the following scalar statistics on primitive
types - estimate of number of distinct values, number of null values, number of trues/falses
for boolean typed columsn, max and avg length for string and binary typed columns, max and
min value for long and double typed columns. Note that version 1 of the column stats project
includes support for column statistics both at the table and partition level.
> 
> * Adds Metastore schema tables to persist the newly added statistics both at table and
partition level.
> * Adds Metastore Thrift API to persist, retrieve and delete column statistics at both
table and partition level. 
> Please refer to the following wiki link for the details of the schema and the Thrift
API changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
> 
> * Extends the analyze table compute statistics statement to trigger statistics computation
and persistence for one or more columns. Please note that statistics for multiple columns
is computed through a single scan of the table data. Please refer to the following wiki link
for the syntax changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
> 
> One thing missing from the patch at this point is the metastore upgrade scrips for MySQL/Derby/Postgres/Oracle.
I'm waiting for the review to finalize the metastore schema changes before I go ahead and
add the upgrade scripts.
> 
> In a follow on patch, as part of version 2 of the column statistics project, we will
add support for computing, persisting and retrieving histograms on long and double typed column
values.
> 
> Generated Thrift files have been removed for viewing pleasure. JIRA page has the patch
with the generated Thrift files.
> 
> 
> This addresses bug HIVE-1362.
>     https://issues.apache.org/jira/browse/HIVE-1362
> 
> 
> Diffs
> -----
> 
>   data/files/UserVisits.dat PRE-CREATION 
>   data/files/binary.txt PRE-CREATION 
>   data/files/bool.txt PRE-CREATION 
>   data/files/double.txt PRE-CREATION 
>   data/files/employee.dat PRE-CREATION 
>   data/files/employee2.dat PRE-CREATION 
>   data/files/int.txt PRE-CREATION 
>   ivy/libraries.properties 7ac6778 
>   metastore/if/hive_metastore.thrift d4fad72 
>   metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 8fec13d 
>   metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 17b986c

>   metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 3883b5b 
>   metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java eff44b1 
>   metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java bf5ae3a 
>   metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa 
>   metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
PRE-CREATION 
>   metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
PRE-CREATION 
>   metastore/src/model/package.jdo 38ce6d5 
>   metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
528a100 
>   metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java 925938d

>   ql/build.xml 5de3f78 
>   ql/if/queryplan.thrift 05fbf58 
>   ql/ivy.xml aa3b8ce 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 4c8831f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 7440889 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java
0b55ac4 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 344dc69 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java f7257cd 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java e75a075 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java 61bc7fd 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java 6024dd4 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 356779a 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 09ef969 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/LoadSemanticAnalyzer.java 22fa20f 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java b38c002 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 5ce31f1 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java ad1a14c 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java PRE-CREATION

>   ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java cb54753 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java
PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java PRE-CREATION

>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java
PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java PRE-CREATION

>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java
PRE-CREATION 
>   ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_binary.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION 
>   ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_binary.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/show_functions.q.out 02f6a94 
>   ql/src/test/results/clientpositive/udaf_histogram.q.out PRE-CREATION 
>   serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java
5430814 
> 
> Diff: https://reviews.apache.org/r/6878/diff/
> 
> 
> Testing
> -------
> 
> All the existing hive tests pass. Additionally this patch adds the following unit tests,
> 
> * Tests to TestHiveMetaStore.java to test the Metastore schema and Thrift API changes,
> * Tests to exercise compute_stats UDAF for all primitive types,
> * End to end test both at table and partition level for computing stats on multiple columns.
Note that these tests use the extended syntax of the analyze command.
> 
> 
> Thanks,
> 
> Shreepadma Venugopalan
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message