Delivered-To: mailing list dev@hive.apache.org
Subject: Re: Review Request: HIVE-1362: Support for column statistics in Hive
From: "Shreepadma Venugopalan"
To: "Carl Steinbach"
Cc: "hive", "Shreepadma Venugopalan"
Date: Wed, 03 Oct 2012 03:10:51 -0000
Message-ID: <20121003031051.22634.51598@reviews.apache.org>
References: <20120831220412.27228.31394@reviews.apache.org>
In-Reply-To: <20120831220412.27228.31394@reviews.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/6878/
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="===============3019352662340157392=="

--===============3019352662340157392==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6878/
-----------------------------------------------------------

(Updated Oct. 3, 2012, 3:10 a.m.)


Review request for hive and Carl Steinbach.


Changes
-------

This patch addresses the comments from revision #1 and makes the following changes:
* Splits the ColumnStatistics Thrift structure in such a way that the Thrift API is locked down.
* Splits the writeColumnStatistics API into updateTable.. and updatePartition... to separate partition-level and table-level updates.
* Adds comments to the Thrift RPC calls.
* Logs the record that is being written in update[Table/Partition]ColumnStatistics, so that the bad record can be identified if an update fails.
* Uses a consistent naming convention for the Thrift APIs.
* Incorporates the rest of the miscellaneous review comments from revision #1.


Description
-------

This patch implements version 1 of the column statistics project in Hive. It adds support for computing and persisting statistical summaries of column values in Hive tables and partitions. To support column statistics in Hive, this patch does the following:

* Adds a new compute stats UDAF to compute scalar statistics for all primitive Hive data types. In version 1 of the project, we support the following scalar statistics on primitive types: an estimate of the number of distinct values, the number of null values, the number of trues/falses for boolean typed columns, max and avg length for string and binary typed columns, and max and min value for long and double typed columns. Note that version 1 of the column stats project includes support for column statistics at both the table and partition level.
* Adds Metastore schema tables to persist the newly added statistics at both the table and partition level.
* Adds Metastore Thrift API to persist, retrieve and delete column statistics at both the table and partition level. Please refer to the following wiki link for the details of the schema and the Thrift API changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
* Extends the analyze table compute statistics statement to trigger statistics computation and persistence for one or more columns. Please note that statistics for multiple columns are computed through a single scan of the table data. Please refer to the following wiki link for the syntax changes - https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

One thing missing from the patch at this point is the metastore upgrade scripts for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to finalize the metastore schema changes before I go ahead and add the upgrade scripts.

In a follow-on patch, as part of version 2 of the column statistics project, we will add support for computing, persisting and retrieving histograms on long and double typed column values.

Generated Thrift files have been removed from this diff to make it easier to view. The JIRA page has the patch with the generated Thrift files.


This addresses bug HIVE-1362.
    https://issues.apache.org/jira/browse/HIVE-1362


Diffs (updated)
-----

  data/files/UserVisits.dat PRE-CREATION
  data/files/binary.txt PRE-CREATION
  data/files/bool.txt PRE-CREATION
  data/files/double.txt PRE-CREATION
  data/files/employee.dat PRE-CREATION
  data/files/employee2.dat PRE-CREATION
  data/files/int.txt PRE-CREATION
  ivy/libraries.properties 7ac6778
  metastore/if/hive_metastore.thrift d4fad72
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 8fec13d
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 17b986c
  metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 3883b5b
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java eff44b1
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java bf5ae3a
  metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java PRE-CREATION
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java PRE-CREATION
  metastore/src/model/package.jdo 38ce6d5
  metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java 528a100
  metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java 925938d
  ql/build.xml 5de3f78
  ql/if/queryplan.thrift 05fbf58
  ql/ivy.xml aa3b8ce
  ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d
  ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 4c8831f
  ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952
  ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 7440889
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java 0b55ac4
  ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 344dc69
  ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java f7257cd
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java e75a075
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java 61bc7fd
  ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java 6024dd4
  ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 356779a
  ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 09ef969
  ql/src/java/org/apache/hadoop/hive/ql/parse/LoadSemanticAnalyzer.java 22fa20f
  ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6
  ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java b38c002
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 5ce31f1
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java ad1a14c
  ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java cb54753
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java PRE-CREATION
  ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION
  ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_binary.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION
  ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION
  ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_binary.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION
  ql/src/test/results/clientpositive/show_functions.q.out 02f6a94
  ql/src/test/results/clientpositive/udaf_histogram.q.out PRE-CREATION
  serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java 5430814

Diff: https://reviews.apache.org/r/6878/diff/


Testing
-------

All the existing hive tests pass. Additionally this patch adds the following unit tests,

* Tests to TestHiveMetaStore.java to test the Metastore schema and Thrift API changes,
* Tests to exercise compute_stats UDAF for all primitive types,
* End to end test both at table and partition level for computing stats on multiple columns. Note that these tests use the extended syntax of the analyze command.


Thanks,

Shreepadma Venugopalan


--===============3019352662340157392==--