Return-Path:
X-Original-To: apmail-hive-dev-archive@www.apache.org
Delivered-To: apmail-hive-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 645814019 for ; Thu, 2 Jun 2011 20:37:34 +0000 (UTC)
Received: (qmail 53604 invoked by uid 500); 2 Jun 2011 20:37:34 -0000
Delivered-To: apmail-hive-dev-archive@hive.apache.org
Received: (qmail 53569 invoked by uid 500); 2 Jun 2011 20:37:34 -0000
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@hive.apache.org
Delivered-To: mailing list dev@hive.apache.org
Delivered-To: moderator for dev@hive.apache.org
Received: (qmail 52321 invoked by uid 99); 2 Jun 2011 20:36:47 -0000
Content-Type: multipart/alternative; boundary="===============1287148481057731107=="
MIME-Version: 1.0
Subject: Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
From: "Tomasz Nykiel"
To: "Ning Zhang", "Tomasz Nykiel", "hive"
Date: Thu, 02 Jun 2011 20:36:48 -0000
Message-ID: <20110602203648.27749.84338@reviews.apache.org>
X-ReviewBoard-URL: https://reviews.apache.org
X-ReviewRequest-URL: https://reviews.apache.org/r/785/
In-Reply-To: <20110526212734.10303.17077@reviews.apache.org>
References: <20110526212734.10303.17077@reviews.apache.org>

--===============1287148481057731107==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
-----------------------------------------------------------

(Updated 2011-06-02 20:36:48.205733)


Review request for hive.


Changes
-------

-Fixed issues pointed out in the review.
-Changed metric name to rawDataSize instead of uncompressedSize


Summary
-------

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands, we collect statistics about the number of rows per partition/table.
Other statistics (e.g., total table/partition size) are derived from the file system.

We introduce a new feature for collecting the size of uncompressed data, so that the efficiency of compression can be determined.
In addition to the new statistic, this patch extends the stats collection mechanism so that further statistics can be added easily.

1. Serializer/deserializer classes are amended to collect the size of uncompressed data when serializing/deserializing objects.
We support:

ColumnarSerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.

3. For INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator, respectively, are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false


This addresses bug HIVE-2185.
    https://issues.apache.org/jira/browse/HIVE-2185


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1130791
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1130791
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1130791
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1130791
  trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1130791
  trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1130791
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1130791
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1130791
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1130791
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1130791
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1130791
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1130791
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1130791
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1130791
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1130791
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1130791
  trunk/ql/src/test/results/clientpositive/merge4.q.out 1130791
  trunk/ql/src/test/results/clientpositive/pcr.q.out 1130791
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1130791
  trunk/ql/src/test/results/clientpositive/stats11.q.out 1130791
  trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION
  trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION
  trunk/ql/src/test/results/clientpositive/union22.q.out 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1130791
  trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION

Diff: https://reviews.apache.org/r/785/diff


Testing
-------

- additional JUnit test for Serializer/Deserializer amended classes
- additional queries for TestCliDriver over multi-partition tables
- all other JUnit tests
- standalone setup


Thanks,

Tomasz


--===============1287148481057731107==--