hive-dev mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
Date Thu, 02 Jun 2011 20:39:48 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043034#comment-13043034 ]

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
-----------------------------------------------------------

(Updated 2011-06-02 20:36:48.205733)


Review request for hive.


Changes
-------

-Fixed issues pointed out in the review.
-Changed metric name to rawDataSize instead of uncompressedSize


Summary
-------

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics
about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file system.

We introduce a new feature for collecting the size of uncompressed data, so that the
efficiency of compression can be determined.
In addition to adding this new statistic, the patch extends the stats collection mechanism
so that new statistics can be added easily.
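
As a rough illustration of how the new statistic could be used to judge compression efficiency, here is a minimal Java sketch. It is not part of the patch: the class and method names are hypothetical, and it assumes the on-disk size (called totalSize here) is available from the file system statistics mentioned above.

```java
/** Hypothetical helper, not part of the patch. */
final class CompressionStats {
    private CompressionStats() {}

    /**
     * Returns rawDataSize / totalSize, i.e. how many uncompressed bytes
     * are represented by one byte on disk. Returns 0.0 when the on-disk
     * size is not positive, so that empty tables do not divide by zero.
     */
    static double compressionRatio(long rawDataSize, long totalSize) {
        if (totalSize <= 0) {
            return 0.0;
        }
        return (double) rawDataSize / totalSize;
    }
}
```

Under these assumptions, a partition with 1000 bytes of raw data stored in 250 bytes on disk has a ratio of 4.0.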

1. Serializer/deserializer classes are amended to collect the size of uncompressed
data when serializing/deserializing objects.
We support:

ColumnarSerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.
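
The idea in (1) can be sketched roughly as follows. This is a self-contained illustration, not the code from the patch: all type and method names below are hypothetical, and it assumes the statistic reported is the byte length of the most recently serialized row, with the caller responsible for summing over rows.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical holder for per-row statistics (not the class added by the patch). */
final class RowStats {
    private long rawDataSize;

    long getRawDataSize() { return rawDataSize; }

    void setRawDataSize(long rawDataSize) { this.rawDataSize = rawDataSize; }
}

/** Hypothetical serializer interface that also exposes statistics. */
interface StatsReportingSerializer {
    byte[] serialize(List<String> fields);

    RowStats getStats();
}

/** Delimited-text serializer that records the uncompressed size of the last row. */
final class SizeTrackingTextSerializer implements StatsReportingSerializer {
    private final RowStats stats = new RowStats();

    @Override
    public byte[] serialize(List<String> fields) {
        // Join fields with the ^A (\u0001) separator and encode as UTF-8.
        byte[] row = String.join("\u0001", fields).getBytes(StandardCharsets.UTF_8);
        // Assumption: raw size is the serialized byte length, separators included.
        stats.setRawDataSize(row.length);
        return row;
    }

    @Override
    public RowStats getStats() { return stats; }
}

/** Serializer without size tracking: its reported raw size always stays 0. */
final class NonTrackingSerializer implements StatsReportingSerializer {
    private final RowStats stats = new RowStats();

    @Override
    public byte[] serialize(List<String> fields) {
        return String.join(",", fields).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public RowStats getStats() { return stats; }
}
```

A caller would read getStats().getRawDataSize() after each row and add it to a running total; a serializer that does not track sizes simply contributes 0.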

2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection
for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator
respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.
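
To illustrate what multi-stats collection in (2) and (3) might look like, here is a minimal in-memory Java sketch. The interfaces and method signatures are assumptions made for this example and may differ from the ones in the patch; the real implementations are backed by JDBC and HBase rather than a map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical publisher: one call carries several named statistics. */
interface MultiStatsPublisher {
    boolean publishStats(String fileId, Map<String, Long> stats);
}

/** Hypothetical aggregator: sums one named statistic over a key prefix. */
interface MultiStatsAggregator {
    long aggregate(String keyPrefix, String statName);
}

/** In-memory stand-in for a JDBC- or HBase-backed statistics store. */
final class InMemoryStatsStore implements MultiStatsPublisher, MultiStatsAggregator {
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    @Override
    public boolean publishStats(String fileId, Map<String, Long> stats) {
        // Repeated publishes for the same fileId are added together.
        Map<String, Long> row = rows.computeIfAbsent(fileId, k -> new HashMap<>());
        for (Map.Entry<String, Long> e : stats.entrySet()) {
            row.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return true;
    }

    @Override
    public long aggregate(String keyPrefix, String statName) {
        long total = 0;
        for (Map.Entry<String, Map<String, Long>> e : rows.entrySet()) {
            if (e.getKey().startsWith(keyPrefix)) {
                // A statistic that was never published for this key counts as 0.
                total += e.getValue().getOrDefault(statName, 0L);
            }
        }
        return total;
    }
}
```

Because statistics are keyed by name, adding a further statistic in this sketch only requires publishing another map entry; neither interface has to change.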

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
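
For illustration only, a check of this setting might look like the following Java sketch. It uses java.util.Properties as a stand-in for the Hive configuration object, and it assumes that collection is enabled when the property is not set; neither detail is confirmed by this message.

```java
import java.util.Properties;

/** Hypothetical helper, not part of the patch. */
final class StatsCollectionConfig {
    static final String COLLECT_UNCOMPRESSED_SIZE_KEY = "hive.stats.collect.uncompressedsize";

    private StatsCollectionConfig() {}

    /** Returns true unless the property is set to something other than "true". */
    static boolean collectUncompressedSize(Properties conf) {
        // Assumption: the default is "true" when the key is absent.
        String value = conf.getProperty(COLLECT_UNCOMPRESSED_SIZE_KEY, "true");
        return Boolean.parseBoolean(value.trim());
    }
}
```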


This addresses bug HIVE-2185.
    https://issues.apache.org/jira/browse/HIVE-2185


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1130791
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1130791
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1130791 
  trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1130791 
  trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1130791
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1130791 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1130791 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/merge4.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/pcr.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/stats11.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/union22.q.out 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1130791
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1130791
  trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION 

Diff: https://reviews.apache.org/r/785/diff


Testing
-------

- additional JUnit test for Serializer/Deserializer amended classes
- additional queries for TestCliDriver over multi-partition tables
- all other JUnit tests
- standalone setup 


Thanks,

Tomasz



> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system.
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
