hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Zhang" <nzh...@fb.com>
Subject Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
Date Tue, 31 May 2011 23:45:39 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review725
-----------------------------------------------------------



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1468>

    For a better debugging info, print out the key and the valid stats keys. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java
<https://reviews.apache.org/r/785/#comment1492>

    currentValue + amount will result in object creation. This is very expensive in the this
case since this function is called for every input row. Instead of using immutable class Long,
LongWritable maybe a better choice. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1501>

    Also consider using LongWritable rather than Long. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1502>

    LongWritable. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1503>

    Can you print the stack trace to LOG rather than to console?



trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java
<https://reviews.apache.org/r/785/#comment1504>

    declaration should be interface List rather than implementation ArrayList.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1506>

    Better use Map rather than HashMap in declaration



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1507>

    Can you change it to use Utilities.executeWithRetry() as well?



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
<https://reviews.apache.org/r/785/#comment1508>

    Let's also put the comment here as in other statements. 



trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out
<https://reviews.apache.org/r/785/#comment1509>

    The uncompressed size is smaller than the totalSize. Can you double check if this is because
of the overhead (headers etc) in the fileformat or because of a bug in the stats?



trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java
<https://reviews.apache.org/r/785/#comment1510>

    Please add some comments here on what it is used for.



trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java
<https://reviews.apache.org/r/785/#comment1511>

    Would it make sense to add the size of field delimiters as well? And if we know the record
delimiters (for most record reader it is a newline), we can add that too. This will make the
stats more accurately reflect the real uncompressed size stored in the file.



trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java
<https://reviews.apache.org/r/785/#comment1512>

    may be simpler just use == rather than ! and ^.  Also consider assert rather than returning
null?



trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
<https://reviews.apache.org/r/785/#comment1513>

    same as above.


- Ning


On 2011-05-26 21:27:34, Tomasz Nykiel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/785/
> -----------------------------------------------------------
> 
> (Updated 2011-05-26 21:27:34)
> 
> 
> Review request for hive.
> 
> 
> Summary
> -------
> 
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics
about the number of rows per partition/table. 
> Other statistics (e.g., total table/partition size) are derived from the file system.
> 
> We introduce a new feature for collecting information about the sizes of uncompressed
data, to be able to determine the efficiency of compression.
> On top of adding the new statistic collected, this patch extends the stats collection
mechanism, so any new statistics could be added easily.
> 
> 1. serializer/deserializer classes are amended to accommodate collecting sizes of uncompressed
data, when serializing/deserializing objects.
> We support:
> 
> Columnar SerDe
> LazySimpleSerDe
> LazyBinarySerDe
> 
> For other SerDe classes the uncompressed siez will be 0.
> 
> 2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection
for both JDBC and HBase.
> 
> 3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator
respectively are extended to support multi-stats collection.
> 
> (2) and (3) enable easy extension for other types of statistics.
> 
> 4. Collecting uncompressed size can be disabled by setting:
> 
> hive.stats.collect.uncompressedsize = false
> 
> 
> This addresses bug HIVE-2185.
>     https://issues.apache.org/jira/browse/HIVE-2185
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1128070 
>   trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1128070

>   trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1128070

>   trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java
1128070 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1128070 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
1128070 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
1128070 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java
1128070 
>   trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION

>   trunk/hbase-handler/src/test/queries/hbase_stats.q 1128070 
>   trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION 
>   trunk/hbase-handler/src/test/results/hbase_stats.q.out 1128070 
>   trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1128070

>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1128070

>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
1128070 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION

>   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1128070 
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION

>   trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1128070 
>   trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/combine2.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/merge3.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/merge4.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/pcr.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/stats11.q.out 1128070 
>   trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/union22.q.out 1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java
1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION

>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java
1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1128070

>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1128070

>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1128070

>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1128070

>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java
1128070 
>   trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1128070

>   trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION

> 
> Diff: https://reviews.apache.org/r/785/diff
> 
> 
> Testing
> -------
> 
> - additional JUnit test for Serializer/Deserializer amended classes
> - additional queries for TestCliDriver over multi-partition tables
> - all other JUnit tests
> - standalone setup 
> 
> 
> Thanks,
> 
> Tomasz
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message