hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ashutosh Chauhan <hashut...@apache.org>
Subject Re: Review Request 60753: Add HLL as an alternative to FM sketch to compute stats
Date Fri, 14 Jul 2017 20:21:55 GMT


> On July 14, 2017, 12:55 a.m., Ashutosh Chauhan wrote:
> > common/src/java/org/apache/hadoop/hive/common/ndv/NumDistinctValueEstimatorFactory.java
> > Lines 30-34 (patched)
> > <https://reviews.apache.org/r/60753/diff/2/?file=1775584#file1775584line30>
> >
> >     As discussed this determination should happen using first byte of serialized
string.
> 
> pengcheng xiong wrote:
>     That is true. But I would prefer to do it in the next following up jira as this will
totally change the serde of hll as well as fm. And, as I analyzed, I may need to do bitpacking
for fm sketch in order to fit 64k limit for varchar in the worst case.

Ok. This can be a follow-up.


> On July 14, 2017, 12:55 a.m., Ashutosh Chauhan wrote:
> > common/src/java/org/apache/hadoop/hive/common/ndv/hll/HLLSparseRegister.java
> > Lines 28-30 (patched)
> > <https://reviews.apache.org/r/60753/diff/2/?file=1775588#file1775588line28>
> >
> >     Is it really worth to have a dependency on 3rd party jar for this class. Will
using TreeMap be really slow?
> >     Javolution library on which we already have a dependcy also provides these fast
data structures. If really needed, we shall use those instead.
> 
> Prasanth_J wrote:
>     The original dependency was added to a have pretty flat treemap + list. Fast path
inserts go to list which will get sorted and merged to treemap. Fastutil has primitive hash
maps/lists with minimal object allocations. 
>     Since we are using HLL in the context of ANALYZE I think this will not have a huge
impact in terms performance. Also the threshold to switch between SPARSE to DENSE is pretty
low (in the order of 1000s), object allocations in inner loop will not be significant. 
>     
>     I agree with Ashutosh, we can remove this runtime dependency as it will not have
big impact. Replacing with java Treemap should work pretty well. If using Treemap becomes
a bottleneck we can replace it later on in a follow up jira.
>     
>     Also I would recommend having these JUnit tests as well to make sure Treemap replacement
goes well
>     https://github.com/prasanthj/hyperloglog/tree/master/src/test/com/github/prasanthj/hll

@Pengcheng,
Lets go with TreeMap for now and simplify this by not bringing a new dependency.


- Ashutosh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60753/#review180494
-----------------------------------------------------------


On July 14, 2017, 12:08 a.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60753/
> -----------------------------------------------------------
> 
> (Updated July 14, 2017, 12:08 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Prasanth_J.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16966
> 
> 
> Diffs
> -----
> 
>   common/pom.xml 023f084511 
>   common/src/java/org/apache/hadoop/hive/common/ndv/NumDistinctValueEstimatorFactory.java
PRE-CREATION 
>   common/src/java/org/apache/hadoop/hive/common/ndv/hll/HLLConstants.java PRE-CREATION

>   common/src/java/org/apache/hadoop/hive/common/ndv/hll/HLLDenseRegister.java PRE-CREATION

>   common/src/java/org/apache/hadoop/hive/common/ndv/hll/HLLRegister.java PRE-CREATION

>   common/src/java/org/apache/hadoop/hive/common/ndv/hll/HLLSparseRegister.java PRE-CREATION

>   common/src/java/org/apache/hadoop/hive/common/ndv/hll/HyperLogLog.java PRE-CREATION

>   common/src/java/org/apache/hadoop/hive/common/ndv/hll/HyperLogLogUtils.java PRE-CREATION

>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java c7afe2bc4a 
>   metastore/src/java/org/apache/hadoop/hive/metastore/NumDistinctValueEstimator.java
92f9a845e3 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/StatsCache.java 18f8afc9ad

>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/ColumnStatsAggregator.java
31955b4363 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/ColumnStatsAggregatorFactory.java
daf85692eb 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/DecimalColumnStatsAggregator.java
36b2c9c56b 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/DoubleColumnStatsAggregator.java
a88ef84e5c 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/LongColumnStatsAggregator.java
8ac6561aec 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/StringColumnStatsAggregator.java
2aa4046a46 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/merge/ColumnStatsMerger.java
33c7e3e52c 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/merge/ColumnStatsMergerFactory.java
fe890e4e27 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/merge/DateColumnStatsMerger.java
3179b23438 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/merge/DecimalColumnStatsMerger.java
c13add9d9c 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/merge/DoubleColumnStatsMerger.java
fbdba24b0a 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/merge/LongColumnStatsMerger.java
ac65590505 
>   metastore/src/java/org/apache/hadoop/hive/metastore/hbase/stats/merge/StringColumnStatsMerger.java
41587477d3 
>   metastore/src/test/org/apache/hadoop/hive/metastore/hbase/TestHBaseAggregateStatsNDVUniformDist.java
87b1ac870d 
>   pom.xml 32f5fd1493 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java 0a5cf00c44

>   ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java 76f7daeb1b 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DecimalNumDistinctValueEstimator.java
a05906edfa 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java
e76fc74dbc 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java 2ebfcb2360

>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java
1c197a028a 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java fa70f49857

>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java
601901c163 
>   ql/src/test/queries/clientpositive/char_udf1.q 39aa0e0e17 
>   ql/src/test/queries/clientpositive/compute_stats_date.q 09128f6fb9 
>   ql/src/test/queries/clientpositive/compute_stats_decimal.q 76e1468ada 
>   ql/src/test/queries/clientpositive/compute_stats_double.q 7a1e0f6295 
>   ql/src/test/queries/clientpositive/compute_stats_long.q 6a2070f780 
>   ql/src/test/queries/clientpositive/compute_stats_string.q 0023e7f6bd 
>   ql/src/test/queries/clientpositive/hll.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/reduceSinkDeDuplication_pRS_key_empty.q 8bbae3914d

>   ql/src/test/queries/clientpositive/varchar_udf1.q 4d1f884ea7 
>   ql/src/test/queries/clientpositive/vector_udf1.q 48d3e1ee4d 
>   ql/src/test/results/clientpositive/alter_partition_update_status.q.out 922822e6d2 
>   ql/src/test/results/clientpositive/alter_table_column_stats.q.out 2cc7cbc7b6 
>   ql/src/test/results/clientpositive/alter_table_update_status.q.out e26e8cba1c 
>   ql/src/test/results/clientpositive/analyze_tbl_part.q.out ed90b6fc92 
>   ql/src/test/results/clientpositive/annotate_stats_deep_filters.q.out 95dd6abaec 
>   ql/src/test/results/clientpositive/annotate_stats_groupby.q.out a8e4854a00 
>   ql/src/test/results/clientpositive/annotate_stats_join.q.out c1a140b558 
>   ql/src/test/results/clientpositive/annotate_stats_join_pkfk.q.out f7d73c9ddf 
>   ql/src/test/results/clientpositive/autoColumnStats_4.q.out fe3b9e53ef 
>   ql/src/test/results/clientpositive/autoColumnStats_5.q.out e19fb5f504 
>   ql/src/test/results/clientpositive/autoColumnStats_6.q.out 29b3373e10 
>   ql/src/test/results/clientpositive/autoColumnStats_7.q.out 9d24bc53ab 
>   ql/src/test/results/clientpositive/autoColumnStats_8.q.out 681d962ed0 
>   ql/src/test/results/clientpositive/autoColumnStats_9.q.out d26e2c02b7 
>   ql/src/test/results/clientpositive/auto_join_without_localtask.q.out 17a912ec13 
>   ql/src/test/results/clientpositive/avro_decimal.q.out 5a3b72defe 
>   ql/src/test/results/clientpositive/avro_decimal_native.q.out fe77512191 
>   ql/src/test/results/clientpositive/cbo_rp_annotate_stats_groupby.q.out f260f034b6 
>   ql/src/test/results/clientpositive/cbo_rp_join0.q.out b9cf3ceab4 
>   ql/src/test/results/clientpositive/char_udf1.q.out 07ce108a75 
>   ql/src/test/results/clientpositive/colstats_all_nulls.q.out 14c5d5b59b 
>   ql/src/test/results/clientpositive/column_pruner_multiple_children.q.out 96feeed49c

>   ql/src/test/results/clientpositive/columnstats_partlvl.q.out 07d26e92bb 
>   ql/src/test/results/clientpositive/columnstats_partlvl_dp.q.out 468d2e797b 
>   ql/src/test/results/clientpositive/columnstats_quoting.q.out 52e35385a1 
>   ql/src/test/results/clientpositive/columnstats_tbllvl.q.out 462d4c1771 
>   ql/src/test/results/clientpositive/compute_stats_date.q.out c2472377a8 
>   ql/src/test/results/clientpositive/compute_stats_decimal.q.out e0584c50a8 
>   ql/src/test/results/clientpositive/compute_stats_double.q.out 5b921735f0 
>   ql/src/test/results/clientpositive/compute_stats_long.q.out 119d1731cc 
>   ql/src/test/results/clientpositive/compute_stats_string.q.out 8c40490bc0 
>   ql/src/test/results/clientpositive/confirm_initial_tbl_stats.q.out faa14ba9c5 
>   ql/src/test/results/clientpositive/constant_prop_2.q.out 24be5188e2 
>   ql/src/test/results/clientpositive/correlated_join_keys.q.out ec5d008728 
>   ql/src/test/results/clientpositive/cross_join_merge.q.out f4956ded22 
>   ql/src/test/results/clientpositive/decimal_stats.q.out 5d86866e2a 
>   ql/src/test/results/clientpositive/describe_table.q.out 7869494252 
>   ql/src/test/results/clientpositive/display_colstats_tbllvl.q.out a4b18d7cec 
>   ql/src/test/results/clientpositive/encrypted/encryption_move_tbl.q.out 06f8408327 
>   ql/src/test/results/clientpositive/exec_parallel_column_stats.q.out f256ec11bf 
>   ql/src/test/results/clientpositive/hll.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/infer_bucket_sort.q.out 1aea388815 
>   ql/src/test/results/clientpositive/join32.q.out a191284aca 
>   ql/src/test/results/clientpositive/join33.q.out a191284aca 
>   ql/src/test/results/clientpositive/join_parse.q.out 17733acdc3 
>   ql/src/test/results/clientpositive/llap/autoColumnStats_2.q.out ec209b2ef1 
>   ql/src/test/results/clientpositive/llap/auto_join1.q.out 6a0a1d5d09 
>   ql/src/test/results/clientpositive/llap/auto_join21.q.out 25cd67e7f5 
>   ql/src/test/results/clientpositive/llap/auto_join29.q.out f693ce4512 
>   ql/src/test/results/clientpositive/llap/auto_join30.q.out 91a80127a9 
>   ql/src/test/results/clientpositive/llap/cluster.q.out 2fa976b6d5 
>   ql/src/test/results/clientpositive/llap/column_table_stats.q.out fb04ee8cf9 
>   ql/src/test/results/clientpositive/llap/column_table_stats_orc.q.out d55cf30331 
>   ql/src/test/results/clientpositive/llap/columnstats_part_coltype.q.out dc50fb7fc1 
>   ql/src/test/results/clientpositive/llap/correlationoptimizer1.q.out fab5c9c029 
>   ql/src/test/results/clientpositive/llap/correlationoptimizer2.q.out ee645410b9 
>   ql/src/test/results/clientpositive/llap/correlationoptimizer3.q.out f1af97d98d 
>   ql/src/test/results/clientpositive/llap/correlationoptimizer6.q.out 9b71d3ec82 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out 6d411365b8 
>   ql/src/test/results/clientpositive/llap/cross_join.q.out ae3f9bf6f5 
>   ql/src/test/results/clientpositive/llap/dynamic_semijoin_reduction.q.out 7fd7e20e15

>   ql/src/test/results/clientpositive/llap/dynamic_semijoin_user_level.q.out c4b18b7118

>   ql/src/test/results/clientpositive/llap/dynpart_sort_optimization2.q.out d4811d64d7

>   ql/src/test/results/clientpositive/llap/explainuser_1.q.out 87b8cae445 
>   ql/src/test/results/clientpositive/llap/explainuser_2.q.out f8a6526c67 
>   ql/src/test/results/clientpositive/llap/explainuser_4.q.out 95d13216e7 
>   ql/src/test/results/clientpositive/llap/extrapolate_part_stats_partial_ndv.q.out d97223c9d0

>   ql/src/test/results/clientpositive/llap/filter_union.q.out c4af31715c 
>   ql/src/test/results/clientpositive/llap/groupby1.q.out 0eecbb6f4e 
>   ql/src/test/results/clientpositive/llap/groupby2.q.out 29b85d1f44 
>   ql/src/test/results/clientpositive/llap/groupby_resolution.q.out f2a6ab05ac 
>   ql/src/test/results/clientpositive/llap/having.q.out 267254c0de 
>   ql/src/test/results/clientpositive/llap/hybridgrace_hashjoin_1.q.out 083bfc301c 
>   ql/src/test/results/clientpositive/llap/hybridgrace_hashjoin_2.q.out a59188a46a 
>   ql/src/test/results/clientpositive/llap/identity_project_remove_skip.q.out b03b96b463

>   ql/src/test/results/clientpositive/llap/jdbc_handler.q.out b4feb0ee1b 
>   ql/src/test/results/clientpositive/llap/join1.q.out d79a405a41 
>   ql/src/test/results/clientpositive/llap/join32_lessSize.q.out c226eed126 
>   ql/src/test/results/clientpositive/llap/join_max_hashtable.q.out 85d45fe712 
>   ql/src/test/results/clientpositive/llap/limit_join_transpose.q.out 99d8d6f664 
>   ql/src/test/results/clientpositive/llap/limit_pushdown.q.out 93315e6b1b 
>   ql/src/test/results/clientpositive/llap/limit_pushdown3.q.out 351ee01b45 
>   ql/src/test/results/clientpositive/llap/llap_smb.q.out e3044baf52 
>   ql/src/test/results/clientpositive/llap/llap_stats.q.out f81ad50679 
>   ql/src/test/results/clientpositive/llap/llap_vector_nohybridgrace.q.out 0e9e1207cb

>   ql/src/test/results/clientpositive/llap/llapdecider.q.out 69312cd6a2 
>   ql/src/test/results/clientpositive/llap/merge1.q.out 8021b67733 
>   ql/src/test/results/clientpositive/llap/merge2.q.out 7bcdd2d57e 
>   ql/src/test/results/clientpositive/llap/mergejoin.q.out 10fb45d284 
>   ql/src/test/results/clientpositive/llap/mrr.q.out fe477fd815 
>   ql/src/test/results/clientpositive/llap/multiMapJoin2.q.out 25378d3e8b 
>   ql/src/test/results/clientpositive/llap/offset_limit.q.out adfeb05448 
>   ql/src/test/results/clientpositive/llap/offset_limit_ppd_optimizer.q.out 48e58e1bd9

>   ql/src/test/results/clientpositive/llap/parallel_colstats.q.out 95ed8b813a 
>   ql/src/test/results/clientpositive/llap/ptf.q.out fbaf1e6474 
>   ql/src/test/results/clientpositive/llap/ptf_streaming.q.out 6013c11c9e 
>   ql/src/test/results/clientpositive/llap/reduce_deduplicate_extended.q.out bc44db795a

>   ql/src/test/results/clientpositive/llap/semijoin_hint.q.out 76c985e727 
>   ql/src/test/results/clientpositive/llap/skewjoin.q.out dc79b26020 
>   ql/src/test/results/clientpositive/llap/subquery_exists.q.out 0749872253 
>   ql/src/test/results/clientpositive/llap/subquery_in.q.out 1d5f547adf 
>   ql/src/test/results/clientpositive/llap/subquery_multi.q.out 41b8ea3f25 
>   ql/src/test/results/clientpositive/llap/subquery_notin.q.out c760af2f87 
>   ql/src/test/results/clientpositive/llap/subquery_scalar.q.out 03d5f191b6 
>   ql/src/test/results/clientpositive/llap/subquery_select.q.out 4917c42b0e 
>   ql/src/test/results/clientpositive/llap/subquery_views.q.out b64e0f49c6 
>   ql/src/test/results/clientpositive/llap/sysdb.q.out fbbf8d9b7f 
>   ql/src/test/results/clientpositive/llap/tez_dml.q.out 786929e7af 
>   ql/src/test/results/clientpositive/llap/tez_dynpart_hashjoin_1.q.out 626a0640da 
>   ql/src/test/results/clientpositive/llap/tez_dynpart_hashjoin_2.q.out 273cf06e31 
>   ql/src/test/results/clientpositive/llap/tez_dynpart_hashjoin_3.q.out 04e0f86fe2 
>   ql/src/test/results/clientpositive/llap/tez_join_hash.q.out 92a188e18a 
>   ql/src/test/results/clientpositive/llap/tez_join_tests.q.out 1e1b63d95e 
>   ql/src/test/results/clientpositive/llap/tez_joins_explain.q.out 5297d7ecbc 
>   ql/src/test/results/clientpositive/llap/tez_smb_main.q.out 66d7aeca70 
>   ql/src/test/results/clientpositive/llap/tez_union.q.out c72b232b35 
>   ql/src/test/results/clientpositive/llap/tez_union2.q.out 7b45c7c719 
>   ql/src/test/results/clientpositive/llap/tez_union_multiinsert.q.out 5e0d072095 
>   ql/src/test/results/clientpositive/llap/tez_vector_dynpart_hashjoin_1.q.out a38e339012

>   ql/src/test/results/clientpositive/llap/tez_vector_dynpart_hashjoin_2.q.out f7c694bf84

>   ql/src/test/results/clientpositive/llap/unionDistinct_1.q.out 602d57625c 
>   ql/src/test/results/clientpositive/llap/union_top_level.q.out 268e0413cd 
>   ql/src/test/results/clientpositive/llap/varchar_udf1.q.out 8162305df6 
>   ql/src/test/results/clientpositive/llap/vector_groupby_mapjoin.q.out 16b716c4e5 
>   ql/src/test/results/clientpositive/llap/vector_left_outer_join.q.out 6118c73c2b 
>   ql/src/test/results/clientpositive/llap/vector_mapjoin_reduce.q.out 8df5a64fee 
>   ql/src/test/results/clientpositive/llap/vector_outer_join1.q.out d95604d722 
>   ql/src/test/results/clientpositive/llap/vector_outer_join2.q.out 16de760424 
>   ql/src/test/results/clientpositive/llap/vector_udf1.q.out 16edaacf94 
>   ql/src/test/results/clientpositive/llap/vectorization_0.q.out fba9c07350 
>   ql/src/test/results/clientpositive/llap/vectorization_8.q.out 0d5b6d53e0 
>   ql/src/test/results/clientpositive/llap/vectorization_short_regress.q.out 00577620d8

>   ql/src/test/results/clientpositive/llap/vectorized_distinct_gby.q.out c3e5f7c90d 
>   ql/src/test/results/clientpositive/llap/vectorized_dynamic_semijoin_reduction.q.out
9a1c44c3e6 
>   ql/src/test/results/clientpositive/llap/vectorized_dynamic_semijoin_reduction2.q.out
a03466f859 
>   ql/src/test/results/clientpositive/llap/vectorized_mapjoin.q.out 8160bc7c44 
>   ql/src/test/results/clientpositive/llap/vectorized_multi_output_select.q.out f744eb6513

>   ql/src/test/results/clientpositive/llap/vectorized_nested_mapjoin.q.out 28a8340a9c

>   ql/src/test/results/clientpositive/llap/vectorized_shufflejoin.q.out 73ab9fca82 
>   ql/src/test/results/clientpositive/llap/windowing_gby.q.out 2c47b8b2a6 
>   ql/src/test/results/clientpositive/parallel_colstats.q.out c85113137b 
>   ql/src/test/results/clientpositive/partial_column_stats.q.out 5876efacf3 
>   ql/src/test/results/clientpositive/partition_coltype_literals.q.out 3505556029 
>   ql/src/test/results/clientpositive/reduceSinkDeDuplication_pRS_key_empty.q.out 3ad09e815c

>   ql/src/test/results/clientpositive/remove_exprs_stats.q.out 33cf90ae9d 
>   ql/src/test/results/clientpositive/rename_external_partition_location.q.out ec4076f908

>   ql/src/test/results/clientpositive/rename_table_update_column_stats.q.out b3d6f039ac

>   ql/src/test/results/clientpositive/temp_table_display_colstats_tbllvl.q.out bd024a7ab1

>   ql/src/test/results/clientpositive/tez/explainanalyze_1.q.out 6602222ed7 
>   ql/src/test/results/clientpositive/tez/explainanalyze_2.q.out 0916565f0f 
>   ql/src/test/results/clientpositive/tez/explainanalyze_3.q.out e5c8d6c51e 
>   ql/src/test/results/clientpositive/tez/explainanalyze_4.q.out 9fbe8c5263 
>   ql/src/test/results/clientpositive/tez/explainanalyze_5.q.out b35e294813 
>   ql/src/test/results/clientpositive/tez/explainuser_3.q.out 65c9114b20 
>   ql/src/test/results/clientpositive/tez/hybridgrace_hashjoin_1.q.out 0a1e039cf1 
>   ql/src/test/results/clientpositive/tez/hybridgrace_hashjoin_2.q.out 6f5a3a96ca 
>   ql/src/test/results/clientpositive/tez/vectorization_limit.q.out afcae8c34c 
>   ql/src/test/results/clientpositive/tunable_ndv.q.out 6ae54b4927 
> 
> 
> Diff: https://reviews.apache.org/r/60753/diff/3/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message