hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Zhang" <nzh...@fb.com>
Subject Re: Review Request: Input Sampling Splits
Date Tue, 26 Apr 2011 22:16:03 GMT


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
> >
> >     I feel like it is a little hard to explain what this sample guarantees. It basically
only guarantees that we fetch at least the sampled percentage of source data. Not exact number,
nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some
ways. How do you think?
> 
> Ning Zhang wrote:
>     I think if we specify clearly the semantics of block-level sample in the wiki/documentation,
there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based
sampling. In addition if the user has confusions about the semantics, throwing an SemanticException
won't help them understand.  I think the only use case for this parameter is to act as a gatekeeper
to this feature if we found a bug in it and want to disable the feature quickly. That should
be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for
each feature, the conf will grow unnecessarily large quickly.
> 
> Siying Dong wrote:
>     OK. I'll remove this switch. We need to document and communicate very well to users.
People will easily misunderstand this.

Sounds good. Yes, we should clearly document it in the wiki by what it does and does not.



> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
> >
> >     limit can be combined with block sampling. Just this optimization for limit
doesn't make sense when users already sample the input data and we won't get much benefit.
> 
> Ning Zhang wrote:
>     I think combining these two still makes sense: 1) as you mentioned block sampling
is not limiting on the # of rows, but limit is. Combining these two allows the users to get
approximately N rows quickly. 2) this restriction makes an exception in terms of the query
language composition. From the language syntax, it is allowed and makes senses to combine
block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException
should be thrown only when there is a legitimate semantic error (e.g., the percentage number
is negative). If you feel that it is not a major use case and would rather do it in a follow-up
JIRA, we should document it in TODO and file a JIRA for it.
> 
> Siying Dong wrote:
>     I think it is the misunderstanding here. Limit works with split sampling well. No
exception will be thrown in any of those two combination and the result will be what we expected.
>     This condition only disabled the optimization that runs against a smaller data set
for some limit queries. With split sampling, user already specific what a percentage to sample,
there is no need to further run the query against a small subset of the inputs. From test
case split_sample.q, you can already see how they work together well.

oh I see. Thanks for the clarification!


- Ning


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------


On 2011-04-26 21:19:18, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-26 21:19:18)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have
to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits
with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852

>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852

>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION

>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message