hive-dev mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2121) Input Sampling By Splits
Date Thu, 28 Apr 2011 22:00:03 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026721#comment-13026721 ]

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review605
-----------------------------------------------------------



trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java
<https://reviews.apache.org/r/633/#comment1249>

    Talked to Siying offline.
    
    The check:
    
    if (split instanceof Hadoop20Shims.InputSplitShim)
    
    is not needed; it can be replaced by an assert.
    
    The same applies in Hadoop20SShims.
    
    
    Otherwise looks good
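
    A minimal sketch of what the suggested change could look like. The types
    below are hypothetical stand-ins, not the actual Hive shim classes, and the
    body of the guarded branch is assumed, since the diff itself is not quoted
    in this message.

```java
public class SplitCastSketch {
    // Hypothetical stand-in for the Hadoop InputSplit type.
    public interface InputSplit {
        long getLength();
    }

    // Hypothetical stand-in for Hadoop20Shims.InputSplitShim.
    public static final class InputSplitShim implements InputSplit {
        private final long length;

        public InputSplitShim(long length) {
            this.length = length;
        }

        @Override
        public long getLength() {
            return length;
        }
    }

    // Before: a runtime branch guards the cast and silently tolerates other types.
    public static long lengthWithCheck(InputSplit split) {
        if (split instanceof InputSplitShim) {
            return ((InputSplitShim) split).getLength();
        }
        return -1L; // assumed fallback, for illustration only
    }

    // After: the type is treated as an invariant. The assert is evaluated only
    // when the JVM runs with -ea; otherwise the cast alone enforces the type.
    public static long lengthWithAssert(InputSplit split) {
        assert split instanceof InputSplitShim
            : "unexpected split type: " + split.getClass().getName();
        return ((InputSplitShim) split).getLength();
    }
}
```

    With an assert, a wrong split type fails loudly in test runs instead of
    being skipped quietly.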


- namit


On 2011-04-28 08:32:17, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-28 08:32:17)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need better input sampling to serve at least two purposes:
bq.  1. let users test their queries against a smaller data set
bq.  2. understand what the data looks like without scanning the whole table.
bq.  A simple function that returns a subset of the splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff adds the syntax .. table TABLESAMPLE(n PERCENT), which samples input splits whose total size is at least n% of the original input.
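
One way to read "at least n% of the original inputs" is to keep taking splits until their cumulative size reaches the target. The sketch below illustrates that reading only; the class and method names are hypothetical, the range check is an assumption, and the patch's real selection logic is not shown in this message.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSampleSketch {
    /**
     * Returns a prefix of splitSizes whose summed size is at least
     * percent% of the total size of all splits.
     */
    public static List<Long> sample(List<Long> splitSizes, double percent) {
        // Assumed validation; the actual accepted range is not stated here.
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]: " + percent);
        }
        long total = 0;
        for (long size : splitSizes) {
            total += size;
        }
        double target = total * percent / 100.0;

        List<Long> picked = new ArrayList<>();
        long accumulated = 0;
        for (long size : splitSizes) {
            if (accumulated >= target) {
                break; // enough input has been selected
            }
            picked.add(size);
            accumulated += size;
        }
        return picked;
    }
}
```

Because whole splits are taken, the selected input can overshoot n%, which matches the summary's point that this does not have to be strict sampling.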
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.    trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 
bq.    trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 
bq.    trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.



> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. understand what the data looks like without scanning the whole table.
> A simple function that returns a subset of the splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
