Date: Thu, 28 Apr 2011 22:00:03 +0000 (UTC)
From: "jiraposter@reviews.apache.org (JIRA)"
To: hive-dev@hadoop.apache.org
Message-ID: <2070543060.10006.1304028003211.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <682744632.68662.1303259285830.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HIVE-2121) Input Sampling By Splits

    [ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026721#comment-13026721 ]

jiraposter@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review605
-----------------------------------------------------------

trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java

    Talked to Siying offline. The check

        if (split instanceof Hadoop20Shims.InputSplitShim)

    is not needed; it can be replaced by an assert.
    Same in Hadoop20SShims.

    Otherwise looks good.

- namit

On 2011-04-28 08:32:17, Siying Dong wrote:
bq.
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.
bq.  (Updated 2011-04-28 08:32:17)
bq.
bq.  Review request for hive, Ning Zhang and namit jain.
bq.
bq.  Summary
bq.  -------
bq.
bq.  We need better input sampling to serve at least two purposes:
bq.  1. let users test their queries against a smaller data set
bq.  2. understand more about what the data look like without scanning the whole table.
bq.  A simple function that gives a subset of splits will help in those cases. It doesn't have to be strict sampling.
bq.
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits whose total size is at least n% of the original inputs.
bq.
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.
bq.  Diffs
bq.  -----
bq.
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
bq.    trunk/conf/hive-default.xml 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852
bq.    trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852
bq.    trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852
bq.    trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852
bq.
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.
bq.  Testing
bq.  -------
bq.
bq.  TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
bq.
bq.  Thanks,
bq.
bq.  Siying
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, HIVE-2121.4.patch, HIVE-2121.5.patch, HIVE-2121.6.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. understand more about what the data look like without scanning the whole table.
> A simple function that gives a subset of splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
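
For readers skimming this thread, here is a rough sketch of what "samples input splits with size at least n% of the original inputs" could look like. It is not the code from the HIVE-2121 patch: the class name SplitSampleSketch, the method sampleSplits, the choice to take splits in their given order, and the [0, 100] range check are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT the code from the HIVE-2121 patch.
// Class and method names are hypothetical.
class SplitSampleSketch {

    /**
     * Returns the indices of the splits kept for TABLESAMPLE(percent PERCENT).
     * Assumption: splits are taken in their given order until their combined
     * length reaches at least percent% of the total input length.
     */
    static List<Integer> sampleSplits(long[] splitLengths, double percent) {
        // Assumed range check; the real patch may validate differently.
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]: " + percent);
        }
        long total = 0;
        for (long len : splitLengths) {
            total += len;
        }
        // Smallest byte count that still covers at least percent% of the input.
        long target = (long) Math.ceil(total * percent / 100.0);

        List<Integer> kept = new ArrayList<>();
        long covered = 0;
        // Stop as soon as the kept splits cover the target size.
        for (int i = 0; i < splitLengths.length && covered < target; i++) {
            kept.add(i);
            covered += splitLengths[i];
        }
        return kept;
    }
}
```

Because whole splits are kept, the sample usually overshoots the requested percentage. With four splits of 64 units each and percent = 30, the target is 77 units, so the first two splits (128 units, 50% of the input) are kept. That matches the "doesn't have to be strict sampling" goal in the issue description.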