Date: Tue, 1 Nov 2011 18:01:36 +0000 (UTC)
From: "jiraposter@reviews.apache.org (Commented) (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Message-ID: <2099753342.46763.1320170496469.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1268566642.34047.1319839115079.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141404#comment-13141404 ]

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2974
-----------------------------------------------------------

trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java

    The default can be true.

trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java

    nit: spelling

trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java

    More comments are needed here. It would be useful to describe when a binary search is performed.

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java

    This should not be hard-coded. If the user wanted HiveInputFormat, it should be
    HiveSortedInputFormat, and the same for CombineHiveSortedInputFormat.

    Do we need a new class, or can sorted be a property of the input format?
    Then it should automatically work for both HiveInputFormat and CombineHiveInputFormat.

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java

    Use the term index column instead of non-partition column.

    Who is using the function findNonPartitionFilterWork?
    It does not modify any internal structure, and the return value is not used.

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java

    I am confused: what if the filter contains multiple non-partition column predicates?

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java

    As mentioned before, it would be good if this also worked with CombineHiveInputFormat.

- namit


On 2011-10-29 01:39:50, Kevin Wilfong wrote:
bq.
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2605/
bq.  -----------------------------------------------------------
bq.
bq.  (Updated 2011-10-29 01:39:50)
bq.
bq.
bq.  Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
bq.
bq.
bq.  Summary
bq.  -------
bq.
bq.  The CompactIndexHandler determines whether the reentrant query it creates is a candidate for using the fact that the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the expected form). If so, it sets the input format to HiveSortedInputFormat and marks the FilterOperator for the non-partition condition.
bq.
bq.  HiveSortedInputFormat extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader. HiveBinarySearchRecordReader starts by assuming it is performing a binary search. It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader. The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant. It also provides the type of the generic UDF. It sets this data in the IOContext. As long as the binary search continues, the FilterOperators do not forward rows to the operators below them. The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one. The search then proceeds linearly from the beginning of the identified block. If a problem occurs at any point in the binary search, such as the comparison failing for some reason, a linear search begins from the start of the data that has not yet been eliminated.
bq.
bq.  Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.
bq.
bq.
bq.  This addresses bug HIVE-2535.
bq.      https://issues.apache.org/jira/browse/HIVE-2535
bq.
bq.
bq.  Diffs
bq.  -----
bq.
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507
bq.    trunk/conf/hive-default.xml 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION
bq.    trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION
bq.    trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION
bq.
bq.  Diff: https://reviews.apache.org/r/2605/diff
bq.
bq.
bq.  Testing
bq.  -------
bq.
bq.  I added a test to verify the functionality of the HiveBinarySearchRecordReader.
bq.
bq.  I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored as a text file, with all of the supported operators.
bq.
bq.  I ran the .q files to verify they still pass.
bq.
bq.  I ran some queries to verify there was a CPU benefit to doing this. I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set.
bq.
bq.
bq.  Thanks,
bq.
bq.  Kevin
bq.
bq.



> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >, >=, <, <=, =, we could use a binary search (if necessary) to find the block at which to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.
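
The idea in the issue description (binary search to find the starting block, then a linear scan that stops as soon as the comparison shows no further row can match) can be illustrated with a small, self-contained Java sketch. This is not the code from the patch and does not use any Hive or RCFile API: the class SortedBlockScan, its methods, and the modelling of blocks as non-empty sorted long arrays are all hypothetical, and only the = operator is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT Hive's HiveBinarySearchRecordReader.
// Assumes: keys are sorted ascending across all blocks, and every block is non-empty.
final class SortedBlockScan {

    // Binary search over blocks: returns the index of the last block whose
    // first key is strictly less than target, or 0 if there is no such block.
    // Matching rows can only begin in that block or a later one.
    static int findStartBlock(long[][] blocks, long target) {
        int lo = 0;
        int hi = blocks.length - 1;
        int start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (blocks[mid][0] < target) {
                start = mid;   // candidate; a later block may still qualify
                lo = mid + 1;
            } else {
                hi = mid - 1;  // this block starts at or past target
            }
        }
        return start;
    }

    // Linear scan for "key = target" starting from the block chosen above,
    // ending early once a key greater than target is seen.
    static List<Long> scanEquals(long[][] blocks, long target) {
        List<Long> matches = new ArrayList<>();
        for (int b = findStartBlock(blocks, target); b < blocks.length; b++) {
            for (long key : blocks[b]) {
                if (key > target) {
                    return matches; // sorted input: no later row can match
                }
                if (key == target) {
                    matches.add(key);
                }
            }
        }
        return matches;
    }
}
```

The search deliberately picks the last block whose first key is strictly below the constant rather than the first block that starts with it, because duplicates of the constant may begin partway through the preceding block. Range operators would change only the starting point and the early-exit condition.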