hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "jiraposter@reviews.apache.org (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes
Date Mon, 14 Nov 2011 20:10:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149864#comment-13149864
] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/
-----------------------------------------------------------

(Updated 2011-11-14 20:10:27.544678)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Changes
-------

This diff is in response to Yongqiang's comments.

He pointed out that if the ExprNode in a FilterOperator is a tree, rather than a single operator,
the optimizations will not be applied.  This was fixed by marking the FilterOperator, and
the ExprNode in the tree, where the index column is compared.  When the FilterOperator needs
to execute a comparison, it traverses the tree of ExprNodes until it identifies the marked
ExprNode which has the context necessary to perform the comparison.

There was also an issue where CombineHiveRecordReaders were not always being marked as sorted
when they should.  This was fixed by marking them as sorted as part of the initIOContext call,
instead of as part of the HiveContextAwareRecordReader constructor.

Instead of determining if the input format class supports making use of sorted input by string
comparison of the class name in the CompactIndexHandler, I check whether or not the class
extends HiveInputFormat.  Currently this is sufficient as all of the InputFormats introduced
in Hive extend this, and I ensured that all such InputFormats do indeed support making use
of sorted input.


Summary
-------

The CompactIndexHandler determines if the reentrant query it creates is a candidate for using
the fact the index is sorted (it has an appropriate number of non-partition conditions, and
the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat,
and marks the FilterOperator for the non-partition condition.

The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a
single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader
starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext,
which acts as the means of communication between the FilterOperators and the record reader.
 The non-partition FilterOperator is responsible for executing a comparison between the value
in the row and column of interest and the constant.  It also provides the type of the generic
UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators
do not forward rows to the operators below them.  The record reader uses the comparison and
the type of the generic UDF to execute a binary search on the underlying RCFile until it finds
the block of interest, or determines that if any block is of interest it is the last one.
 The search then proceeds linearly from the beginning of the identified block.  If ever in
the binary search a problem occurs, like the comparison fails for some reason, a linear search
begins from the beginning of the data which has yet to be eliminated.

Regardless of whether or not a binary search is performed, the record reader attempts to end
the linear search as soon as it can based on the comparison and the type of the generic UDF.


This addresses bug HIVE-2535.
    https://issues.apache.org/jira/browse/HIVE-2535


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
  trunk/conf/hive-default.xml 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507

  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507

  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java 1183507

  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507

  trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java
PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION

  trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION


Diff: https://reviews.apache.org/r/2605/diff


Testing
-------

I added a test to verify the functionality of the HiveBinarySearchRecordReader.

I also added a .q file to test that this returns the correct results when the underlying index
is stored in an RCFile and when it is stored in as a text file, with all of the supported
operators.

I ran the .q files to verify they still pass.

I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45%
reduction in the total CPU used by the map reduce job to scan the index, for a large data
set. 


Thanks,

Kevin


                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact
when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used
to access the index consists of only one non-partition condition using one of the operators
>,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin
scanning for unfiltered rows, and we could use the result of comparing the value in the column
with the constant (this is necessarily the form of a predicate which is optimized using an
index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

Mime
View raw message