Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Tue, 1 Nov 2011 18:01:36 +0000 (UTC)
From: "jiraposter@reviews.apache.org (Commented) (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: 
 <2099753342.46763.1320170496469.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1268566642.34047.1319839115079.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141404#comment-13141404 ]

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2974
-----------------------------------------------------------


trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/2605/#comment6684>

    The default can be true


trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6692>

    nit: spelling


trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6685>

    Please add more comments here.
    It would be useful to describe when a binary search is
    performed.


trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6694>

    This should not be hard-coded.
    If the user wanted HiveInputFormat, it should be
    HiveSortedInputFormat, and likewise CombineHiveSortedInputFormat
    for CombineHiveInputFormat.

    Do we need a new class, or can sorted be a
    property of the input format? Then it should automatically
    work for both HiveIF and CombineHiveIF.
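
One way to read this comment is that the sorted variant should be derived from whatever input format the user configured. The sketch below is only an illustration under that assumption: the class `SortedInputFormatChoiceSketch`, the method `chooseInputFormat`, and the flag `indexIsSorted` are hypothetical names, class names are handled as plain strings so the example runs without Hive, and `CombineHiveSortedInputFormat` does not exist in the patch under review.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: class names are plain strings so the sketch runs without Hive.
class SortedInputFormatChoiceSketch {

    // Maps each configured input format to its sorted counterpart.
    // CombineHiveSortedInputFormat is hypothetical; it is not part of the patch.
    static final Map<String, String> SORTED_VARIANTS = new HashMap<>();
    static {
        SORTED_VARIANTS.put("HiveInputFormat", "HiveSortedInputFormat");
        SORTED_VARIANTS.put("CombineHiveInputFormat", "CombineHiveSortedInputFormat");
    }

    /** Picks the input format for the index query from the user's configured one. */
    static String chooseInputFormat(String configured, boolean indexIsSorted) {
        if (!indexIsSorted) {
            return configured;
        }
        // Unknown formats are left untouched rather than replaced blindly.
        return SORTED_VARIANTS.getOrDefault(configured, configured);
    }
}
```

With this shape, a user who configured CombineHiveInputFormat keeps the combine behaviour. Making sortedness a property of a single input format class, as the second question suggests, would remove the need for the lookup table altogether.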


trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6695>

    Use the term "index column" instead of "non-partition column".

    Who is using the function findNonPartitionFilterWork?
    It does not modify any internal structure, and its
    return value is not used.


trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6696>

    I am confused: what if the filter contains multiple
    non-partition column predicates?


trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java
<https://reviews.apache.org/r/2605/#comment6698>

    As mentioned before, it would be good if this also works with CombineHiveIF


- namit


On 2011-10-29 01:39:50, Kevin Wilfong wrote:
bq.
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2605/
bq.  -----------------------------------------------------------
bq.
bq.  (Updated 2011-10-29 01:39:50)
bq.
bq.
bq.  Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
bq.
bq.
bq.  Summary
bq.  -------
bq.
bq.  The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.
bq.
bq.  The HiveSortedInputFormat extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.
bq.
bq.  Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.
bq.
bq.
bq.  This addresses bug HIVE-2535.
bq.      https://issues.apache.org/jira/browse/HIVE-2535
bq.
bq.
bq.  Diffs
bq.  -----
bq.
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507
bq.    trunk/conf/hive-default.xml 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION
bq.    trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION
bq.    trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION
bq.
bq.  Diff: https://reviews.apache.org/r/2605/diff
bq.
bq.
bq.  Testing
bq.  -------
bq.
bq.  I added a test to verify the functionality of the HiveBinarySearchRecordReader.
bq.
bq.  I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored as a text file, with all of the supported operators.
bq.
bq.  I ran the .q files to verify they still pass.
bq.
bq.  I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set.
bq.
bq.
bq.  Thanks,
bq.
bq.  Kevin
bq.
bq.
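
The block search described in the quoted summary can be sketched as follows. This is a simplified model and not the code from the patch: each `int[]` stands in for one RCFile block of index keys in ascending order, and the names `BlockBinarySearchSketch`, `findStartBlock`, and the `compare` parameter are hypothetical. It covers predicates that need a lower bound (such as `=` or `>=`) and shows the fallback to a linear scan from the data not yet eliminated when a comparison fails.

```java
import java.util.function.IntBinaryOperator;

// Simplified model, not Hive code: each int[] stands in for one RCFile block
// of index keys, and keys are in ascending order across all blocks.
class BlockBinarySearchSketch {

    /**
     * Returns the index of the block where a linear scan for target should
     * start: the last block whose first key is below target, or block 0.
     * If the comparison throws, the search stops and the scan starts from the
     * first block that has not been eliminated yet.
     */
    static int findStartBlock(int[][] blocks, int target, IntBinaryOperator compare) {
        int lo = 0;                 // blocks before lo are eliminated
        int hi = blocks.length - 1; // the start block is never after hi
        try {
            while (lo < hi) {
                int mid = lo + (hi - lo + 1) / 2; // round up so lo = mid makes progress
                if (compare.applyAsInt(blocks[mid][0], target) < 0) {
                    lo = mid;      // mid starts below target: start at mid or later
                } else {
                    hi = mid - 1;  // mid starts at or above target: start before mid
                }
            }
        } catch (RuntimeException e) {
            // Comparison failed: fall back to a linear scan from lo.
        }
        return lo;
    }

    public static void main(String[] args) {
        int[][] blocks = {{1, 2, 3}, {3, 3, 5}, {5, 7, 9}, {10, 12, 14}};
        System.out.println(findStartBlock(blocks, 5, Integer::compare)); // prints 1
    }
}
```

In the example, block 2 begins with 5, so an earlier block may still end with 5; the scan therefore starts at block 1, the last block whose first key is below the target.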


> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >, >=, <, <=, = we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.
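
The second half of the description, stopping the scan once no later row can pass the filter, might look like the sketch below. It assumes keys in ascending order held in a plain `int[]`; the names `SortedScanSketch`, `Op`, `canStop`, and `scan` are hypothetical and are not taken from the patch, and the `from` argument stands in for the position a binary search would supply.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model, not Hive code: a linear scan over ascending keys that
// stops as soon as the operator guarantees no later key can match.
class SortedScanSketch {

    enum Op { LT, LE, EQ, GE, GT }

    static final class Result {
        final List<Integer> rows = new ArrayList<>();
        int examined; // number of keys read before the scan ended
    }

    static boolean matches(Op op, int value, int constant) {
        switch (op) {
            case LT: return value < constant;
            case LE: return value <= constant;
            case EQ: return value == constant;
            case GE: return value >= constant;
            default: return value > constant; // GT
        }
    }

    /** True once no later key in ascending order can satisfy the predicate. */
    static boolean canStop(Op op, int value, int constant) {
        switch (op) {
            case LT: return value >= constant;
            case LE:
            case EQ: return value > constant;
            default: return false; // GE, GT: every later key still qualifies
        }
    }

    /** Scans from index from (for example, the position a binary search found). */
    static Result scan(int[] sortedAscending, int from, Op op, int constant) {
        Result r = new Result();
        for (int i = from; i < sortedAscending.length; i++) {
            int v = sortedAscending[i];
            r.examined++;
            if (canStop(op, v, constant)) {
                break;
            }
            if (matches(op, v, constant)) {
                r.rows.add(v);
            }
        }
        return r;
    }
}
```

For `<`, `<=`, and `=` the scan ends at the first key past the constant. For `>` and `>=` nothing can be skipped at the tail, so any saving comes from the binary search choosing a later starting point.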

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira