drill-dev mailing list archives

From "salim achouche (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5879) Optimize "Like" operator
Date Mon, 16 Oct 2017 21:08:00 GMT
salim achouche created DRILL-5879:

             Summary: Optimize "Like" operator
                 Key: DRILL-5879
                 URL: https://issues.apache.org/jira/browse/DRILL-5879
             Project: Apache Drill
          Issue Type: Improvement
          Components: Execution - Relational Operators
         Environment: * 
            Reporter: salim achouche
            Assignee: salim achouche
            Priority: Minor
             Fix For: 1.12.0

Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';

Improvement Opportunity
# Avoid recomputing isAscii (which requires a full scan of the input string) when the same
column is evaluated by more than one LIKE predicate
# Optimize the "contains" for-loop

Implementation Detail
* Added a new integer variable "asciiMode" to the VarCharHolder class
* The default value is -1, which indicates that this information is not yet known
* Once computed, the value is set to either 1 or 0
* The execution plan already shares the same VarCharHolder instance for all evaluations of
the same column value
* The asciiMode will be correctly set during the first LIKE evaluation and will be reused
across other LIKE evaluations
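
The caching idea in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the Holder class is a hypothetical stand-in rather than Drill's actual VarCharHolder, the mapping 1 = ASCII and 0 = not ASCII is assumed (the description only says the value is 1 or 0), and the scans counter exists purely to show that the second evaluation skips the scan.

```java
import java.nio.charset.StandardCharsets;

final class AsciiModeSketch {
    /** Hypothetical stand-in for a VarChar holder; not Drill's actual VarCharHolder. */
    static final class Holder {
        byte[] buffer;
        int start;
        int end;
        // Assumed encoding: -1 = unknown, 0 = not ASCII, 1 = ASCII.
        int asciiMode = -1;
    }

    /** Counts full scans; present only to make the caching visible. */
    static int scans = 0;

    /** Builds a holder over the UTF-8 bytes of a string (helper for this sketch). */
    static Holder of(String s) {
        Holder h = new Holder();
        h.buffer = s.getBytes(StandardCharsets.UTF_8);
        h.start = 0;
        h.end = h.buffer.length;
        return h;
    }

    /** Scans the value at most once; later calls reuse the cached asciiMode. */
    static boolean isAscii(Holder h) {
        if (h.asciiMode == -1) {
            scans++;
            int mode = 1;
            for (int i = h.start; i < h.end; i++) {
                if (h.buffer[i] < 0) { // high bit set means a non-ASCII byte
                    mode = 0;
                    break;
                }
            }
            h.asciiMode = mode;
        }
        return h.asciiMode == 1;
    }

    public static void main(String[] args) {
        Holder h = of("carefully final deposits");
        boolean first = isAscii(h);   // performs the scan
        boolean second = isAscii(h);  // served from the cached flag
        System.out.println(first + " " + second + " scans=" + scans);
    }
}
```

The sketch only helps if both LIKE evaluations really receive the same holder instance, which is the property of the execution plan noted above.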

* The "Contains" LIKE operation is quite expensive because the code must read the input string
to perform character-based comparisons
* Created 4 versions of the same for-loop to a) make each loop simpler to optimize (vectorization)
and b) minimize comparisons
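
The description does not say how the four loop variants differ. One plausible split, shown here purely as an assumption, is by pattern length: dedicated loops for patterns of 1, 2, and 3 bytes plus a general loop for longer patterns. All names below are hypothetical and none of this is taken from the Drill source.

```java
import java.nio.charset.StandardCharsets;

final class ContainsSketch {
    /** Dispatches to a loop specialized for the pattern length (assumed split). */
    static boolean contains(byte[] in, int start, int end, byte[] pat) {
        switch (pat.length) {
            case 0:  return true; // empty pattern matches everything
            case 1:  return contains1(in, start, end, pat[0]);
            case 2:  return contains2(in, start, end, pat[0], pat[1]);
            case 3:  return contains3(in, start, end, pat[0], pat[1], pat[2]);
            default: return containsN(in, start, end, pat);
        }
    }

    // Single-byte pattern: one comparison per input byte, no inner loop.
    static boolean contains1(byte[] in, int start, int end, byte p0) {
        for (int i = start; i < end; i++) {
            if (in[i] == p0) return true;
        }
        return false;
    }

    // Two-byte pattern: the loop bound guarantees in[i + 1] is in range.
    static boolean contains2(byte[] in, int start, int end, byte p0, byte p1) {
        for (int i = start; i < end - 1; i++) {
            if (in[i] == p0 && in[i + 1] == p1) return true;
        }
        return false;
    }

    // Three-byte pattern: same shape, still free of an inner loop.
    static boolean contains3(byte[] in, int start, int end, byte p0, byte p1, byte p2) {
        for (int i = start; i < end - 2; i++) {
            if (in[i] == p0 && in[i + 1] == p1 && in[i + 2] == p2) return true;
        }
        return false;
    }

    // General case: check the first byte, then compare the rest of the pattern.
    static boolean containsN(byte[] in, int start, int end, byte[] pat) {
        final int n = pat.length;
        for (int i = start; i <= end - n; i++) {
            if (in[i] != pat[0]) continue;
            int j = 1;
            while (j < n && in[i + j] == pat[j]) j++;
            if (j == n) return true;
        }
        return false;
    }

    /** Convenience wrapper over UTF-8 bytes, for this sketch only. */
    static boolean contains(String input, String pattern) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        return contains(in, 0, in.length, pattern.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(contains("among the deposits", "the"));
        System.out.println(contains("among the deposits", "xyz"));
    }
}
```

The short-pattern loops have a fixed body with no inner loop, which is the kind of shape a JIT compiler can optimize more readily; whether that matches the actual change in DRILL-5879 is not confirmed by this message.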

* Lineitem table 100GB
* Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like
'%a%' or l_comment like '%the%' group by l_returnflag
* Before changes: 33 sec
* After changes: 27 sec

