drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sachouche <...@git.apache.org>
Subject [GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Date Wed, 13 Dec 2017 23:04:28 GMT
GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1072

    DRILL-5879: Improved SQL Pattern Contains Performance

    **BACKGROUND**
    - JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal is to improve
the "Contains" pattern performance
    - [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task) was created
subsequently to avoid the ASCII / Unicode pre-processing overhead.
    - This pull-request addresses the algorithmic part of this functionality
    
    **ANALYSIS**
    - Contains has O(n*m) complexity
    - There are two ways to optimize the associated runtime
    1) Minimize the number of instructions, pipelining, and CPU stalls
    2) Use smarter algorithms (compared to the naive one); for example, the Boyer-Moore algorithm
(which is implemented by several popular Open Source tools such as grep)
    
    **IMPLEMENTATION**
    Our approach contains both suggestions (listed in the analysis)
    - Created five matchers that are chosen based on the pattern length
    - The first three, are based on pattern 1) and target patterns with length [1..4[ 
    - The forth one, has a similar runtime then the current implementation and targets patterns
with length [4..10[
    - We use the the Boyer-Moore algorithm for patterns with a length larger than 9 bytes
    NOTE - the JDK doesn't use this algorithm because of two main  reasons
    - Two extra arrays are necessary with a size relative to the supported character-set and
the pattern length (this would be particularly costly for unicode as this would require 64k
entries)
    - Each Contains (or indexOf) invocation would require memory and initialization overhead
    - Drill doesn't have this issue as a) initialization overhead is amortized since the pattern
will be matched against many input values and b) our Contains logic is centered around bytes
so the memory overhead is around 256 integers per fragment
    
    **PERFORMANCE IMPROVEMENTS**
    - We observe at least 25% improvement per Contains operation for matchers with pattern
length lower than 4; 100% for negative cases (as the code never accesses a secondary for-loop)
    - It was noticed the Boyer-Moor algorithm performs poorly for small patterns as lookup
accesses erase the associated optimizations; this algorithm performs extremely well when the
pattern length increases as unlike the naive implementation it is able to use the lookup table
to jump beyond the next character (since the matching phase has already gained so insight).
    - One popular introduction to this algorithm is the following use-case
    - Input: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    - Pattern: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaax
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-5879

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1072
    
----
commit 50ac5140420057692cac8a33f181eb16580a28e6
Author: Salim Achouche <sachouche2@gmail.com>
Date:   2017-12-13T22:24:45Z

    DRILL-5879: Improved SQL Pattern Contains Performance

----


---

Mime
View raw message