drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Padma Penumarthy (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching
Date Thu, 10 Aug 2017 22:07:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:06 PM:
-------------------------------------------------------------------

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, we directly
read the byte as character from PlatformDependent. Else, we decode UTF-8 bytes, copy them
to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt functionality
as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'

1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’;
32 sec (vs 1min 6s baseline)


*Next, I tried building charBuffer always (even if it is all ASCII) and use String functions
startsWith, endsWith and contains. Numbers are better. But, not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘a%’
46  (vs 9.794s baseline)


*I tried Google RE2 library. Got much worse numbers than what we are getting with Java Regex
Library.*


* Finally, I implemented simple character by character comparison functions for each of the
special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, we directly
read the byte as character from PlatformDependent. Else, we decode UTF-8 bytes, copy them
to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt functionality
as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'

1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’;
32 sec (vs 1min 6s baseline)


*Next, I tried building charBuffer always (even if it is all ASCII) and use String functions
startsWith, endsWith and contains. Numbers are better. But, not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘%a%’
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like ‘a%’
46  (vs 9.794s baseline)


*I tried Google RE2 library. Got much worse numbers than what we are getting with Java Regex
Library.


Finally, I implemented simple character by character comparison functions for each of the
special cases 
and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%'
11.34s (vs. 1m 6s baseline)













> Improve performance of filter operator for pattern matching
> -----------------------------------------------------------
>
>                 Key: DRILL-5697
>                 URL: https://issues.apache.org/jira/browse/DRILL-5697
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Flow
>    Affects Versions: 1.11.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>
> Queries using filter with sql like operator use Java regex library for pattern matching.
However, for cases like %abc (ends with abc), abc% (starts with abc), %abc% (contains abc),
it is observed that implementing these cases with simple code instead of using regex library
provides good performance boost (4-6x). Idea is to use special case code for simple, common
cases and fall back to Java regex library for complicated ones. That will provide good performance
benefit for most common cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message