Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Thu, 10 Aug 2017 22:03:00 +0000 (UTC)
From: "Padma Penumarthy (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.13091256.1501522257000.4870.1502402580664@Atlassian.JIRA>
In-Reply-To: <JIRA.13091256.1501522257000@Atlassian.JIRA>
References: <JIRA.13091256.1501522257000@Atlassian.JIRA> <JIRA.13091256.1501522257990@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (DRILL-5697) Improve performance of filter
 operator for pattern matching
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Thu, 10 Aug 2017 22:03:06 -0000


    [ https://issues.apache.org/jira/browse/DRILL-5697?page=3Dcom.atlassian=
.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D1612=
2402#comment-16122402 ]=20

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:02 PM:
-------------------------------------------------------------------

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all AS=
CII, we directly read the byte as character from PlatformDependent. Else, w=
e decode UTF-8 bytes, copy them to charBuffer and read characters from that=
.=20
2. regex matching is done on this charSequenceWrapper, which provides charA=
t functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'=20
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a%'
1m 6 sec


For all ASCII, since getByte is doing a bounds check every time we call it,=
 I want to see if getting the bytes  in one shot is better. That did not he=
lp much with performance. In fact, it made it worse for 'a%' type of  match=
.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by m=
uch.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like =E2=80=98%a%=E2=80=99;
32 sec (vs 1min 6s baseline)


Next, I tried building charBuffer always (even if it is all ASCII) and use =
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like =E2=80=98%a%=E2=80=99
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like =E2=80=98a%=E2=80=99
46  (vs 9.794s baseline)


I tried Google RE2 library. Got much worse numbers than what we are getting=
 with Java Regex Library.


Finally, I implemented simple character by character comparison functions f=
or each of the special cases=20
and got pretty good numbers.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all AS=
CII, we directly read the byte as character from PlatformDependent. Else, w=
e decode UTF-8 bytes, copy them to charBuffer and read characters from that=
.=20
2. regex matching is done on this charSequenceWrapper, which provides charA=
t functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'=20
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a%'
1m 6 sec


For all ASCII, since getByte is doing a bounds check every time we call it,=
 I want to see if getting the bytes  in one shot is better. That did not he=
lp much with performance. In fact, it made it worse for 'a%' type of  match=
.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by m=
uch.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like =E2=80=98%a%=E2=80=99;
32 sec (vs 1min 6s baseline)


Next, I tried building charBuffer always (even if it is all ASCII) and use =
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like =E2=80=98%a%=E2=80=99
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like =E2=80=98a%=E2=80=99
46  (vs 9.794s baseline)


I tried Google RE2 library. Got much worse numbers than what we are getting=
 with Java Regex Library.


Finally, I implemented simple character by character comparison functions f=
or each of the special cases=20
and got pretty good numbers.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_=
comment like '%a%'
11.34s (vs. 1m 6s baseline)


> Improve performance of filter operator for pattern matching
> -----------------------------------------------------------
>
>                 Key: DRILL-5697
>                 URL: https://issues.apache.org/jira/browse/DRILL-5697
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Flow
>    Affects Versions: 1.11.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>
> Queries using filter with sql like operator use Java regex library for pa=
ttern matching. However, for cases like %abc (ends with abc), abc% (starts =
with abc), %abc% (contains abc), it is observed that implementing these cas=
es with simple code instead of using regex library provides good performanc=
e boost (4-6x). Idea is to use special case code for simple, common cases a=
nd fall back to Java regex library for complicated ones. That will provide =
good performance benefit for most common cases.


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)