pig-dev mailing list archives

From "Ankit Modi (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
Date Tue, 08 Dec 2009 07:04:18 GMT

     [ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:

    Attachment: poregex2.patch

These are patches for two implementations:

The first (poregex.patch) applies the optimizations described above in the JIRA.
The second (poregex2.patch) applies optimization 1 and uses dk.brics.automaton
to run simple regular expressions; for anything else it falls back to java.util.regex.

In the first patch, getSimpleString decides whether to use optimization 2 or java.util.regex.
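
As an illustration of this kind of decision, here is a minimal sketch, not the code
from poregex.patch. It assumes the wildcard is written as ".*" in java.util.regex
syntax, and the names SimpleMatcher, Kind and isLiteral are hypothetical. Because "."
does not match line terminators by default, the sketch uses the compiled pattern
whenever the input contains one, so the string comparisons give the same result as
the regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the class from poregex.patch.
final class SimpleMatcher {
    enum Kind { EQUALS, PREFIX, SUFFIX, CONTAINS, GENERAL }

    // Characters treated as special in java.util.regex (conservative list).
    private static final String META = "\\[](){}.*+?^$|";

    final Kind kind;
    final String literal;          // literal part for the fast paths, null for GENERAL
    private final Pattern pattern; // compiled once and reused for every input

    SimpleMatcher(String regex) {
        pattern = Pattern.compile(regex);
        boolean lead = regex.startsWith(".*");
        String rest = lead ? regex.substring(2) : regex;
        boolean trail = rest.endsWith(".*");
        String core = trail ? rest.substring(0, rest.length() - 2) : rest;
        if (!isLiteral(core)) {
            kind = Kind.GENERAL;
            literal = null;
        } else {
            literal = core;
            kind = lead ? (trail ? Kind.CONTAINS : Kind.SUFFIX)
                        : (trail ? Kind.PREFIX : Kind.EQUALS);
        }
    }

    // True if s contains no character from META, so it only matches itself.
    static boolean isLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (META.indexOf(s.charAt(i)) >= 0) return false;
        }
        return true;
    }

    // Line terminators recognised by java.util.regex in its default mode.
    private static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == 0x85 || c == 0x2028 || c == 0x2029) return true;
        }
        return false;
    }

    // Whole-string match, like String.matches, using a string comparison when possible.
    boolean matches(String input) {
        if (kind == Kind.GENERAL || hasLineTerminator(input)) {
            return pattern.matcher(input).matches();
        }
        switch (kind) {
            case EQUALS: return input.equals(literal);
            case PREFIX: return input.startsWith(literal);
            case SUFFIX: return input.endsWith(literal);
            default:     return input.contains(literal);
        }
    }
}
```

For example, new SimpleMatcher("abc.*") is classified as PREFIX and answers with
startsWith, while new SimpleMatcher("a[bc]+") is classified as GENERAL and always
goes through the compiled pattern.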

In the second patch, determineBestRegexMethod decides whether to use dk.brics.automaton.
(The changes to build.xml in this patch are temporary.)

Both patches use RegexInit as the initial implementation. It makes the decision (by calling
the decision functions mentioned above) and then sets the implementation to the one that
function selected.
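
The pattern can be sketched roughly as follows. This is an illustration under
assumptions, not the code from either patch: apart from RegexInit, every name here
(RegexImpl, CompiledRegexImpl, LiteralEqualsImpl, MatchOperator, decisionCount) is
hypothetical, and the decision function is a trivial stand-in for getSimpleString or
determineBestRegexMethod. The operator starts with RegexInit, which decides on the
first call and then replaces itself; the decision is repeated only if the pattern
string changes.

```java
import java.util.regex.Pattern;

// Hypothetical common interface for all matching strategies.
interface RegexImpl {
    boolean match(String input, String regex);
}

// General case: compile once, reuse for every input.
final class CompiledRegexImpl implements RegexImpl {
    private final Pattern pattern;
    CompiledRegexImpl(String regex) { pattern = Pattern.compile(regex); }
    public boolean match(String input, String regex) {
        return pattern.matcher(input).matches();
    }
}

// Fast path for a pattern with no regex metacharacters.
final class LiteralEqualsImpl implements RegexImpl {
    private final String literal;
    LiteralEqualsImpl(String literal) { this.literal = literal; }
    public boolean match(String input, String regex) {
        return input.equals(literal);
    }
}

// Initial implementation: decides once, then installs the chosen implementation.
final class RegexInit implements RegexImpl {
    private final MatchOperator owner;
    RegexInit(MatchOperator owner) { this.owner = owner; }

    public boolean match(String input, String regex) {
        RegexImpl chosen = choose(regex);
        owner.setImpl(chosen);      // later calls bypass RegexInit entirely
        owner.decisionCount++;
        return chosen.match(input, regex);
    }

    // Stand-in decision function (an assumption, not the patch's logic).
    static RegexImpl choose(String regex) {
        String meta = "\\[](){}.*+?^$|";
        for (int i = 0; i < regex.length(); i++) {
            if (meta.indexOf(regex.charAt(i)) >= 0) return new CompiledRegexImpl(regex);
        }
        return new LiteralEqualsImpl(regex);
    }
}

// Hypothetical stand-in for the operator that owns the current implementation.
final class MatchOperator {
    private RegexImpl impl = new RegexInit(this);
    private String lastRegex;
    int decisionCount = 0;          // only here to make the behaviour observable

    void setImpl(RegexImpl chosen) { impl = chosen; }

    boolean matches(String input, String regex) {
        if (!regex.equals(lastRegex)) {   // pattern string changed: decide again
            lastRegex = regex;
            impl = new RegexInit(this);
        }
        return impl.match(input, regex);
    }
}
```

With a constant rhs, the decision and any pattern compilation happen once, and every
later call goes straight to the selected implementation.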

In the second patch, I wrote the decision function by looking at the operators and grammar
that dk.brics.automaton supports. I tried out which classes are and are not supported in
dk.brics.automaton and based the decision on the results.

I could not find any page that documents the differences between the regex languages of
java.util.regex and dk.brics.automaton.

> PERFORMANCE: optimize common case in matches (PORegex)
> ------------------------------------------------------
>                 Key: PIG-965
>                 URL: https://issues.apache.org/jira/browse/PIG-965
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Thejas M Nair
>            Assignee: Ankit Modi
>         Attachments: poregex.patch, poregex2.patch
> Some frequently seen use cases of the 'matches' comparison operator have the following properties:
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to:
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive like clause uses similar optimizations.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
