pig-dev mailing list archives

From "Ankit Modi (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
Date Tue, 08 Dec 2009 07:04:18 GMT

     [ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Modi updated PIG-965:
---------------------------

    Attachment: poregex2.patch
                poregex.patch

These are patches for two implementations.

The first (poregex.patch) applies the optimizations described in the JIRA issue.
The second (poregex2.patch) applies optimization 1 and uses dk.brics.automaton
to run simple regular expressions; otherwise it falls back to java.util.regex.

In the first patch, the getSimpleString method decides whether to use optimization 2
or java.util.regex.

In the second patch, determineBestRegexMethod decides whether to use dk.brics.automaton.
(The changes to build.xml in this patch are temporary.)

Both patches use RegexInit as the initial implementation: it calls the decision function
mentioned above and then sets the implementation to the one that function chose.
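
To illustrate the idea of an initial implementation that replaces itself, here is a
minimal sketch. It is not the code from the attached patches: every name in it
(RegexImpl, InitImpl, chooseImpl and so on) is hypothetical, the decision function is
a deliberately crude stand-in, and it assumes the rhs pattern stays constant after the
first call.

```java
import java.util.regex.Pattern;

// Hypothetical names throughout; this is not the code from the patches.
interface RegexImpl {
    boolean match(String lhs, String rhs);
}

// Fallback: full java.util.regex matching with a pattern compiled once.
class JavaRegexImpl implements RegexImpl {
    private final Pattern pattern;
    JavaRegexImpl(String regex) { this.pattern = Pattern.compile(regex); }
    public boolean match(String lhs, String rhs) {
        return pattern.matcher(lhs).matches();
    }
}

// Fast path: the pattern has no metacharacters, so equals() is enough.
class LiteralImpl implements RegexImpl {
    private final String literal;
    LiteralImpl(String literal) { this.literal = literal; }
    public boolean match(String lhs, String rhs) {
        return lhs.equals(literal);
    }
}

class RegexOperator {
    // Starts out as the "init" implementation, which swaps itself out.
    private RegexImpl impl = new InitImpl();

    boolean match(String lhs, String rhs) { return impl.match(lhs, rhs); }

    String implName() { return impl.getClass().getSimpleName(); }

    // On the first call: decide, install the chosen implementation, delegate.
    private class InitImpl implements RegexImpl {
        public boolean match(String lhs, String rhs) {
            impl = chooseImpl(rhs);
            return impl.match(lhs, rhs);
        }
    }

    // Stand-in decision function: any regex metacharacter forces the fallback.
    static RegexImpl chooseImpl(String regex) {
        for (char c : regex.toCharArray()) {
            if ("\\[](){}.*+?^$|".indexOf(c) >= 0) {
                return new JavaRegexImpl(regex);
            }
        }
        return new LiteralImpl(regex);
    }
}
```

With this shape the decision cost is paid once, on the first call; later calls go
straight to the chosen implementation.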

In the second patch, I wrote the decision function by looking at which operators
dk.brics.automaton supports and at its grammar. I tried out which classes are and are not
supported in dk.brics.automaton and based the decision on the results.

I could not find any page documenting the differences between the regex languages of
java.util.regex and dk.brics.automaton.

> PERFORMANCE: optimize common case in matches (PORegex)
> ------------------------------------------------------
>
>                 Key: PIG-965
>                 URL: https://issues.apache.org/jira/browse/PIG-965
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Thejas M Nair
>            Assignee: Ankit Modi
>         Attachments: poregex.patch, poregex2.patch
>
>
> Some frequently seen use cases of the 'matches' comparison operator have the following
> properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%',
> '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of the Hive 'like' clause uses similar optimizations.
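
The two optimizations proposed in the issue can be sketched as follows. This is only
an illustration under stated assumptions, not the code from either patch: the class
and method names are hypothetical, and it assumes the '%' in the examples stands for
the regex wildcard '.*'. Because '.' does not match line terminators by default in
java.util.regex, inputs containing them skip the string-comparison fast path.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the code from the patches.
class CachedMatcher {
    private String lastRegex;     // pattern string seen on the previous call
    private Pattern lastPattern;  // its compiled form, reused while unchanged
    int compileCount = 0;         // exposed only to make the caching observable

    boolean matches(String input, String regex) {
        // Optimization 2: reduce "lit.*", ".*lit", ".*lit.*" to string comparisons.
        boolean lead = regex.startsWith(".*");
        String rest = lead ? regex.substring(2) : regex;
        boolean trail = rest.endsWith(".*");
        String literal = trail ? rest.substring(0, rest.length() - 2) : rest;
        if (isLiteral(literal) && !hasLineTerminator(input)) {
            if (lead && trail) return input.contains(literal);
            if (lead) return input.endsWith(literal);
            if (trail) return input.startsWith(literal);
            return input.equals(literal);
        }
        // Optimization 1: recompile only when the pattern string has changed.
        if (!regex.equals(lastRegex)) {
            lastPattern = Pattern.compile(regex);
            lastRegex = regex;
            compileCount++;
        }
        return lastPattern.matcher(input).matches();
    }

    // True when the string contains no regex metacharacters.
    private static boolean isLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if ("\\[](){}.*+?^$|".indexOf(s.charAt(i)) >= 0) return false;
        }
        return true;
    }

    // Line terminators as defined by java.util.regex.Pattern.
    private static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == '\u0085'
                    || c == '\u2028' || c == '\u2029') return true;
        }
        return false;
    }
}
```

Keeping only the most recent pattern is enough for the common case named in the
issue, where the rhs is a constant string.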

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

