pig-dev mailing list archives

From "Eyal Allweil (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4803) Improve performance of regex-based builtin functions
Date Wed, 10 Feb 2016 18:17:18 GMT

     [ https://issues.apache.org/jira/browse/PIG-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated PIG-4803:
    Attachment: PIG-4803.patch

This patch gives REPLACE a compiled-pattern caching strategy similar to that of REGEX_EXTRACT.

Because the existing implementation swallows exceptions and only logs a warning, my
implementation does the same (although its warnings are different and more informative).

I changed TestStringUDFs.TestReplace slightly to cover the new code.
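
For readers without the attachment at hand, the caching idea can be sketched roughly as follows. This is not the code in PIG-4803.patch: the class name CachingReplace, the compileCount counter, and the warning text are assumptions made for illustration, and the class deliberately does not extend Pig's EvalFunc so that it stays self-contained.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a REPLACE-style UDF; not the actual Pig class or patch.
class CachingReplace {
    private String cachedRegex;    // regex string used on the most recent successful compile
    private Pattern cachedPattern; // compiled form of cachedRegex
    int compileCount = 0;          // exposed only so the effect of caching can be observed

    // Replaces every match of regex in source, or returns null on null or invalid input.
    String exec(String source, String regex, String replacement) {
        if (source == null || regex == null || replacement == null) {
            return null;
        }
        try {
            // Recompile only when the regex differs from the cached one.
            if (!regex.equals(cachedRegex)) {
                cachedPattern = Pattern.compile(regex);
                cachedRegex = regex;
                compileCount++;
            }
            return cachedPattern.matcher(source).replaceAll(replacement);
        } catch (RuntimeException e) {
            // Mirror the swallow-and-warn behaviour described above (assumed wording).
            System.err.println("Replace failed for regex '" + regex + "': " + e.getMessage());
            return null;
        }
    }
}
```

With a cache like this, a regex that stays the same from row to row is compiled once; a regex that changes on every row still recompiles every time.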

> Improve performance of regex-based builtin functions
> ----------------------------------------------------
>                 Key: PIG-4803
>                 URL: https://issues.apache.org/jira/browse/PIG-4803
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>         Attachments: PIG-4803.patch
> There are three strategies used by Pig's regex-based built-in functions.
> 1) REPLACE doesn't do any pattern caching.
> 2) REGEX_EXTRACT and REGEX_EXTRACT_ALL attempt to cache a single pattern per instance.
> 3) PluckTuple attempts to cache a single pattern statically. (Doesn't this cause problems
> if two clashing defines for different PluckTuples are used?)
> I have a little fix and a medium fix in mind. The little fix is to give REPLACE a similar
> caching strategy, and to fix PluckTuple if the static nature of the pattern is indeed a problem.
> The medium fix is to give all four functions an additional constructor that takes a constant
> regex (and therefore one less argument in evaluation) and to use that regex if it exists. This
> would be backwards compatible and should barely (or not at all) affect the performance of the
> existing code path, but I think that in cases where there are two clashing usages of the
> functions in the same foreach..generate it would allow the pattern caching to work.
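
The "medium fix" in the description above might look roughly like this. It is only a sketch under assumptions: ConstantRegexReplace is a hypothetical name, the two-argument exec is an illustrative signature rather than Pig's API, and the class does not extend EvalFunc so that it compiles on its own.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a REPLACE-style function with an optional constant regex.
class ConstantRegexReplace {
    private final Pattern constantPattern; // null when built with the no-arg constructor
    private String cachedRegex;            // single-slot cache for the existing code path
    private Pattern cachedPattern;

    // Existing behaviour: the regex arrives as an argument on every call.
    ConstantRegexReplace() {
        this.constantPattern = null;
    }

    // New behaviour: the regex is fixed at construction and compiled exactly once.
    ConstantRegexReplace(String regex) {
        this.constantPattern = Pattern.compile(regex);
    }

    // Existing three-argument path, unchanged apart from the single-slot cache.
    String exec(String source, String regex, String replacement) {
        if (!regex.equals(cachedRegex)) {
            cachedPattern = Pattern.compile(regex);
            cachedRegex = regex;
        }
        return cachedPattern.matcher(source).replaceAll(replacement);
    }

    // New two-argument path: one less argument, no per-call cache check.
    String exec(String source, String replacement) {
        if (constantPattern == null) {
            throw new IllegalStateException("no constant regex was given at construction");
        }
        return constantPattern.matcher(source).replaceAll(replacement);
    }
}
```

Because each instance built with a constant regex owns its own compiled pattern, two usages with different regexes no longer compete for a single cache slot, provided each usage gets its own instance.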

This message was sent by Atlassian JIRA