hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-738) Regexp passed from pigscript fails in UDF
Date Tue, 22 Sep 2009 01:30:16 GMT

     [ https://issues.apache.org/jira/browse/PIG-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-738:
-------------------------------

    Attachment: PIG-738.patch

The attached patch makes define statements treat string literals the same way they are
treated elsewhere in a Pig script. So www\\..xyz\\.com will be interpreted as a Java string
literal, and the UDF will receive a regex pattern in which the '.' is escaped by a single
backslash, so that it is not treated as the match-any-character metacharacter. The extra
backslash in the script escapes the backslash itself.
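
To illustrate the intended behaviour, here is a minimal Java sketch of interpreting the
characters between the quotes of a define argument as a Java string literal. It is not the
code in PIG-738.patch: the class EscapeDemo and the helper unescape are hypothetical names,
and only a few escape sequences are handled.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper (not taken from the patch): resolves backslash
    // escapes in the raw characters of a quoted script string, roughly the
    // way a Java string literal would be read.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    default:  out.append(next); // covers \\ and \'
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw characters as typed in the script: www\\.yahoo\\.com/sports
        String raw = "www\\\\.yahoo\\\\.com/sports";
        // After unescaping, each pair of backslashes collapses to one,
        // so the regex engine sees an escaped dot: www\.yahoo\.com/sports
        String pattern = unescape(raw);
        System.out.println("Pattern passed to the UDF: " + pattern);
        System.out.println("Equality test "
                + pattern.equals("www\\.yahoo\\.com/sports"));
        System.out.println("Matches literal dots only: "
                + !Pattern.compile(pattern).matcher("wwwJyahooMcom/sports").find());
    }
}
```

With this interpretation the equality test in the reporter's UDF would succeed, because
both sides then hold single backslashes.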

> Regexp passed from pigscript fails in UDF  
> -------------------------------------------
>
>                 Key: PIG-738
>                 URL: https://issues.apache.org/jira/browse/PIG-738
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt
>    Affects Versions: 0.3.0
>            Reporter: Viraj Bhat
>            Assignee: Pradeep Kamath
>             Fix For: 0.6.0
>
>         Attachments: myregexp.jar, PIG-738.patch, RegexGroupCount.java, regexp.pig, regexpinput.txt
>
>
> Consider a Pig script that counts matches of a regular expression in a text file.
> The regular expression supplied in the Pig script needs to escape the "." (dot) character.
> {code}
> register myregexp.jar;
> -- pattern not picked up
> define minelogs ci_pig_udfs.RegexGroupCount('www\\.yahoo\\.com/sports');
> A = load '/user/viraj/regexpinput.txt'  using PigStorage() as (source : chararray);
> B = foreach A generate minelogs(source) as sportslogs;
> dump B;
> {code}
> Snippet of UDF RegexGroupCount.java
> {code}
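> import java.io.IOException;
> import java.util.regex.Pattern;
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.Tuple;
>
> public class RegexGroupCount extends EvalFunc<Integer> {
>     private final Pattern pattern_;
>
>     public RegexGroupCount(String patternStr) {
>         // Debug output: shows the constructor argument exactly as Pig passes it in.
>         System.out.println("My pattern supplied is " + patternStr);
>         System.out.println("Equality test " + patternStr.equals("www\\.yahoo\\.com/sports"));
>         pattern_ = Pattern.compile(patternStr, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
>     }
>
>     public Integer exec(Tuple input) throws IOException {
>         // Body not included in the original snippet.
>     }
> }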
> {code}
> Running the above script on the following dataset:
> ====================================================================================================
> dshfdskfwww.yahoo.com/sportsjoadfjdslpdshfdskfwww.yahoo.com/sportsjoadfjdsl
> kas;dka;sd
> jsjsjwww.yahoo.com/sports
> jsdLSJDcom/sports
> wwwJyahooMcom/sports
> ====================================================================================================
> Results in the following:
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> Userfunc: (Name: UserFunc viraj-Sat Mar 28 02:06:31 PDT 2009-14 function: ci_pig_udfs.RegexGroupCount('www\\.yahoo\\.com/sports') Operator Key: viraj-Sat Mar 28 02:06:31 PDT 2009-14)
> Userfunc fs: int
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> My pattern supplied is www\\.yahoo\\.com/sports
> Equality test false
> 2009-03-28 02:06:43,923 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-03-28 02:06:43,923 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> (0)
> (0)
> (0)
> (0)
> (0)
> ====================================================================================================
> In essence there seems to be no way of passing this type of constructor argument through
> the Pig script. The only workaround seems to be hard coding the values in the UDF!!
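
The zero counts reported in the issue can be reproduced outside Pig with a small Java
sketch. It assumes that the elided exec() body counts non-overlapping matches per input
line; the class RegexCountDemo and the method countMatches are hypothetical stand-ins, not
the reporter's code. When the pattern arrives with both backslashes intact, the regex
requires a literal backslash followed by any character, which none of the input lines
contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCountDemo {
    // Assumed stand-in for the elided exec() body: counts non-overlapping
    // matches of the pattern in one line, using the flags from the UDF.
    static int countMatches(String patternStr, String line) {
        Pattern p = Pattern.compile(patternStr,
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {
            "dshfdskfwww.yahoo.com/sportsjoadfjdslpdshfdskfwww.yahoo.com/sportsjoadfjdsl",
            "kas;dka;sd",
            "jsjsjwww.yahoo.com/sports",
            "jsdLSJDcom/sports",
            "wwwJyahooMcom/sports"
        };
        // Characters the UDF received: www\\.yahoo\\.com/sports
        String received = "www\\\\.yahoo\\\\.com/sports";
        // Characters the reporter intended: www\.yahoo\.com/sports
        String intended = "www\\.yahoo\\.com/sports";
        for (String line : lines) {
            System.out.println("received=" + countMatches(received, line)
                    + " intended=" + countMatches(intended, line));
        }
    }
}
```

Under these assumptions the received pattern yields 0 for every line, matching the (0)
rows above, while the intended pattern yields 2, 0, 1, 0, 0.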

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.