pig-dev mailing list archives

From "Michael Howard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4507) Problem with REGEX which just match for the first word
Date Wed, 15 Apr 2015 17:13:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496536#comment-14496536 ]

Michael Howard commented on PIG-4507:

My understanding is as follows:

You would like to use a regex to map each sequence of non-word characters to a single space,
so that you are left with a string of "words" separated by single space characters.
In other words, you want to map a string to a string.

The REGEX_EXTRACT_ALL function is designed to map a structured string into a tuple: it
extracts the fields from the string and returns them as a tuple/record. (REGEX_EXTRACT
does the same thing, only for a single field.) Part of the structure the string must have
is a fixed number of fields, one per capturing group in the regex. To the best of my
knowledge, there isn't any way to specify a variable number of groups in a regex.
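
To illustrate the fixed-group point, here is a minimal Java sketch. It assumes that Pig's regex functions are built on java.util.regex; the class and method names (GroupCountDemo, firstGroup, groupCount) are made up for the illustration and are not part of Pig. It shows that the number of groups is set by the pattern, not by how many words the input contains, so a single "(\\w+)" can only ever hand back one field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of Pig.
final class GroupCountDemo {

    // Returns the text captured by group 1 on the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // The number of capturing groups depends only on the pattern, never on the input.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String input = "toto,  likes ... to play ";
        System.out.println(groupCount("(\\w+)"));         // one group in the pattern
        System.out.println(firstGroup("(\\w+)", input));  // so only the first word is captured
    }
}
```

Capturing more words this way would mean writing one group per word, which only works when the number of words is known in advance.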

I don't think REGEX_EXTRACT_ALL is what you want to use. 

I suggest using the Pig REPLACE function instead of REGEX_EXTRACT_ALL. This
will allow you to replace each sequence of non-word chars with a single space. I think it
should be more-or-less like:

  REPLACE(dirty_string, '\\W+', ' ') AS clean_string
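
As a rough check of what that replacement does to the example string from the issue, here is a small Java sketch. It assumes REPLACE behaves like Java's String.replaceAll with the same regex; the class and method names (CleanStringDemo, clean) are hypothetical.

```java
// Hypothetical helper class for illustration only; not part of Pig.
final class CleanStringDemo {

    // Collapses every run of non-word characters into a single space.
    static String clean(String dirty) {
        return dirty.replaceAll("\\W+", " ");
    }

    public static void main(String[] args) {
        String dirty = "toto,  likes ... to play ";
        // Brackets make any leading or trailing space visible in the output.
        System.out.println("[" + clean(dirty) + "]");
        System.out.println("[" + clean(dirty).trim() + "]");
    }
}
```

Note that a run of non-word characters at the very start or end of the input also becomes a space, so the result may need trimming (in Pig, the TRIM function) if an exact "toto likes to play" is wanted.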

Good luck.

> Problem with REGEX which just match for the first word
> ------------------------------------------------------
>                 Key: PIG-4507
>                 URL: https://issues.apache.org/jira/browse/PIG-4507
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>         Environment: IBM Infosphere BigInsights v3.0.0.1
>            Reporter: Adrien Bidault
>   Original Estimate: 6h
>  Remaining Estimate: 6h
> I am trying to eliminate punctuation and special symbols from a string using REGEX of
> a type "(\\w+)". The problem is that this REGEX treatment is applied to the first word of
> the string only.
> Example:
> clean3 = FOREACH clean1 GENERATE id, REGEX_EXTRACT_ALL('toto,  likes ... to play ', '(\\w+)');
> It just returns "toto" instead of "toto likes to play"
> Would you guys have any ideas?

This message was sent by Atlassian JIRA
