pig-user mailing list archives

From "Brian Adams" <Brian.Ad...@chacha.com>
Subject RE: Regex Match Tagger UDF?
Date Mon, 06 Dec 2010 21:15:09 GMT
That is an interesting approach. I like it. Not ideal, but I think it could work for what I
am doing.

In general I think this would be useful to the community, and you should put it on GitHub.
By all means, I would love to use it.

I think I could extend/fork this for my need.
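
For reference, the regex variant from the original question could be sketched roughly as below. This is plain Python rather than a Pig UDF, and the rule list, the patterns, and the names compile_rules and tag_with_regexes are all hypothetical, made up for illustration only.

```python
import re


def compile_rules(rules):
    """Compile (pattern, tag) pairs once so they can be reused on every line."""
    return [(re.compile(pattern), tag) for pattern, tag in rules]


def tag_with_regexes(line, compiled_rules):
    """Return the set of tags whose pattern matches anywhere in the line."""
    return {tag for regex, tag in compiled_rules if regex.search(line)}


# Hypothetical rules: each regex pattern is paired with its predefined tag.
RULES = compile_rules([
    (r"\b(terrier|retriever|pit bull)\b", "dogs"),
    (r"\b(tabby|mainecoon|tuxedo)\b", "cats"),
    (r"\bpar(akeet|rot)\b|\bcuckoo\b", "birds"),
])

if __name__ == "__main__":
    for line in ["terrier parakeet", "hello", "tabby"]:
        print(line, sorted(tag_with_regexes(line, RULES)))
```

Every pattern is tried against every line, so the cost grows with the number of patterns; that is the trade-off against a single-pass multi-token matcher.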

Thank you, Zach!

-----Original Message-----
From: Zach Bailey [mailto:zach.bailey@dataclip.com] 
Sent: Monday, December 06, 2010 3:38 PM
To: user@pig.apache.org
Subject: Re: Regex Match Tagger UDF?


 Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick
algorithm [1] to do something similar to what you're asking for. It works as follows:


1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a result to output
when that token is found:


define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');


2.) Apply AC_MATCHER to each tuple:


strings = LOAD 'myfile.txt' as (string:chararray);
tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;


tagged_strings will then contain each original line along with a bag of matching tags. For
instance, if we had the following in myfile.txt:


terrier parakeet
hello
goodbye
tabby
pit bull


after running the commands in #2 tagged_strings would look like (pardon the ad-hoc notation):


{ string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
{ string: 'hello', tags: {} }
{ string: 'goodbye', tags: {} }
{ string: 'tabby', tags: { 'cats' } }
{ string: 'pit bull', tags: { 'dogs' } }
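
The matching behaviour described above can be sketched outside Pig as follows. This is a minimal standalone Python sketch of Aho-Corasick tagging, not the actual UDF: the function names (parse_spec, build_automaton, tag_line) are hypothetical, and the parsing of the 'tag=[a|b];...' string is an assumption based only on the example spec shown in step 1. Note that it matches substrings, so 'parrots' would also be tagged 'birds'.

```python
from collections import deque


def parse_spec(spec):
    """Parse 'tag=[tok1|tok2];tag2=[...]' into a {token: tag} dict (assumed format)."""
    token_to_tag = {}
    for group in spec.split(";"):
        group = group.strip()
        if not group:
            continue
        tag, _, body = group.partition("=")
        for token in body.strip().strip("[]").split("|"):
            token = token.strip()
            if token:
                token_to_tag[token] = tag.strip()
    return token_to_tag


def build_automaton(token_to_tag):
    """Build the Aho-Corasick trie, failure links, and per-state output tags."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link for each state
    out = [set()]  # tags emitted when a state is reached
    for token, tag in token_to_tag.items():
        state = 0
        for ch in token:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(tag)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states follow
    # their parent's failure chain until a state with a matching edge is found.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def tag_line(line, automaton):
    """Scan the line once and return the set of tags for every token found."""
    goto, fail, out = automaton
    state, tags = 0, set()
    for ch in line:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        tags |= out[state]
    return tags


SPEC = ("dogs=[terrier|retriever|pit bull];"
        "cats=[tabby|mainecoon|tuxedo];"
        "birds=[parakeet|parrot|cuckoo]")
MATCHER = build_automaton(parse_spec(SPEC))

if __name__ == "__main__":
    for line in ["terrier parakeet", "hello", "goodbye", "tabby", "pit bull"]:
        print(line, sorted(tag_line(line, MATCHER)))
```

The automaton is built once from the token list and each line is then scanned in a single pass, regardless of how many tokens are configured.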


If this is something you'd be interested in using/extending I can put it up on github for your
forking pleasure.

Cheers,
Zach


On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:

> I have a list of regex patterns that I would like to run against a data 
> set, and if a record matches a particular pattern in the list, tag it with 
> the predefined tag for that pattern.
> Has this been done, or is it available somewhere? 
> I've not written any UDFs, and although I'm not against doing so, I 
> probably don't have the time to write one at this point.
> 
> If this isn't available somewhere I can work around this roadblock, 
> but it would be awesome if someone has cooked up this functionality 
> somewhere.
> 
> -----Original Message-----
> From: Anze [mailto:anzenews@volja.net]
> Sent: Monday, December 06, 2010 3:09 PM
> To: user@pig.apache.org
> Subject: Re: Easy question...difference between this::form and 
> this.form?
> 
> 
> Sorry to hijack your question, Jonathan, but while we are at it... :)
> 
> Is there a way to tell Pig NOT to add "base_alias::"? Almost half my 
> code consists of FOREACH... GENERATE that just remove these prefixes.
> 
> Thanks,
> 
> Anze
> 
> On Monday 06 December 2010, Daniel Dai wrote:
> 
> >  After JOIN, CROSS, or FOREACH ... FLATTEN, Pig automatically adds the
> > "base_alias::" prefix. All other cases use ".".
> > 
> >  Daniel
> > 
> >  Jonathan Coveney wrote:
> > > It's very hard to search for this among the docs because it's so generic,
> > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > 
> > > Taking a look at this code that I found online, for example
> > > 
> > > --
> > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > -- numeric column by its maximum.
> > > --
> > > %default DATABAG 'data/timeseries.tsv'
> > > 
> > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > accumulate = GROUP data ALL;
> > > calc_max = FOREACH accumulate GENERATE FLATTEN(data), MAX(data.count) AS max_count;
> > > normalize = FOREACH calc_max GENERATE data::month AS month, data::count AS count,
> > >     (float)data::count / (float)max_count AS normed_count;
> > > DUMP normalize;
> > > 
> > > What purpose does data::month serve versus data.count?
> > > 
> > > Thanks

