pig-user mailing list archives

From Zach Bailey <zach.bai...@dataclip.com>
Subject Re: Regex Match Tagger UDF?
Date Mon, 06 Dec 2010 20:37:43 GMT

 Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick
algorithm [1] to do something similar to what you're asking for. It works as follows:


1.) Initialize the Aho-Corasick UDF with lists of tokens to search for, each paired with the tag
to output when one of its tokens is found:


define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');


2.) Apply AC_MATCHER to each tuple:


strings = LOAD 'myfile.txt' as (string:chararray);
tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;


The tagged_strings relation will then contain each original line along with a bag of the tags
that matched. For instance, if we had the following in myfile.txt:


terrier parakeet
hello
goodbye
tabby
pit bull


After running the commands in step 2, tagged_strings would look like this (pardon the ad-hoc notation):


{ string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
{ string: 'hello', tags: {} }
{ string: 'goodbye', tags: {} }
{ string: 'tabby', tags: { 'cats' } }
{ string: 'pit bull', tags: { 'dogs' } }
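
To make the behavior above concrete, here is a minimal sketch of the same tagging logic in plain Python. It is not the UDF's actual code: the function names (parse_spec, build_automaton, tag_string) are hypothetical, the spec format is inferred only from the define statement in step 1, and it assumes plain substring matching with no word-boundary handling.

```python
from collections import deque


def parse_spec(spec):
    """Parse 'tag=[a|b];tag2=[c]' into a dict mapping each token to its tag.

    The format is an assumption based on the DEFINE example in the message.
    """
    token_to_tag = {}
    for entry in spec.split(";"):
        if not entry.strip():
            continue
        tag, _, tokens = entry.partition("=")
        for token in tokens.strip().strip("[]").split("|"):
            token_to_tag[token.strip()] = tag.strip()
    return token_to_tag


def build_automaton(token_to_tag):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix
    out = [set()]  # out[state] -> tags emitted on reaching this state

    # Phase 1: insert every token into the trie.
    for token, tag in token_to_tag.items():
        state = 0
        for ch in token:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(tag)

    # Phase 2: breadth-first pass to fill in failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit tags from tokens that end as a suffix of this path.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def tag_string(text, automaton):
    """Return the set of tags whose tokens occur anywhere in text."""
    goto, fail, out = automaton
    state = 0
    tags = set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        tags |= out[state]
    return tags


spec = ("dogs=[terrier|retriever|pit bull];"
        "cats=[tabby|mainecoon|tuxedo];"
        "birds=[parakeet|parrot|cuckoo]")
automaton = build_automaton(parse_spec(spec))
for line in ["terrier parakeet", "hello", "goodbye", "tabby", "pit bull"]:
    print(line, sorted(tag_string(line, automaton)))
```

The automaton is built once from the spec and then reused for every input line, which mirrors initializing the UDF once in the define statement and applying it per tuple. Each line is scanned in a single pass regardless of how many tokens are configured.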


If this is something you'd be interested in using/extending, I can put it up on GitHub for your
forking pleasure.

Cheers,
Zach


On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:

> I have a list of regex patterns that I would like to run against a data
> set, and if it matches a particular pattern in the list, tag it with the
> predefined tag for that pattern.
> Has this been done, or is it available somewhere?
> I've not written any UDFs, and although I'm not against doing so, I
> probably don't have the time to write one at this point.
> 
> If this isn't available somewhere I can work around this roadblock, but
> it would be awesome if someone has cooked up this functionality
> somewhere.
> 
> -----Original Message-----
> From: Anze [mailto:anzenews@volja.net] 
> Sent: Monday, December 06, 2010 3:09 PM
> To: user@pig.apache.org
> Subject: Re: Easy question...difference between this::form and
> this.form?
> 
> 
> Sorry to hijack your question, Jonathan, but while we are at it... :) 
> 
> Is there a way to tell Pig NOT to add "base_alias::"? Almost half my
> code consists of FOREACH... GENERATE that just remove these prefixes.
> 
> Thanks,
> 
> Anze
> 
> On Monday 06 December 2010, Daniel Dai wrote:
> 
> >  After JOIN, CROSS, or FOREACH with FLATTEN, Pig will automatically add
> >  the "base_alias::" prefix. All other cases use "."
> > 
> >  Daniel
> > 
> >  Jonathan Coveney wrote:
> > > It's very hard to search for this among the docs because it's so generic,
> > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > 
> > > Taking a look at this code that I found online, for example
> > > 
> > > --
> > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > -- numeric column by its maximum.
> > > --
> > > %default DATABAG 'data/timeseries.tsv'
> > > 
> > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > accumulate = GROUP data ALL;
> > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > MAX(data.count) AS max_count;
> > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > data::count AS count, (float)data::count / (float)max_count AS
> > > normed_count;
> > > DUMP normalize;
> > > 
> > > What purpose does data::month serve versus data.count?
> > > 
> > > Thanks
> > 
> > 
> 
> 
> 
> 


