pig-user mailing list archives

From Zach Bailey <zach.bai...@dataclip.com>
Subject Re: Regex Match Tagger UDF?
Date Mon, 06 Dec 2010 22:25:10 GMT

 Here you go:


https://github.com/znbailey/Dataclip-Piggybank


The UDF you'll be interested in is here:


https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java


I would recommend grabbing the entire repo, as that UDF depends on the repackaged version of
Aho-Corasick in org/arabidopsis/ahocorasick.


Enjoy,
Zach


On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:

> No problem.
> Sounds good, and no worries about messy code. We are all well aware that code often lacks
> elegance when you are just trying to get it out the door.
> -----Original Message-----
> From: Zach Bailey [mailto:zach.bailey@dataclip.com] 
> Sent: Monday, December 06, 2010 4:46 PM
> To: user@pig.apache.org
> Subject: Re: Regex Match Tagger UDF?
> 
> 
>  Great. Let me clean up the code a bit and I'd be happy to post it. I'm definitely open
>  to some alternatives in terms of how this UDF would be initialized, whether it is via a file
>  sitting on HDFS, etc. The current initialization scheme is admittedly crude but was simple
>  to code and works for us for now.
> 
> Cheers,
> Zach
> 
> 
> On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:
> 
> 
> >  That is an interesting approach. I like it. Not ideal, but I think it could work
> >  for what I am doing.
> > 
> >  In general I think that is useful to the community and you should github it. 
> >  By all means, I would love to use this.
> > 
> >  I think I could extend/fork this for my need.
> > 
> >  Thank you Zach!
> > 
> >  -----Original Message-----
> >  From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> >  Sent: Monday, December 06, 2010 3:38 PM
> >  To: user@pig.apache.org
> >  Subject: Re: Regex Match Tagger UDF?
> > 
> > 
> >  Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick
> >  algorithm [1] to do something similar to what you're asking for. It works as follows:
> > 
> > 
> >  1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a
> >  result to output when that token is found:
> > 
> > 
> >  define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> > 
> > 
> >  2.) Apply the AC_MATCHER to a tuple:
> > 
> > 
> >  strings = LOAD 'myfile.txt' as (string:chararray);
> >  tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> > 
> > 
> >  The tagged_strings will then contain the original line along with a bag of matches.
> >  For instance, if we had the following in myfile.txt:
> > 
> > 
> >  terrier parakeet
> >  hello
> >  goodbye
> >  tabby
> >  pit bull
> > 
> > 
> >  After running the commands in #2, tagged_strings would look like this (pardon the ad-hoc
> >  notation):
> > 
> > 
> >  { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> >  { string: 'hello', tags: {} }
> >  { string: 'goodbye', tags: {} }
> >  { string: 'tabby', tags: { 'cats' } }
> >  { string: 'pit bull', tags: { 'dogs' } }
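
The tagging behaviour described in steps 1 and 2 can be sketched in plain Java, outside of
Pig. This is only a minimal illustration, not the code from the linked repository: the class
name TagMatcherSketch and its tag() method are hypothetical, the spec format
(tag=[token|token];...) is inferred from the define statement above, and matching is assumed
to be on plain substrings with no word-boundary check.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: a small Aho-Corasick automaton that maps tokens to tags.
class TagMatcherSketch {
    // One trie node: outgoing edges, failure link, and the tags emitted at this state.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final Set<String> tags = new LinkedHashSet<>();
    }

    private final Node root = new Node();

    // Assumed spec format: tag=[tok|tok];tag=[tok|tok]
    TagMatcherSketch(String spec) {
        for (String group : spec.split(";")) {
            int eq = group.indexOf('=');
            String tag = group.substring(0, eq).trim();
            String body = group.substring(eq + 1).trim();
            body = body.substring(1, body.length() - 1); // strip the surrounding [ ]
            for (String token : body.split("\\|")) {
                String t = token.trim();
                if (!t.isEmpty()) {
                    insert(t, tag);
                }
            }
        }
        buildFailureLinks();
    }

    // Add one token to the trie and record its tag at the final node.
    private void insert(String token, String tag) {
        Node node = root;
        for (char c : token.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.tags.add(tag);
    }

    // Breadth-first pass: each node's failure link points to the longest proper
    // suffix of its path that is also a path in the trie.
    private void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.tags.addAll(child.fail.tags); // inherit tags of matching suffixes
                queue.add(child);
            }
        }
    }

    // Scan the text once and collect every tag whose token occurs in it.
    Set<String> tag(String text) {
        Set<String> found = new LinkedHashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            found.addAll(node.tags);
        }
        return found;
    }

    public static void main(String[] args) {
        TagMatcherSketch matcher = new TagMatcherSketch(
            "dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]");
        for (String line : new String[] {"terrier parakeet", "hello", "goodbye", "tabby", "pit bull"}) {
            System.out.println(line + " -> " + matcher.tag(line));
        }
    }
}
```

Because the sketch matches substrings, a line such as "tabbycat" would also be tagged
'cats'. Whether the actual UDF behaves that way is not stated in this thread.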
> > 
> > 
> >  If this is something you'd be interested in using or extending, I can put it up on github
> >  for your forking pleasure.
> > 
> >  Cheers,
> >  Zach
> > 
> > 
> >  On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> > 
> > 
> > > I have a list of regex patterns that I would like to run against a 
> > > data set, and if a record matches a particular pattern in the list, tag 
> > > it with the predefined tag for that pattern.
> > > Has this been done, or is it available somewhere? 
> > > I've not written any UDFs, and although I'm not against doing so, 
> > > I probably don't have the time to write one at this point.
> > > 
> > > If this isn't available somewhere I can work around this roadblock, 
> > > but it would be awesome if someone has cooked up this functionality 
> > > somewhere.
> > > 
> > > -----Original Message-----
> > > From: Anze [mailto:anzenews@volja.net]
> > > Sent: Monday, December 06, 2010 3:09 PM
> > > To: user@pig.apache.org
> > > Subject: Re: Easy question...difference between this::form and this.form?
> > > 
> > > 
> > > Sorry to hijack your question, Jonathan, but while we are at it... :)
> > > 
> > > Is there a way to tell Pig NOT to add "base_alias::"? Almost half 
> > > my code consists of FOREACH... GENERATE that just remove these prefixes.
> > > 
> > > Thanks,
> > > 
> > > Anze
> > > 
> > > On Monday 06 December 2010, Daniel Dai wrote:
> > > 
> > > > After join, cross, foreach flatten, Pig will automatically add 
> > > > "base_alias::" prefix. All other cases use "."
> > > > 
> > > > Daniel
> > > > 
> > > > Jonathan Coveney wrote:
> > > > > > It's very hard to search for this among the docs because it's so generic,
> > > > > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > > 
> > > > > Taking a look at this code that I found online, for example
> > > > > 
> > > > > --
> > > > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > > > -- numeric column by its maximum.
> > > > > > --
> > > > > > %default DATABAG 'data/timeseries.tsv'
> > > > > > 
> > > > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > > accumulate = GROUP data ALL;
> > > > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data), MAX(data.count) AS max_count;
> > > > > > normalize = FOREACH calc_max GENERATE data::month AS month, data::count AS count,
> > > > > >     (float)data::count / (float)max_count AS normed_count;
> > > > > > DUMP normalize;
> > > > > 
> > > > > What purpose does data::month serve versus data.count?
> > > > > 
> > > > > Thanks
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 


