pig-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: Regex Match Tagger UDF?
Date Tue, 07 Dec 2010 02:26:01 GMT
Zach,
Do you mind contributing that directly to the Piggybank's upcoming home,
https://github.com/wilbur/Piggybank ?

D

On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey <zach.bailey@dataclip.com> wrote:

>
>  Here you go:
>
>
> https://github.com/znbailey/Dataclip-Piggybank
>
>
> The UDF you'll be interested in is here:
>
>
>
> https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java
>
>
> I would recommend grabbing the entire repo as that UDF depends on the
> repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick
>
>
> Enjoy,
> Zach
>
>
> On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:
>
> > No problem.
> > Sounds good. And no worry about messy code. We are all well aware that
> > code often lacks elegance when you are just trying to get it out the door.
> > -----Original Message-----
> > From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > Sent: Monday, December 06, 2010 4:46 PM
> > To: user@pig.apache.org
> > Subject: Re: Regex Match Tagger UDF?
> >
> >
> >  Great. Let me clean up the code a bit and I'd be happy to post it. I'm
> >  definitely open to some alternatives in terms of how this UDF would be
> >  initialized, whether it is via a file sitting on HDFS, etc. The current
> >  initialization scheme is admittedly crude but was simple to code and works
> >  for us for now.
> >
> > Cheers,
> > Zach
> >
> >
> > On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:
> >
> >
> > >  That is an interesting approach. I like it. Not ideal, but I think it
> > >  could work for what I am doing.
> > >
> > >  In general I think that is useful to the community and you should
> > >  github it.
> > >  By all means, I would love to use this.
> > >
> > >  I think I could extend/fork this for my need.
> > >
> > >  Thank you Zach!
> > >
> > >  -----Original Message-----
> > >  From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > >  Sent: Monday, December 06, 2010 3:38 PM
> > >  To: user@pig.apache.org
> > >  Subject: Re: Regex Match Tagger UDF?
> > >
> > >
> > >  Does the UDF have to support regular expressions? If not, I have
> > >  adapted the Aho-Corasick algorithm [1] to do something similar to what
> > >  you're asking for. It works as follows:
> > >
> > >
> > >  1.) Initialize the Aho-Corasick UDF with a list of tokens to search
> > >  for, and a result to output when that token is found:
> > >
> > >
> > >  define AC_MATCHER
> > >  com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> > >
> > >
> > >  2.) apply the AC_MATCHER to a tuple
> > >
> > >
> > >  strings = LOAD 'myfile.txt' as (string:chararray);
> > >  tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> > >
> > >
> > >  The tagged_strings will then contain the original line along with a
> > >  bag of matches. For instance if we had the following in myfile.txt:
> > >
> > >
> > >  terrier parakeet
> > >  hello
> > >  goodbye
> > >  tabby
> > >  pit bull
> > >
> > >
> > >  after running the commands in #2 tagged_strings would look like
> > >  (pardon the ad-hoc notation):
> > >
> > >
> > >  { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> > >  { string: 'hello', tags: {} }
> > >  { string: 'goodbye', tags: {} }
> > >  { string: 'tabby', tags: { 'cats' } }
> > >  { string: 'pit bull', tags: { 'dogs' } }
> > >
> > >
> > >  If this is something you'd be interested in using/extending I can put
> > >  it up on github for your forking pleasure.
> > >
> > >  Cheers,
> > >  Zach
> > >
> > >
> > >  On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> > >
> > >
> > > > I have a list of regex patterns that I would like to run against a
> > > > data set, and if it matches a particular pattern in the list, tag
> > > > it with the predefined tag for that pattern.
> > > > Has this been done, or is it available somewhere?
> > > > I've not written any UDFs, and although I'm not against doing so,
> > > > I probably don't have the time to write one at this point.
> > > >
> > > > If this isn't available somewhere I can work around this roadblock,
> > > > but it would be awesome if someone has cooked up this functionality
> > > > somewhere.
> > > >
> > > > -----Original Message-----
> > > > From: Anze [mailto:anzenews@volja.net]
> > > > Sent: Monday, December 06, 2010 3:09 PM
> > > > To: user@pig.apache.org
> > > > Subject: Re: Easy question...difference between this::form and
> > > > this.form?
> > > >
> > > >
> > > > Sorry to hijack your question, Jonathan, but while we are at it...
> > > > :)
> > > >
> > > > Is there a way to tell Pig NOT to add "base_alias::"? Almost half
> > > > my code consists of FOREACH... GENERATE that just remove these
> > > > prefixes.
> > > >
> > > > Thanks,
> > > >
> > > > Anze
> > > >
> > > > On Monday 06 December 2010, Daniel Dai wrote:
> > > >
> > > > > After join, cross, foreach flatten, Pig will automatically add
> > > > > "base_alias::" prefix. All other cases use "."
> > > > >
> > > > > Daniel
> > > > >
> > > > > Jonathan Coveney wrote:
> > > > > > > It's very hard to search for this among the docs because it's so
> > > > > > > generic, so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > > >
> > > > > > Taking a look at this code that I found online, for example
> > > > > >
> > > > > > --
> > > > > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > > > > -- numeric column by its maximum.
> > > > > > --
> > > > > > %default DATABAG 'data/timeseries.tsv'
> > > > > >
> > > > > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > > > accumulate = GROUP data ALL;
> > > > > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data), MAX(data.count) AS max_count;
> > > > > > > normalize = FOREACH calc_max GENERATE data::month AS month, data::count AS count,
> > > > > > >     (float)data::count / (float)max_count AS normed_count;
> > > > > > > DUMP normalize;
> > > > > >
> > > > > > What purpose does data::month serve versus data.count?
> > > > > >
> > > > > > Thanks
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
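
The tagging scheme Zach describes in the quoted thread (a spec string of the form tag=[token|token];tag=[token|token], applied to each input line to produce the set of matching tags) can be sketched in plain Java as below. This is only an illustration under stated assumptions, not the Dataclip UDF: the class name TokenTagger and its methods are hypothetical, it has no Pig dependencies, it returns a Set rather than a Pig bag, and it uses a naive String.contains scan per token where the real UDF uses Aho-Corasick to find all tokens in a single pass over the text.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not the Dataclip UDF): tags a line by naive substring search. */
final class TokenTagger {
    // Maps each search token to the tag it should emit, in declaration order.
    private final Map<String, String> tagByToken = new LinkedHashMap<>();

    /** Parses a spec such as "dogs=[terrier|pit bull];cats=[tabby]". */
    static TokenTagger parse(String spec) {
        TokenTagger tagger = new TokenTagger();
        for (String group : spec.split(";")) {
            group = group.trim();
            if (group.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            int eq = group.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected tag=[a|b], got: " + group);
            }
            String tag = group.substring(0, eq).trim();
            String list = group.substring(eq + 1).trim();
            if (!list.startsWith("[") || !list.endsWith("]")) {
                throw new IllegalArgumentException("Expected tag=[a|b], got: " + group);
            }
            // Strip the brackets, then split the token list on the literal '|'.
            for (String token : list.substring(1, list.length() - 1).split("\\|")) {
                token = token.trim();
                if (!token.isEmpty()) {
                    tagger.tagByToken.put(token, tag);
                }
            }
        }
        return tagger;
    }

    /** Returns the distinct tags whose tokens occur anywhere in the line. */
    Set<String> tag(String line) {
        Set<String> tags = new LinkedHashSet<>();
        for (Map.Entry<String, String> entry : tagByToken.entrySet()) {
            // Naive scan: one pass over the line per token (Aho-Corasick avoids this).
            if (line.contains(entry.getKey())) {
                tags.add(entry.getValue());
            }
        }
        return tags;
    }
}
```

Calling TokenTagger.parse(...) once with the spec from the define statement and then tag(line) for each line mirrors the initialize-then-apply split in steps 1 and 2 of the quoted message. The per-token scan is fine for a handful of tokens but grows linearly with the token count, which is the cost the Aho-Corasick approach is meant to avoid.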
