pig-user mailing list archives

From "Brian Adams" <Brian.Ad...@chacha.com>
Subject RE: Regex Match Tagger UDF?
Date Mon, 06 Dec 2010 21:55:43 GMT
No problem.
Sounds good. And no worries about messy code. We are all well aware that code often lacks elegance
when you are just trying to get it out the door.
-----Original Message-----
From: Zach Bailey [mailto:zach.bailey@dataclip.com] 
Sent: Monday, December 06, 2010 4:46 PM
To: user@pig.apache.org
Subject: Re: Regex Match Tagger UDF?


 Great. Let me clean up the code a bit and I'd be happy to post it. I'm definitely open to
some alternatives in terms of how this UDF would be initialized, whether it is via a file
sitting on HDFS, etc. The current initialization scheme is admittedly crude but was simple
to code and works for us for now.

Cheers,
Zach


On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:

> That is an interesting approach. I like it. Not ideal, but I think it could work for
> what I am doing.
> 
> In general I think that is useful to the community and you should github it. 
> By all means, I would love to use this.
> 
> I think I could extend/fork this for my need.
> 
> Thank you Zach!
> 
> -----Original Message-----
> From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> Sent: Monday, December 06, 2010 3:38 PM
> To: user@pig.apache.org
> Subject: Re: Regex Match Tagger UDF?
> 
> 
> Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick
> algorithm [1] to do something similar to what you're asking for. It works as follows:
> 
> 
> 1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a result
> to output when that token is found:
> 
> 
> define AC_MATCHER
>     com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> 
> 
> 2.) Apply the AC_MATCHER to a tuple:
> 
> 
> strings = LOAD 'myfile.txt' as (string:chararray);
> tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> 
> 
> The tagged_strings will then contain the original line along with a bag of matches. For
> instance, if we had the following in myfile.txt:
> 
> 
> terrier parakeet
> hello
> goodbye
> tabby
> pit bull
> 
> 
> after running the commands in #2 tagged_strings would look like (pardon the ad-hoc notation):
> 
> 
> { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> { string: 'hello', tags: {} }
> { string: 'goodbye', tags: {} }
> { string: 'tabby', tags: { 'cats' } }
> { string: 'pit bull', tags: { 'dogs' } }
> 
> 
> If this is something you'd be interested in using/extending, I can put it up on github
> for your forking pleasure.
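
The tagging behavior described above can be sketched in plain Python. This is only an illustration of the technique, not the UDF's actual code (which is not shown in this thread): the function names `parse_spec`, `build_automaton`, and `tag_string` are hypothetical, and the spec format is assumed from the `define` example. Note that this sketch matches tokens anywhere in the string, including inside longer words.

```python
from collections import deque

def parse_spec(spec):
    """Parse 'tag=[tok|tok];tag=[tok]' into a {token: tag} dict (assumed format)."""
    token_to_tag = {}
    for group in spec.split(";"):
        if not group.strip():
            continue
        tag, _, tokens = group.partition("=")
        for token in tokens.strip().strip("[]").split("|"):
            if token.strip():
                token_to_tag[token.strip()] = tag.strip()
    return token_to_tag

def build_automaton(token_to_tag):
    """Build an Aho-Corasick automaton: trie transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [set()]
    # Phase 1: insert every token into the trie; the end node emits its tag.
    for token, tag in token_to_tag.items():
        node = 0
        for ch in token:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(tag)
    # Phase 2: breadth-first pass to compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]
            queue.append(child)
    return goto, fail, out

def tag_string(text, automaton):
    """Return the set of tags whose tokens occur anywhere in text (single pass)."""
    goto, fail, out = automaton
    node, tags = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        tags |= out[node]
    return tags

SPEC = ("dogs=[terrier|retriever|pit bull];"
        "cats=[tabby|mainecoon|tuxedo];"
        "birds=[parakeet|parrot|cuckoo]")
AUTOMATON = build_automaton(parse_spec(SPEC))

for line in ["terrier parakeet", "hello", "goodbye", "tabby", "pit bull"]:
    print(line, sorted(tag_string(line, AUTOMATON)))
```

The appeal of Aho-Corasick here is that each input string is scanned once regardless of how many tokens are configured, which matters when the token list is large.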
> 
> Cheers,
> Zach
> 
> 
> On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> 
> 
> >  I have a list of regex patterns that I would like to run against a
> > data set, and if a record matches a particular pattern in the list, tag
> > it with the predefined tag for that pattern.
> >  Has this been done, or is it available somewhere?
> >  I've not written any UDFs, and although I'm not against doing so,
> > I probably don't have the time to write one at this point.
> > 
> >  If this isn't available somewhere I can work around this roadblock,
> > but it would be awesome if someone has cooked up this functionality
> > somewhere.
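
For comparison, the regex-based tagging asked about here can be sketched as a small Python function. This is a minimal illustration under assumptions, not an existing Pig UDF: the `PATTERN_TAGS` table and the `regex_tags` name are hypothetical, and the patterns simply reuse the example tokens from elsewhere in this thread.

```python
import re

# Hypothetical pattern-to-tag table; a real job would load this from configuration.
PATTERN_TAGS = [
    (r"\b(terrier|retriever|pit bull)\b", "dogs"),
    (r"\b(tabby|tuxedo)\b", "cats"),
    (r"\bpar(akeet|rot)\b", "birds"),
]

# Compile once up front so each record only pays for the search itself.
COMPILED = [(re.compile(pattern, re.IGNORECASE), tag) for pattern, tag in PATTERN_TAGS]

def regex_tags(text):
    """Return the set of tags whose pattern matches somewhere in text."""
    return {tag for pattern, tag in COMPILED if pattern.search(text)}

for line in ["terrier parakeet", "hello", "tabby"]:
    print(line, sorted(regex_tags(line)))
```

Unlike a single-pass token matcher, this runs every pattern against every record, so the cost grows with the number of patterns.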
> > 
> >  -----Original Message-----
> >  From: Anze [mailto:anzenews@volja.net]
> >  Sent: Monday, December 06, 2010 3:09 PM
> >  To: user@pig.apache.org
> >  Subject: Re: Easy question...difference between this::form and this.form?
> > 
> > 
> >  Sorry to hijack your question, Jonathan, but while we are at it... :)
> > 
> >  Is there a way to tell Pig NOT to add "base_alias::"? Almost half my
> > code consists of FOREACH... GENERATE that just remove these prefixes.
> > 
> >  Thanks,
> > 
> >  Anze
> > 
> >  On Monday 06 December 2010, Daniel Dai wrote:
> > 
> > > After join, cross, foreach flatten, Pig will automatically add 
> > > "base_alias::" prefix. All other cases use "."
> > > 
> > > Daniel
> > > 
> > > Jonathan Coveney wrote:
> > > > It's very hard to search for this among the docs because it's so generic,
> > > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > 
> > > > Taking a look at this code that I found online, for example
> > > > 
> > > > --
> > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > -- numeric column by its maximum.
> > > > --
> > > > %default DATABAG 'data/timeseries.tsv'
> > > > 
> > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > accumulate = GROUP data ALL;
> > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data), MAX(data.count) AS max_count;
> > > > normalize = FOREACH calc_max GENERATE data::month AS month, data::count AS count,
> > > >     (float)data::count / (float)max_count AS normed_count;
> > > > DUMP normalize;
> > > > 
> > > > What purpose does data::month serve versus data.count?
> > > > 
> > > > Thanks
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 


