Mailing-List: contact user-help@pig.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@pig.apache.org
Received-SPF: pass (athena.apache.org: domain of zach.bailey@dataclip.com
 designates 209.85.160.177 as permitted sender)
Date: Tue, 7 Dec 2010 14:34:13 -0500
From: Zach Bailey <zach.bailey@dataclip.com>
To: user@pig.apache.org
Message-ID: <ED7B06730B1F492890804F8EA471708C@dataclip.com>
In-Reply-To: <AANLkTimzoqzwQ0Y0nMycHkx8OgvJjMub6gCmKTOW-Xwo@mail.gmail.com>
References: 
 <ABB74F89BC153F44BEA49B944CBDDEF706155DC4@ares.elysianfields.scottajones.com>
 <E3433CDB7D014328AE93E0FFFB71CC12@dataclip.com>
 <ABB74F89BC153F44BEA49B944CBDDEF706155E1C@ares.elysianfields.scottajones.com>
 <E3ED82E5FBE74BDE8759F216CBB4451F@dataclip.com>
 <ABB74F89BC153F44BEA49B944CBDDEF706155E80@ares.elysianfields.scottajones.com>
 <D89C0639389D496F935DD402E73AA927@dataclip.com>
 <AANLkTimzoqzwQ0Y0nMycHkx8OgvJjMub6gCmKTOW-Xwo@mail.gmail.com>
Subject: Re: Regex Match Tagger UDF?
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="4cfe8c35_74d78549_cb51"
Content-Transfer-Encoding: 8bit

--4cfe8c35_74d78549_cb51
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline


 Dmitriy,


I'm happy to contribute those UDF classes to that GitHub repo. Are there instructions anywhere on how I should go about doing so? My main concerns are:


* how to get repo access (should I fork and submit a pull request?)
* style/format/naming restrictions or suggestions (Java code format, checkstyle, whether UDF names should be upper case, camel case, etc.)
* Java package restrictions or suggestions (can the UDFs stay in com.dataclip.piggybank, or should they be repackaged elsewhere?)
* how to handle repackaged code/libraries (one of my UDFs depends on a repackaged implementation of the Aho-Corasick algorithm)
* Pig version compatibility (the repo has 0.6.1; mine are written against 0.7.0)
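
To make the matching behaviour discussed further down the thread concrete, here is a rough, self-contained Java sketch. It is not the AHO_CORASICK class from the repo: the class name TokenTaggerSketch is hypothetical, it has no Pig dependency, and it uses a plain substring scan rather than an Aho-Corasick automaton. It only assumes the 'tag=[token|token];...' init string format and the tagging semantics from the example quoted below.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; NOT the AHO_CORASICK UDF from the repo.
class TokenTaggerSketch {
    // Maps each search token to its tag, e.g. "terrier" -> "dogs".
    private final Map<String, String> tokenToTag = new LinkedHashMap<>();

    // Parses a spec such as "dogs=[terrier|pit bull];cats=[tabby]".
    TokenTaggerSketch(String spec) {
        for (String group : spec.split(";")) {
            int eq = group.indexOf('=');
            if (eq < 0) {
                continue; // skip groups without a "tag=" part
            }
            String tag = group.substring(0, eq).trim();
            String body = group.substring(eq + 1).trim();
            // Strip the surrounding [ ] if present.
            if (body.length() >= 2 && body.startsWith("[") && body.endsWith("]")) {
                body = body.substring(1, body.length() - 1);
            }
            for (String token : body.split("\\|")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    tokenToTag.put(trimmed, tag);
                }
            }
        }
    }

    // Returns the tags of every token that occurs in the line.
    // Naive O(tokens * line length) scan; Aho-Corasick would find
    // all tokens in a single pass over the line instead.
    Set<String> tag(String line) {
        Set<String> tags = new LinkedHashSet<>();
        for (Map.Entry<String, String> e : tokenToTag.entrySet()) {
            if (line.contains(e.getKey())) {
                tags.add(e.getValue());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        TokenTaggerSketch tagger = new TokenTaggerSketch(
            "dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo]");
        System.out.println(tagger.tag("terrier tabby"));
    }
}
```

The substring scan is only a stand-in to show the input and output shapes; the single-pass matching is the reason the real UDF needs the repackaged Aho-Corasick library.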

Thanks,
Zach


On Monday, December 6, 2010 at 9:26 PM, Dmitriy Ryaboy wrote:

> Zach,
> Do you mind contributing that directly to the Piggybank's upcoming home,
> https://github.com/wilbur/Piggybank ?
> 
> D
> 
> On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey <zach.bailey@dataclip.com> wrote:
> 
> 
> > 
> >  Here you go:
> > 
> > 
> > https://github.com/znbailey/Dataclip-Piggybank
> > 
> > 
> >  The UDF you'll be interested in is here:
> > 
> > 
> > 
> > https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java
> > 
> > 
> >  I would recommend grabbing the entire repo as that UDF depends on the
> >  repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick
> > 
> > 
> >  Enjoy,
> >  Zach
> > 
> > 
> >  On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:
> > 
> > > No problem.
> > > Sounds good. And no worry about messy code. We are all well aware that
> > > code often lacks elegance when you are just trying to get it out the door.
> > > -----Original Message-----
> > > From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > > Sent: Monday, December 06, 2010 4:46 PM
> > > To: user@pig.apache.org
> > > Subject: Re: Regex Match Tagger UDF?
> > >
> > >
> > > Great. Let me clean up the code a bit and I'd be happy to post it. I'm
> > > definitely open to alternatives for how this UDF is initialized, such as
> > > a file sitting on HDFS. The current initialization scheme is admittedly
> > > crude, but it was simple to code and works for us for now.
> > >
> > > Cheers,
> > > Zach
> > >
> > >
> > > On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:
> > >
> > >
> > > > That is an interesting approach. I like it. Not ideal, but I think it
> > > > could work for what I am doing.
> > > >
> > > > In general I think that is useful to the community and you should
> > > > github it.
> > > > By all means, I would love to use this.
> > > >
> > > > I think I could extend/fork this for my need.
> > > >
> > > > Thank you Zach!
> > > >
> > > > -----Original Message-----
> > > > From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > > > Sent: Monday, December 06, 2010 3:38 PM
> > > > To: user@pig.apache.org
> > > > Subject: Re: Regex Match Tagger UDF?
> > > >
> > > >
> > > > Does the UDF have to support regular expressions? If not, I have
> > > > adapted the Aho-Corasick algorithm [1] to do something similar to what
> > > > you're asking for. It works as follows:
> > > >
> > > >
> > > > 1.) Initialize the Aho-Corasick UDF with a list of tokens to search
> > > > for, and a result to output when that token is found:
> > > >
> > > >
> > > > define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> > > >
> > > >
> > > > 2.) Apply AC_MATCHER to each tuple:
> > > >
> > > >
> > > > strings = LOAD 'myfile.txt' as (string:chararray);
> > > > tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> > > >
> > > >
> > > > The tagged_strings will then contain the original line along with a
> > > > bag of matches. For instance, if we had the following in myfile.txt:
> > > >
> > > >
> > > > terrier parakeet
> > > > hello
> > > > goodbye
> > > > tabby
> > > > pit bull
> > > >
> > > >
> > > > After running the commands in #2, tagged_strings would look like
> > > > (pardon the ad-hoc notation):
> > > >
> > > >
> > > > { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> > > > { string: 'hello', tags: {} }
> > > > { string: 'goodbye', tags: {} }
> > > > { string: 'tabby', tags: { 'cats' } }
> > > > { string: 'pit bull', tags: { 'dogs' } }
> > > >
> > > >
> > > > If this is something you'd be interested in using/extending, I can put
> > > > it up on github for your forking pleasure.
> > > >
> > > > Cheers,
> > > > Zach
> > > >
> > > >
> > > > On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> > > >
> > > >
> > > > > I have a list of regex patterns that I would like to run against a
> > > > > data set, and if a record matches a particular pattern in the list,
> > > > > tag it with the predefined tag for that pattern.
> > > > > Has this been done, or is it available somewhere?
> > > > > I've not written any UDFs, and although I'm not against doing so,
> > > > > I probably don't have the time to write one at this point.
> > > > >
> > > > > If this isn't available somewhere I can work around this roadblock,
> > > > > but it would be awesome if someone has cooked up this functionality
> > > > > somewhere.
> > > > >
> > > > > -----Original Message-----
> > > > > From: Anze [mailto:anzenews@volja.net]
> > > > > Sent: Monday, December 06, 2010 3:09 PM
> > > > > To: user@pig.apache.org
> > > > > Subject: Re: Easy question...difference between this::form and
> > > > > this.form?
> > > > >
> > > > >
> > > > > Sorry to hijack your question, Jonathan, but while we are at it...
> > > > > :)
> > > > >
> > > > > Is there a way to tell Pig NOT to add "base_alias::"? Almost half
> > > > > my code consists of FOREACH... GENERATE that just remove these
> > > > > prefixes.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Anze
> > > > >
> > > > > On Monday 06 December 2010, Daniel Dai wrote:
> > > > >
> > > > > > After join, cross, foreach flatten, Pig will automatically add
> > > > > > "base_alias::" prefix. All other cases use "."
> > > > > >
> > > > > > Daniel
> > > > > >
> > > > > > Jonathan Coveney wrote:
> > > > > > > It's very hard to search for this among the docs because it's so
> > > > > > > generic, so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > > > >
> > > > > > > Taking a look at this code that I found online, for example
> > > > > > >
> > > > > > > --
> > > > > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > > > > -- numeric column by its maximum.
> > > > > > > --
> > > > > > > %default DATABAG 'data/timeseries.tsv'
> > > > > > >
> > > > > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > > > accumulate = GROUP data ALL;
> > > > > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > > > > >     MAX(data.count) AS max_count;
> > > > > > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > > > > >     data::count AS count,
> > > > > > >     (float)data::count / (float)max_count AS normed_count;
> > > > > > > DUMP normalize;
> > > > > > >
> > > > > > > What purpose does data::month serve versus data.count?
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 


--4cfe8c35_74d78549_cb51--