pig-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: Regex Match Tagger UDF?
Date Wed, 08 Dec 2010 00:59:22 GMT
All good questions. I'll put all of this into a readme in the project, and
on the Pig wiki.
Thanks for your willingness to contribute!

0) all contributions should have the apache license
1) fork and make a pull request. The docs I will write up will include
something along the lines of "by sending a pull request you implicitly
confirm that you have the right to release this code under the apache 2.0
license"
2) I like camel-cased UDFs and generally follow the standard Sun code
conventions, though I prefer a two-space indentation. One of the pain points
for folks contributing to the main piggybank has been an overabundance of
requirements; I think for wild-west piggybank, we will be a lot more
lenient. Which has its costs, granted...
3) no restrictions, though a more generic package would be cool. LinkedIn
already contributed stuff under com.linkedin so there's precedent. If folks
feel strongly about the implicit attribution, I am cool with that.
4) assuming they are Apache-licensed, just change the ant build file. Ivy is
preferred over checking in jars.
5) If your UDF is not a Load/Store func, the interface is the same, so it
doesn't matter. Most likely when I pull in the real piggybank, we'll just
change version compatibility to 0.8.

-D

On Tue, Dec 7, 2010 at 11:34 AM, Zach Bailey <zach.bailey@dataclip.com> wrote:

>
>  Dmitriy,
>
>
> I'm happy to contribute those UDF classes to that Github repo. Are there
> instructions anywhere on how I should go about doing so? Of main concern
> are:
>
>
> * how to get repo access (should I fork and do a pull request?),
> * style/format/naming restrictions/suggestions (java code format -
> checkstyle, should the UDFs be upper cased, camel cased, etc.)
> * java package restrictions/suggestions (can the UDFs stay in
> com.dataclip.piggybank or should they be repackaged elsewhere)
> * how to handle repackaged code/libraries (one of my UDFs depends on a
> repackaged implementation of the Aho-Corasick algorithm)
> * pig version compatibility (the repo has 0.6.1, mine are written against
> 0.7.0)
>
> Thanks,
> Zach
>
>
> On Monday, December 6, 2010 at 9:26 PM, Dmitriy Ryaboy wrote:
>
> > Zach,
> > Do you mind contributing that directly to the Piggybank's upcoming home,
> > https://github.com/wilbur/Piggybank ?
> >
> > D
> >
> > On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey <zach.bailey@dataclip.com> wrote:
> >
> >
> > >
> > >  Here you go:
> > >
> > >
> > > https://github.com/znbailey/Dataclip-Piggybank
> > >
> > >
> > >  The UDF you'll be interested in is here:
> > >
> > >
> > >
> > >
> > > > https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java
> > >
> > >
> > >  I would recommend grabbing the entire repo as that UDF depends on the
> > >  repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick
> > >
> > >
> > >  Enjoy,
> > >  Zach
> > >
> > >
> > >  On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:
> > >
> > > > No problem.
> > > > > Sounds good. And no worry about messy code. We are all well aware that
> > > > > code often loses elegance when you are just trying to get it out the door.
> > > > -----Original Message-----
> > > > From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > > > Sent: Monday, December 06, 2010 4:46 PM
> > > > To: user@pig.apache.org
> > > > Subject: Re: Regex Match Tagger UDF?
> > > >
> > > >
> > > > > Great. Let me clean up the code a bit and I'd be happy to post it. I'm
> > > > > definitely open to some alternatives in terms of how this UDF would be
> > > > > initialized, whether it is via a file sitting on HDFS, etc. The current
> > > > > initialization scheme is admittedly crude but was simple to code and works
> > > > > for us for now.
> > > >
> > > > Cheers,
> > > > Zach
> > > >
> > > >
> > > > On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:
> > > >
> > > >
> > > > > > That is an interesting approach. I like it. Not ideal, but I think it
> > > > > > could work for what I am doing.
> > > > > >
> > > > > > In general I think that is useful to the community and you should
> > > > > > github it.
> > > > > By all means, I would love to use this.
> > > > >
> > > > > I think I could extend/fork this for my need.
> > > > >
> > > > > Thank you Zach!
> > > > >
> > > > > -----Original Message-----
> > > > > From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > > > > Sent: Monday, December 06, 2010 3:38 PM
> > > > > To: user@pig.apache.org
> > > > > Subject: Re: Regex Match Tagger UDF?
> > > > >
> > > > >
> > > > > > Does the UDF have to support regular expressions? If not, I have
> > > > > > adapted the Aho-Corasick algorithm [1] to do something similar to what
> > > > > > you're asking for. It works as follows:
> > > > >
> > > > >
> > > > > > 1.) Initialize the Aho-Corasick UDF with a list of tokens to search
> > > > > > for, and a result to output when that token is found:
> > > > >
> > > > >
> > > > > > define AC_MATCHER
> > > > > > com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> > > > >
> > > > >
> > > > > 2.) apply the AC_MATCHER to a tuple
> > > > >
> > > > >
> > > > > > strings = LOAD 'myfile.txt' as (string:chararray);
> > > > > > tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> > > > >
> > > > >
> > > > > > The tagged_strings will then contain the original line along with a
> > > > > > bag of matches. For instance if we had the following in myfile.txt:
> > > > >
> > > > >
> > > > > terrier parakeet
> > > > > hello
> > > > > goodbye
> > > > > tabby
> > > > > pit bull
> > > > >
> > > > >
> > > > > > after running the commands in #2 tagged_strings would look like
> > > > > > (pardon the ad-hoc notation):
> > > > > >
> > > > > >
> > > > > > { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> > > > > > { string: 'hello', tags: {} }
> > > > > > { string: 'goodbye', tags: {} }
> > > > > > { string: 'tabby', tags: { 'cats' } }
> > > > > > { string: 'pit bull', tags: { 'dogs' } }
> > > > >
> > > > >
> > > > > > If this is something you'd be interested in using/extending I can put
> > > > > > it up on github for your forking pleasure.
> > > > >
> > > > > Cheers,
> > > > > Zach
> > > > >
> > > > >
> > > > > On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> > > > >
> > > > >
> > > > > > > I have a list of regex patterns that I would like to run against a
> > > > > > > data set, and if it matches a particular pattern in the list, tag
> > > > > > > it with the predefined tag for that pattern.
> > > > > > > Has this been done, or is it available somewhere?
> > > > > > > I've not written any UDFs, and although I'm not against doing so,
> > > > > > > I probably don't have the time to write one at this point.
> > > > > > >
> > > > > > > If this isn't available somewhere I can work around this roadblock,
> > > > > > > but it would be awesome if someone has cooked up this functionality
> > > > > > > somewhere.
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Anze [mailto:anzenews@volja.net]
> > > > > > Sent: Monday, December 06, 2010 3:09 PM
> > > > > > To: user@pig.apache.org
> > > > > > Subject: Re: Easy question...difference between this::form and
> > > > > > this.form?
> > > > > >
> > > > > >
> > > > > > > Sorry to hijack your question, Jonathan, but while we are at it... :)
> > > > > >
> > > > > > > Is there a way to tell Pig NOT to add "base_alias::"? Almost half
> > > > > > > my code consists of FOREACH... GENERATE that just remove these
> > > > > > > prefixes.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Anze
> > > > > >
> > > > > > On Monday 06 December 2010, Daniel Dai wrote:
> > > > > >
> > > > > > > > After JOIN, CROSS, or FOREACH ... FLATTEN, Pig automatically adds the
> > > > > > > > "base_alias::" prefix. All other cases use ".".
> > > > > > >
> > > > > > > Daniel
> > > > > > >
> > > > > > > Jonathan Coveney wrote:
> > > > > > > > > It's very hard to search for this among the docs because it's so
> > > > > > > > > generic, so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > > > > >
> > > > > > > > > Taking a look at this code that I found online, for example
> > > > > > > >
> > > > > > > > --
> > > > > > > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > > > > > > -- numeric column by its maximum.
> > > > > > > > --
> > > > > > > > %default DATABAG 'data/timeseries.tsv'
> > > > > > > >
> > > > > > > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > > > > > accumulate = GROUP data ALL;
> > > > > > > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > > > > > > >     MAX(data.count) AS max_count;
> > > > > > > > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > > > > > > >     data::count AS count,
> > > > > > > > >     (float)data::count / (float)max_count AS normed_count;
> > > > > > > > > DUMP normalize;
> > > > > > > >
> > > > > > > > What purpose does data::month serve versus data.count?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
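
The tagging behavior Zach describes in the quoted thread can be sketched in plain Java. This is not the code from the Dataclip-Piggybank repo and it does not implement Aho-Corasick: it is a naive substring scan that only mirrors the spec format ('tag=[token|token];...') and the example output shown in the thread. The class name TagSketch and its methods are hypothetical, and the parsing rules are assumed from the single example spec string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the tagger described in the thread (not the real UDF). */
class TagSketch {

  /** Parses "tag=[a|b];tag2=[c|d]" into an ordered map of tag -> tokens. */
  static Map<String, List<String>> parseSpec(String spec) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String entry : spec.split(";")) {
      int eq = entry.indexOf('=');
      if (eq < 0) {
        continue; // assumption: silently skip entries without '='
      }
      String tag = entry.substring(0, eq).trim();
      String body = entry.substring(eq + 1).trim();
      // Strip the surrounding brackets when both are present.
      if (body.startsWith("[") && body.endsWith("]")) {
        body = body.substring(1, body.length() - 1);
      }
      List<String> tokens = new ArrayList<>();
      for (String token : body.split("\\|")) {
        String trimmed = token.trim();
        if (!trimmed.isEmpty()) {
          tokens.add(trimmed);
        }
      }
      result.put(tag, tokens);
    }
    return result;
  }

  /** Returns every tag that has at least one token occurring in the line. */
  static Set<String> tag(Map<String, List<String>> spec, String line) {
    Set<String> tags = new LinkedHashSet<>();
    for (Map.Entry<String, List<String>> e : spec.entrySet()) {
      for (String token : e.getValue()) {
        // Naive substring check; Aho-Corasick would match all tokens in one pass.
        if (line.contains(token)) {
          tags.add(e.getKey());
          break;
        }
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    Map<String, List<String>> spec = parseSpec(
        "dogs=[terrier|retriever|pit bull];"
            + "cats=[tabby|mainecoon|tuxedo];"
            + "birds=[parakeet|parrot|cuckoo]");
    String[] lines = {"terrier parakeet", "hello", "goodbye", "tabby", "pit bull"};
    for (String line : lines) {
      System.out.println(line + " -> " + tag(spec, line));
    }
  }
}
```

A real UDF would wrap this logic in a Pig EvalFunc and return a bag of tags, with the inner loop replaced by an Aho-Corasick automaton so that all tokens are matched in a single pass over the input.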
