Reply-To: user@pig.apache.org
Date: Tue, 7 Dec 2010 14:34:13 -0500
From: Zach Bailey
To: user@pig.apache.org
Subject: Re: Regex Match Tagger UDF?
X-Mailer: sparrow 1.0beta6 (build 398)
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="4cfe8c35_74d78549_cb51"
Content-Transfer-Encoding: 8bit

--4cfe8c35_74d78549_cb51
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline

Dmitriy,

I'm happy to contribute those UDF classes to that Github repo. Are there instructions anywhere on how I should go about doing so? Of main concern are:

* how to get repo access (should I fork and do a pull request?)
* style/format/naming restrictions/suggestions (java code format - checkstyle, should the UDFs be upper cased, camel cased, etc.)
* java package restrictions/suggestions (can the UDFs stay in com.dataclip.piggybank or should they be repackaged elsewhere)
* how to handle repackaged code/libraries (one of my UDFs depends on a repackaged implementation of the Aho-Corasick algorithm)
* pig version compatibility (the repo has 0.6.1, mine are written against 0.7.0)

Thanks,
Zach

On Monday, December 6, 2010 at 9:26 PM, Dmitriy Ryaboy wrote:

> Zach,
> Do you mind contributing that directly to the Piggybank's upcoming home,
> https://github.com/wilbur/Piggybank ?
>
> D
>
> On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey wrote:
>
> > Here you go:
> >
> > https://github.com/znbailey/Dataclip-Piggybank
> >
> > The UDF you'll be interested in is here:
> >
> > https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java
> >
> > I would recommend grabbing the entire repo as that UDF depends on the
> > repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick
> >
> > Enjoy,
> > Zach
> >
> > On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:
> >
> > > No problem.
> > > Sounds good. And no worry about messy code. We are all well aware that
> > > code often lacks elegance when you are just trying to get it out the door.
> > > -----Original Message-----
> > > From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > > Sent: Monday, December 06, 2010 4:46 PM
> > > To: user@pig.apache.org
> > > Subject: Re: Regex Match Tagger UDF?
> > >
> > >
> > > Great. Let me clean up the code a bit and I'd be happy to post it. I'm
> > > definitely open to some alternatives in terms of how this UDF would be
> > > initialized, whether it is via a file sitting on HDFS, etc. The current
> > > initialization scheme is admittedly crude but was simple to code and works
> > > for us for now.
> > >
> > > Cheers,
> > > Zach
> > >
> > >
> > > On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:
> > >
> > > > That is an interesting approach. I like it. Not ideal, but I think it
> > > > could work for what I am doing.
> > > >
> > > > In general I think that is useful to the community and you should
> > > > github it.
> > > > By all means, I would love to use this.
> > > >
> > > > I think I could extend/fork this for my need.
> > > >
> > > > Thank you Zach!
> > > >
> > > > -----Original Message-----
> > > > From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> > > > Sent: Monday, December 06, 2010 3:38 PM
> > > > To: user@pig.apache.org
> > > > Subject: Re: Regex Match Tagger UDF?
> > > >
> > > >
> > > > Does the UDF have to support regular expressions? If not, I have
> > > > adapted the Aho-Corasick algorithm [1] to do something similar to what
> > > > you're asking for. It works as follows:
> > > >
> > > >
> > > > 1.) Initialize the Aho-Corasick UDF with a list of tokens to search
> > > > for, and a result to output when that token is found:
> > > >
> > > >
> > > > define AC_MATCHER
> > > > com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> > > >
> > > >
> > > > 2.)
> > > > apply the AC_MATCHER to a tuple
> > > >
> > > >
> > > > strings = LOAD 'myfile.txt' as (string:chararray);
> > > > tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> > > >
> > > >
> > > > The tagged_strings will then contain the original line along with a
> > > > bag of matches. For instance if we had the following in myfile.txt:
> > > >
> > > >
> > > > terrier parakeet
> > > > hello
> > > > goodbye
> > > > tabby
> > > > pit bull
> > > >
> > > >
> > > > after running the commands in #2 tagged_strings would look like
> > > > (pardon the ad-hoc notation):
> > > >
> > > >
> > > > { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> > > > { string: 'hello', tags: {} }
> > > > { string: 'goodbye', tags: {} }
> > > > { string: 'tabby', tags: { 'cats' } }
> > > > { string: 'pit bull', tags: { 'dogs' } }
> > > >
> > > >
> > > > If this is something you'd be interested in using/extending I can put
> > > > it up on github for your forking pleasure.
> > > >
> > > > Cheers,
> > > > Zach
> > > >
> > > >
> > > > On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> > > >
> > > > > I have a list of regex patterns that I would like to run against a
> > > > > data set, and if it matches a particular pattern in the list, tag
> > > > > it with the predefined tag for that pattern.
> > > > > Has this been done, or is it available somewhere?
> > > > > I've not written any UDFs, and although I'm not against doing so,
> > > > > I probably don't have the time to write one at this point.
> > > > >
> > > > > If this isn't available somewhere I can work around this roadblock,
> > > > > but it would be awesome if someone has cooked up this functionality
> > > > > somewhere.
> > > > >
> > > > > -----Original Message-----
> > > > > From: Anze [mailto:anzenews@volja.net]
> > > > > Sent: Monday, December 06, 2010 3:09 PM
> > > > > To: user@pig.apache.org
> > > > > Subject: Re: Easy question...difference between this::form and this.form?
> > > > >
> > > > >
> > > > > Sorry to hijack your question, Jonathan, but while we are at it...
> > > > > :)
> > > > >
> > > > > Is there a way to tell Pig NOT to add "base_alias::"? Almost half
> > > > > my code consists of FOREACH... GENERATE that just remove these
> > > > > prefixes.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Anze
> > > > >
> > > > > On Monday 06 December 2010, Daniel Dai wrote:
> > > > >
> > > > > > After join, cross, foreach flatten, Pig will automatically add
> > > > > > "base_alias::" prefix. All other cases use "."
> > > > > >
> > > > > > Daniel
> > > > > >
> > > > > > Jonathan Coveney wrote:
> > > > > > > It's very hard to search for this among the docs because it's so
> > > > > > > generic, so I thought I'd ask... I'm sure the answer is painfully
> > > > > > > easy.
> > > > > > >
> > > > > > > Taking a look at this code that I found online, for example
> > > > > > >
> > > > > > > --
> > > > > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > > > > -- numeric column by its maximum.
> > > > > > > --
> > > > > > > %default DATABAG 'data/timeseries.tsv'
> > > > > > >
> > > > > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > > > accumulate = GROUP data ALL;
> > > > > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > > > > > MAX(data.count) AS max_count;
> > > > > > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > > > > > data::count AS count, (float)data::count / (float)max_count AS
> > > > > > > normed_count;
> > > > > > > DUMP normalize;
> > > > > > >
> > > > > > > What purpose does data::month serve versus data.count?
> > > > > > >
> > > > > > > Thanks

--4cfe8c35_74d78549_cb51--
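
The tagging behaviour described in the thread (a spec such as 'dogs=[terrier|retriever|pit bull];...' yielding a set of tags for each input line) can be sketched in plain Java without any Pig dependency. This is only an illustration under stated assumptions, not the AHO_CORASICK UDF from the Dataclip repo: the class name TokenTagger and its methods are hypothetical, the spec format is inferred from the single example in the thread, matching is assumed to be case-sensitive substring matching, and a naive String.contains scan stands in for the Aho-Corasick automaton.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical stand-in for the tagging logic discussed in the thread.
 * It uses a naive substring scan, NOT Aho-Corasick, and is not a Pig UDF.
 */
final class TokenTagger {
    // tag -> tokens, kept in the order the spec lists them
    private final Map<String, List<String>> tokensByTag = new LinkedHashMap<>();

    /** Parses a spec such as "dogs=[terrier|pit bull];cats=[tabby]" (format assumed). */
    TokenTagger(String spec) {
        for (String group : spec.split(";")) {
            int eq = group.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' in group: " + group);
            }
            String tag = group.substring(0, eq).trim();
            String body = group.substring(eq + 1).trim();
            // Strip the surrounding square brackets if present.
            if (body.startsWith("[") && body.endsWith("]")) {
                body = body.substring(1, body.length() - 1);
            }
            List<String> tokens = new ArrayList<>();
            for (String token : body.split("\\|")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    tokens.add(trimmed);
                }
            }
            tokensByTag.put(tag, tokens);
        }
    }

    /** Returns every tag that has at least one token occurring in the input. */
    Set<String> tag(String input) {
        Set<String> tags = new LinkedHashSet<>();
        if (input == null) {
            return tags;
        }
        for (Map.Entry<String, List<String>> entry : tokensByTag.entrySet()) {
            for (String token : entry.getValue()) {
                if (input.contains(token)) {
                    tags.add(entry.getKey());
                    break; // one hit is enough for this tag
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        TokenTagger tagger = new TokenTagger(
            "dogs=[terrier|retriever|pit bull];"
            + "cats=[tabby|mainecoon|tuxedo];"
            + "birds=[parakeet|parrot|cuckoo]");
        // The sample lines from myfile.txt in the thread.
        String[] lines = {"terrier parakeet", "hello", "goodbye", "tabby", "pit bull"};
        for (String line : lines) {
            System.out.println(line + " -> " + tagger.tag(line));
        }
    }
}
```

The naive scan costs roughly (number of tokens) x (input length) per line. Aho-Corasick finds all tokens in a single pass over the input, which is presumably why the original UDF uses it when the token list is large.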