pig-user mailing list archives

From "Brian Adams" <Brian.Ad...@chacha.com>
Subject Regex Match Tagger UDF?
Date Mon, 06 Dec 2010 20:25:42 GMT
I have a list of regex patterns that I would like to run against a data
set, and if it matches a particular pattern in the list, tag it with the
predefined tag for that pattern.
Has this been done, or is it available somewhere?
I've not written any UDFs, and although I'm not against doing so, I
probably don't have the time to write one at this point.

If this isn't available somewhere I can work around this roadblock, but
it would be awesome if someone has cooked up this functionality
somewhere.
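
For reference, the tagging logic described above is small. What follows is
a minimal sketch in plain Python, not a packaged Pig UDF. It assumes a
hand-maintained list of (pattern, tag) pairs and that the first matching
pattern wins. The names PATTERN_TAGS and tag_record, and the sample
patterns themselves, are hypothetical.

```python
import re

# Hypothetical (pattern, tag) pairs; replace with your own list.
# Order matters: the first pattern that matches decides the tag.
PATTERN_TAGS = [
    (r"^how (do|can) i\b", "howto"),
    (r"\bweather\b", "weather"),
    (r"^\d{3}-\d{4}$", "phone"),
]

# Compile once so each record only pays for matching, not compilation.
_COMPILED = [(re.compile(p, re.IGNORECASE), tag) for p, tag in PATTERN_TAGS]


def tag_record(text, default="untagged"):
    """Return the tag of the first pattern found in text, else default."""
    if text is None:
        return default
    for regex, tag in _COMPILED:
        # search() matches anywhere in the string; anchor a pattern with
        # ^ and $ if it must cover the whole value.
        if regex.search(text):
            return tag
    return default
```

If your Pig version supports Python (Jython) UDFs, a function of this shape
could probably be registered and called from a FOREACH, but that depends on
your setup and is not verified here.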

-----Original Message-----
From: Anze [mailto:anzenews@volja.net] 
Sent: Monday, December 06, 2010 3:09 PM
To: user@pig.apache.org
Subject: Re: Easy question...difference between this::form and this.form?


Sorry to hijack your question, Jonathan, but while we are at it... :) 

Is there a way to tell Pig NOT to add "base_alias::"? Almost half my code
consists of FOREACH... GENERATE statements that just remove these prefixes.

Thanks,

Anze

On Monday 06 December 2010, Daniel Dai wrote:
> After join, cross, foreach flatten, Pig will automatically add
> "base_alias::" prefix. All other cases use "."
> 
> Daniel
> 
> Jonathan Coveney wrote:
> > It's very hard to search for this among the docs because it's so generic,
> > so I thought I'd ask... I'm sure the answer is painfully easy.
> > 
> > Take this code that I found online, for example:
> > 
> > --
> > -- Read in a bag of tuples (timeseries for this example) and divide the
> > -- numeric column by its maximum.
> > --
> > %default DATABAG 'data/timeseries.tsv'
> > 
> > data       = LOAD '$DATABAG' AS (month:chararray, count:int);
> > accumulate = GROUP data ALL;
> > calc_max   = FOREACH accumulate GENERATE FLATTEN(data),
> >              MAX(data.count) AS max_count;
> > normalize  = FOREACH calc_max GENERATE data::month AS month,
> >              data::count AS count,
> >              (float)data::count / (float)max_count AS normed_count;
> > DUMP normalize;
> > 
> > What purpose does data::month serve versus data.count?
> > 
> > Thanks

