incubator-chukwa-user mailing list archives

From Bill Graham <>
Subject Re: Chukwa Pig Data Passthrough
Date Mon, 04 Oct 2010 22:24:28 GMT
Do you want to split on the Chukwa payload fields or on the fields in
the record body?

I have scripts that do similar things with the body using FILTER and a
custom TOKENIZE UDF I wrote to tokenize the body content. I'm using
the latest ChukwaLoader for Pig 0.7.0, but the previous one should
work the same way.

define chukwaLoader org.apache.hadoop.chukwa.pig.ChukwaLoader();
define tokenize     my.udfs.TOKENIZE();

raw = LOAD '/your/path' USING chukwaLoader AS (ts: long, fields);
-- timePeriod is not a Pig builtin; presumably another custom UDF whose define is omitted here
bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') AS tokens,
                              timePeriod(ts) AS time;

bodies_this_period = FILTER bodies BY ((chararray)time == '[some timestamp]');

STORE bodies_this_period INTO '/some/output/path';

From bodies_this_period you can access the different tokens using
$0.token0, $0.token1, etc...

I wrote TOKENIZE to return an ordered tuple of the values found, since
Pig's TOKENIZE returns an unordered bag, which isn't that useful in
this case.
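
For illustration only, here is a rough sketch of the ordering idea in plain Java, with no Pig dependency. It is not the actual TOKENIZE UDF from this thread: the class name OrderedTokenizer and the choice of whitespace as the delimiter are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the actual TOKENIZE UDF discussed above.
class OrderedTokenizer {

    // Splits a record body on runs of whitespace (an assumed delimiter) and
    // returns the tokens in the order they appear, so the i-th token in the
    // body is always at index i of the result.
    static List<String> tokenize(String body) {
        List<String> tokens = new ArrayList<>();
        if (body == null) {
            return tokens;
        }
        String trimmed = body.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        for (String token : trimmed.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("2010-10-04 22:24:28 INFO datanode started");
        // Positional access is stable because the original order is preserved.
        System.out.println(tokens.get(0) + " | " + tokens.get(2));
    }
}
```

In a real UDF the list would be copied into a Pig tuple rather than returned directly, which is what keeps each token addressable by position in the script.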


On Mon, Oct 4, 2010 at 2:35 PM, Jerome Boulon <> wrote:
> Hi Matt,
> When I designed this, the schema was NOT available in Pig. I’m not sure if
> this has changed or not.
> So I’m using the constructor as a way to get around the lack of schema
> definition but if you can get it now from the query & the storage handler
> then it should be a pretty easy thing to do.
> So do you know if the sql schema is now available in Pig?
> /Jerome.
> On 10/4/10 2:28 PM, "Matt Davies" <> wrote:
> Hey all-
> Trying to do some operations utilizing Chukwa and Pig.  Would like to
> basically
> 1. Read in the data from HDFS
> 2. Do some SPLIT operations
> 3. Write the various files out with all the fields as seen during the
> loading phase.
> So, my question is this - is there a way to utilize the
> org.apache.hadoop.chukwa.ChukwaStorage(); engine to load in and then store
> out all the various fields without having to individually define fields in
> the ChukwaStorage constructor?
> Thanks,
> Matt
