incubator-chukwa-user mailing list archives

From Vincent Barat
Subject Re: Using PIG for processing Chuckwa files
Date Fri, 05 Feb 2010 08:14:23 GMT
Thank you very much for your detailed answer. I'm going to 
investigate this today. It seems that it can fit my needs:

I am currently developing an analytics system that gathers logs from 
cellphones and runs computations on them using Pig. Currently, we 
use HBase to:

- store our logs in big tables in a structured way (an equivalent of 
the ChukwaRecord)
- allow several log writers to write to the same table at the same 
time (an equivalent of Chukwa's agents and collectors)
- remove potentially duplicated logs (by using the HBase key)

We face several issues with HBase:

- we are unable to set up a reliable HBase cluster (HBase fails very 
often, and each time we have to restart all our region servers)
- HBase consumes too much memory compared to a simpler 
Hadoop/HDFS-only solution (this forces us to use very expensive 
machines for our cluster nodes)
- the HBase loader for Pig is far too slow (about 10x slower) 
compared to other Pig loaders (BinStorage or PigStorage). This forces 
us to first load the data from HBase and write it to regular HDFS 
files using Pig before computing the statistics.

So I am currently investigating alternatives to HBase that fit 
our needs.
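
As a rough illustration of the deduplication requirement in an 
HDFS-only setup, a minimal Pig sketch could look like the one below. 
It assumes a hypothetical tab-separated layout with log_id, ts and 
payload columns and an $out_dir parameter; none of these names come 
from this thread. Note that DISTINCT only drops rows that are 
identical in every field, which is weaker than overwriting by HBase 
key.

```
-- Assumed layout (hypothetical): one tab-separated log entry per line
raw = LOAD '$in_file' USING PigStorage('\t')
      AS (log_id:chararray, ts:long, payload:chararray);

-- DISTINCT removes rows that are identical in every field
deduped = DISTINCT raw;

-- Write the deduplicated logs back to HDFS
STORE deduped INTO '$out_dir' USING PigStorage('\t');
```

The two parameters would be supplied on the command line with 
pig -param in_file=... -param out_dir=... followed by the script name.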

On 04/02/10 19:31, Corbin Hoenes wrote:
> Vincent,
> Yes there is Pig support.  I am just learning how to use it but with
> some help from people on this list have been successful in using Pig
> to analyze chukwa collected logs.
> In  ${CHUKWA_HOME}/contrib/chukwa-pig/ you'll have a chukwa-pig.jar
> which contains the ChukwaStorage loader.
> Once you have that you can use it like this:
> register /[your chukwa path here]/chukwa-core-0.3.0.jar
> register /[your udf path here]/lib/chukwa-pig.jar
> records = LOAD '$in_file' using org.apache.hadoop.chukwa.ChukwaStorage() as (ts:long,
> named_records = FOREACH records GENERATE fields#'URI' as uri,fields#'RECORD_TYPE' as type,fields#'CLIENT_IP_ADDRESS' as ip;
> dump named_records;
> Chukwa files are in sequence file format and use "ChukwaRecord"
> entries, which are key/value pairs.  You can organize your data in
> the ChukwaRecords in a custom format if needed by using a custom
> Processor for your data type.  The example above shows a bunch of
> custom fields like URI that were parsed out of the log files via a
> processor.  This can make it a bit easier for your Pig scripts to get
> data out.
> On Feb 4, 2010, at 7:24 AM, Vincent Barat wrote:
>> Hello,
>> I'm currently evaluating Chukwa and I wonder if there is a way to
>> use Pig to map/reduce the files produced by Chukwa?
>> If yes, is there a special Pig loader to use?
>> What is the format of Chukwa files? Is it just a concatenation of
>> all logs sent by the agents?
>> Thanks for your help.
