incubator-chukwa-user mailing list archives

From Eric Yang <>
Subject Re: Using PIG for processing Chukwa files
Date Fri, 05 Feb 2010 17:56:45 GMT
Hi Vincent,

Let us know your findings.  Chukwa Storage is left open-ended because real-time
reports and offline reports differ greatly in how the data should be
processed.  In a perfect world, both processing models would be supported
by Chukwa.  However, Chukwa currently supports only offline reports (batch
processing).  I am hoping that real-time processing can happen once the
Chukwa team finds a solid random-access storage system.  So far, demand
for real-time reports has been decreasing in the Chukwa community.

I have always wondered whether it would be useful to load Chukwa data into
HBase, but I haven't spent enough time working on this.  Your experience with
HBase is a good indicator that I should probably hold off on HBase integration.
Thanks for sharing your experience.


On 2/5/10 12:14 AM, "Vincent Barat" <> wrote:

> Thank you very much for your detailed answer. I'm going to
> investigate this today. It seems that it can fit my needs:
> I am currently developing an analytics system that gathers logs from
> cellphones and does some computation on them using PIG. Currently, we
> use HBase to:
> - store our logs in big tables in a structured way (an equivalent to
> the ChukwaRecord)
> - allow several log writers to write to the same table at the same
> time (an equivalent to Chukwa's agents and collectors)
> - remove potentially duplicated logs (by using the HBase key)
> We face several issues with HBase:
> - we are unable to set up an HBase cluster that is reliable (HBase
> fails very often and we are obliged to restart all our region
> servers each time it happens)
> - HBase consumes too much memory compared to a simpler
> Hadoop/HDFS-only solution (this requires very expensive
> machines for our cluster nodes)
> - the HBase loader for PIG is way too slow (10x slower) compared to
> other PIG loaders (BinStorage or PigStorage). This forces us to first
> load data from HBase and write it to regular HDFS files using PIG
> before computing the statistics.
> So I am currently investigating alternative solutions to HBase that fit
> our needs.
> On 04/02/10 19:31, Corbin Hoenes wrote:
>> Vincent,
>> Yes, there is Pig support.  I am just learning how to use it, but with some
>> help from people on this list I have been successful in using Pig to analyze
>> Chukwa-collected logs.
>> In ${CHUKWA_HOME}/contrib/chukwa-pig/ you'll find a chukwa-pig.jar which
>> contains the ChukwaStorage loader.
>> Once you have that you can use it like this:
>> register /[your chukwa path here]/chukwa-core-0.3.0.jar
>> register /[your udf path here]/lib/chukwa-pig.jar
>> records = LOAD '$in_file' using org.apache.hadoop.chukwa.ChukwaStorage() as
>> (ts:long, fields);
>> named_records = FOREACH records GENERATE fields#'URI' as
>> uri,fields#'RECORD_TYPE' as type,fields#'CLIENT_IP_ADDRESS' as ip;
>> dump named_records;
>> Chukwa files are in sequence file format and contain "ChukwaRecord"s, which
>> are key/value pairs.  You can organize your data in the ChukwaRecords in a
>> custom format if needed by using a custom Processor for your data type.  The
>> example above shows a bunch of custom fields, like URI, that were parsed out
>> of the log files by a processor.  This can make it a bit easier for your pig
>> scripts to get data out.
>> On Feb 4, 2010, at 7:24 AM, Vincent Barat wrote:
>>> Hello,
>>> I'm currently evaluating Chukwa and I wonder if there is a way to use PIG
>>> to map/reduce the files produced by Chukwa?
>>> If yes, is there a special PIG loader to use?
>>> What is the format of Chukwa files? Is it just a concatenation of all logs
>>> sent by the agents?
>>> Thanks for your help.
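[Editorial note: the ChukwaRecord key/value model described in the thread above
(a timestamp plus a map of named fields, which the Pig script reads with
`fields#'URI'` and so on) can be sketched roughly as follows. This is a
hypothetical, self-contained illustration; the function names, record layout,
and sample data are assumptions for clarity, not Chukwa's actual Java API.]

```python
# Sketch of the ChukwaRecord model: a record is (timestamp, field-map).
# A custom Processor fills in named fields such as URI or RECORD_TYPE,
# which Pig later projects out with fields#'KEY'.

def make_record(ts, **fields):
    """Build a record as (timestamp, field-map), mirroring (ts:long, fields)."""
    return (ts, dict(fields))

def project(records, *keys):
    """Mimic the FOREACH ... GENERATE fields#'KEY' projection in the thread."""
    return [tuple(fields.get(k) for k in keys) for _ts, fields in records]

# Sample data standing in for parsed access-log records (assumed values).
records = [
    make_record(1265392605, URI="/index.html", RECORD_TYPE="AccessLog",
                CLIENT_IP_ADDRESS="10.0.0.1"),
    make_record(1265392610, URI="/about.html", RECORD_TYPE="AccessLog",
                CLIENT_IP_ADDRESS="10.0.0.2"),
]

named_records = project(records, "URI", "RECORD_TYPE", "CLIENT_IP_ADDRESS")
```

The point of the sketch is that the sequence file stores no fixed schema:
each record carries its own field map, so Pig scripts only name the keys they
need and ignore the rest.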
