hadoop-common-user mailing list archives

From Ricky Ho <rickyphyl...@yahoo.com>
Subject Real-time log processing in Hadoop
Date Mon, 06 Sep 2010 05:02:18 GMT
Can anyone share their experience doing real-time log processing using 
Chukwa/Scribe + Hadoop?

I am wondering how "real-time" this can be, given that Hadoop is designed for 
batch rather than stream processing:
1) The startup/teardown time of a Hadoop job typically takes minutes.
2) Data is typically stored in HDFS as large files, and it takes some time to 
accumulate that much data.
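To put a rough number on point 2, here is a back-of-envelope sketch of how long it takes to fill one HDFS block at various aggregate log rates. The 64 MB figure was the default block size in Hadoop releases of that era; the log rates are purely illustrative assumptions.

```python
# Time to accumulate one HDFS block (64 MB default) at assumed log rates.
# The rates below are hypothetical, not measurements from any real cluster.

BLOCK_SIZE_MB = 64

for rate_kb_per_sec in (50, 500, 5000):  # illustrative aggregate log rates
    seconds = BLOCK_SIZE_MB * 1024 / rate_kb_per_sec
    print(f"{rate_kb_per_sec:>5} KB/s -> {seconds / 60:6.1f} min to fill one block")
```

At a modest 50 KB/s aggregate rate this alone is over 20 minutes, which is why block-fill time dominates latency for low-volume log streams.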

All of this adds to Hadoop's latency, so I am wondering: what are the shortest 
latencies people are achieving when doing log processing in real life?

To my understanding, the Chukwa/Scribe model accumulates log records (from many 
machines) and writes them to HDFS (inside a directory).  After the logger 
switches to a new directory, the old one is ready for Map/Reduce processing, 
which then produces the result.
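The rotate-then-process model described above can be sketched as follows. This is a minimal illustration, not Chukwa or Scribe code: the directory naming scheme and the 5-minute rotation interval are assumptions for the example.

```python
import time
from pathlib import Path

# Minimal sketch of the rotate-then-process model: loggers write into a
# directory named after the current time window; once the window rolls
# over, the previous directory is closed and safe to hand to Map/Reduce.
# ROTATE_SECS and the path layout are illustrative assumptions.

ROTATE_SECS = 300  # hypothetical 5-minute rotation interval


def sink_dir(now: float, base: Path = Path("/logs")) -> Path:
    """Directory that log records land in for the window containing `now`."""
    window = int(now // ROTATE_SECS) * ROTATE_SECS
    return base / time.strftime("%Y%m%d-%H%M", time.gmtime(window))


def ready_for_mapreduce(now: float, base: Path = Path("/logs")) -> Path:
    """The previous window's directory: no longer being written, so a
    Map/Reduce job can safely consume it."""
    return sink_dir(now - ROTATE_SECS, base)
```

The key property is that a directory only becomes processable one full rotation interval after its first record arrived, so the rotation interval is a hard floor on end-to-end latency in this model.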

So the latency breaks down as:
a) Accumulate enough data to fill an HDFS block.
b) Write the block to HDFS.
c) Keep doing (b) until the criterion for switching to a new directory is met.
d) Start the Map/Reduce processing on the old directory.
e) Write the processed data to the output directory.
f) Load the output into a queryable form.
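Adding those steps up gives a rough end-to-end figure. Every number below is an assumption chosen for illustration; real values depend on log volume, the rotation policy, and job size.

```python
# Rough total for steps (a)-(f) above.  All per-step durations are
# illustrative assumptions, not measurements.

step_minutes = {
    "a) fill one HDFS block":          10,
    "b) write block to HDFS":           1,
    "c) repeat until directory rotates": 15,
    "d+e) Map/Reduce job over old dir": 10,
    "f) load results into query store":  5,
}

total = sum(step_minutes.values())
print(f"estimated end-to-end latency: ~{total} min")
```

With these assumed values the total lands around 40 minutes, which is consistent with the 30-minute-to-1-hour ballpark suggested below.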

I think the above can easily add up to 30 minutes or an hour.  Is this 
ballpark in line with the real-life projects that you have done?


