hadoop-common-user mailing list archives

From Steve Hoffman <ste...@goofy.net>
Subject Re: Real-time log processing in Hadoop
Date Thu, 16 Sep 2010 20:56:50 GMT
I gave a talk on this recently at the Chicago Hadoop Users Group.
Our volume is high enough that the "chunks" of data get written into
HDFS with only a few minutes of delay.
This is close enough to "real time" for our initial needs, but we are
quickly evaluating Flume, etc. as a collection mechanism to replace
our syslog-ng+perl voodoo.

Slides are here if you are interested:  http://bit.ly/hoffmanchug20100915
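To make the Flume idea concrete, here is a rough sketch of what such a
collection agent could look like in a Flume agent properties file. All
of the names (agent "a1", the port, the HDFS path) are invented for
illustration, and the exact property names depend on the Flume release
you end up on -- treat this as a shape, not a drop-in config:

```
# Hypothetical agent "a1": syslog in, HDFS out.
a1.sources = syslog-in
a1.channels = mem
a1.sinks = hdfs-out

# Listen for syslog traffic (replacing the syslog-ng+perl pipeline).
a1.sources.syslog-in.type = syslogtcp
a1.sources.syslog-in.host = 0.0.0.0
a1.sources.syslog-in.port = 5140
a1.sources.syslog-in.channels = mem

a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000

# Roll HDFS files on size rather than a short timer, so we do not
# produce a new tiny file every few seconds.
a1.sinks.hdfs-out.type = hdfs
a1.sinks.hdfs-out.channel = mem
a1.sinks.hdfs-out.hdfs.path = /logs/%Y-%m-%d
a1.sinks.hdfs-out.hdfs.rollInterval = 0
a1.sinks.hdfs-out.hdfs.rollSize = 67108864
a1.sinks.hdfs-out.hdfs.rollCount = 0
```

The point of the roll settings is exactly the small-files trade-off
discussed below: roll on size and you approach full blocks, roll on a
short interval and you get many small files to clean up later.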

In the end you have the reality that files written into HDFS cannot be
appended to.
So if you go for "quick" you get lots of small files, which is wasteful
for Hadoop (writing 2K into a 64MB block).
As I said, our log volume is sufficient that we fill a 64MB block
quickly enough.  But on a lower-volume logging site you could always
balance this creation of small files at small intervals with
hourly/nightly M/R jobs that roll the small files into a single larger
file.  For instance, initial playing with Flume writing into HDFS
created a new file every 15 seconds.  Clearly we'd want to combine
these after some time and remove the small pieces.  I'm not sure if
this is automatic/baked in or not.  My guess is not yet.
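The roll-up job itself is conceptually simple. Here is a minimal local
sketch of the idea (plain Python on the local filesystem, not HDFS --
the paths and file names are invented for illustration): concatenate
the small chunk files into one larger file, then delete the pieces.

```python
import os
import tempfile

def compact(chunk_dir: str, output_path: str) -> int:
    """Concatenate every small chunk file in chunk_dir into one
    larger file at output_path, then delete the chunks.
    Returns the number of chunks combined."""
    chunks = sorted(
        os.path.join(chunk_dir, name)
        for name in os.listdir(chunk_dir)
    )
    with open(output_path, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as f:
                out.write(f.read())
            os.remove(chunk)  # drop the small piece once it is rolled up
    return len(chunks)

# Demo: simulate a few 15-second chunks, then roll them up.
workdir = tempfile.mkdtemp()
chunk_dir = os.path.join(workdir, "chunks")
os.makedirs(chunk_dir)
for i in range(4):
    with open(os.path.join(chunk_dir, f"chunk-{i:04d}.log"), "w") as f:
        f.write(f"log line {i}\n")

combined = os.path.join(workdir, "combined.log")
n = compact(chunk_dir, combined)
print(n)  # 4 chunks rolled into one file
```

In Hadoop terms the same effect comes from an identity M/R job (or a
`hadoop fs -cat`/`-put` pipeline) run hourly or nightly over each
closed directory.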

Hope this helps,

On Mon, Sep 6, 2010 at 12:02 AM, Ricky Ho <rickyphyllis@yahoo.com> wrote:
> Can anyone share their experience in doing real-time log processing using
> Chukwa/Scribe + Hadoop ?
> I am wondering how "real-time" this can be, given that Hadoop is designed for
> batch rather than stream processing ....
> 1) The startup/teardown time of running Hadoop jobs typically takes minutes
> 2) Data is typically stored in HDFS as large files, and it takes some time to
> accumulate data to that size.
> All of these add up to the latencies of Hadoop.  So I am wondering what the
> shortest latencies are that people achieve doing log processing in real life.
> To my understanding, the Chukwa/Scribe model accumulates log requests (from many
> machines) and writes them to HDFS (inside a directory).  After the logger switches
> to a new directory, the old one is ready for Map/Reduce processing, which then
> produces the result.
> So the latency is ...
> a) Accumulate enough data to fill an HDFS block size
> b) Write the block to HDFS
> c) Keep doing (b) until the criteria of switching to a new directory is met
> d) Start the Map/Reduce processing in the old directory
> e) Write the processed data to the output directory
> f) Load the output into a queryable form.
> I think the above can easily be a 30-minute or 1-hour duration.  Is this
> ballpark in line with the real-life projects that you have done?
> Rgds,
> Ricky
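Ricky's steps (a)-(e) can be roughed out with a back-of-envelope
calculation. Every number below is invented purely for illustration
(the logging rate, rotation criterion, and M/R time will differ wildly
between sites), but it shows how the pieces add up to his 30-to-60
minute ballpark:

```python
# Back-of-envelope latency for the accumulate-then-batch pipeline.
# All rates and counts below are assumptions for illustration only.
BLOCK_SIZE_MB = 64
log_rate_mb_per_min = 16         # assumed aggregate logging rate
blocks_per_dir_switch = 10       # assumed directory-rotation criterion
mr_job_minutes = 5               # assumed M/R start-up + processing time

fill_one_block = BLOCK_SIZE_MB / log_rate_mb_per_min      # step (a)
fill_directory = fill_one_block * blocks_per_dir_switch   # steps (b)-(c)
total = fill_directory + mr_job_minutes                   # through step (e)
print(total)  # 45.0 minutes before results are queryable
```

The directory-rotation term dominates, which is why shrinking the
rotation interval (and accepting smaller files) is the main lever for
reducing end-to-end latency in this model.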
