hadoop-common-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: Real-time log processing in Hadoop
Date Tue, 07 Sep 2010 05:50:39 GMT
We're using Chukwa to do steps a-d before writing summary data into MySQL.
Data is written into new directories every 5 minutes. Our MR jobs and the data
load into MySQL take < 5 minutes, so after a 5-minute window closes, we
typically have summary data from that interval in MySQL a few minutes later.
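
In case it helps, the shape of our driver is roughly the sketch below. The
class names, paths and the MySQL table are simplified placeholders rather than
our actual code; it just shows the "run MR over the closed 5-minute window,
then load the summary into MySQL" pattern:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WindowSummaryDriver {

  // Toy summary job: counts occurrences of the first tab-separated field.
  public static class SummaryMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(line.toString().split("\t")[0]), ONE);
    }
  }

  public static class SummaryReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) sum += v.get();
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    String window = args[0];                      // e.g. "2010-09-07_0545"
    Configuration conf = new Configuration();

    // 1. Summarize the 5-minute window that just closed.
    Job job = new Job(conf, "log-summary-" + window);
    job.setJarByClass(WindowSummaryDriver.class);
    job.setMapperClass(SummaryMapper.class);
    job.setReducerClass(SummaryReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path("/chukwa/logs/" + window));
    Path out = new Path("/summaries/" + window);
    FileOutputFormat.setOutputPath(job, out);
    if (!job.waitForCompletion(true)) System.exit(1);

    // 2. Load the summary rows into MySQL so they are queryable right away.
    Class.forName("com.mysql.jdbc.Driver");
    Connection db = DriverManager.getConnection(
        "jdbc:mysql://dbhost/metrics", "user", "pass");
    PreparedStatement insert = db.prepareStatement(
        "INSERT INTO summary_5min (win, metric, cnt) VALUES (?, ?, ?)");
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus part : fs.listStatus(out)) {
      if (!part.getPath().getName().startsWith("part-")) continue;
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(part.getPath())));
      String line;
      while ((line = in.readLine()) != null) {
        String[] kv = line.split("\t");           // TextOutputFormat default
        insert.setString(1, window);
        insert.setString(2, kv[0]);
        insert.setLong(3, Long.parseLong(kv[1]));
        insert.executeUpdate();
      }
      in.close();
    }
    db.close();
  }
}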

But as Ranjib points out, how fast you can process your data depends on both
cluster size and data rate.

thanks,
Bill

On Sun, Sep 5, 2010 at 10:42 PM, Ranjib Dey <ranjibd@thoughtworks.com> wrote:

> We are using Hadoop for log crunching, and the mined data feeds one of our
> apps. It's not exactly real time: the app is basically a mail responder that
> provides certain services when an e-mail (in a prescribed format) is sent to
> it (app@xxx.com). We have been able to bring the response time down to 30
> minutes. This includes automated Hadoop job submission, processing of the
> output, and intermediate status notifications. From our experience we have
> learned that the overall response time depends on your data size, the
> strength of your Hadoop cluster, etc. And you need to do performance
> optimization at each level as required, from JVM tuning (different tuning on
> the name node vs. the data nodes) to app-level code refactoring (like using
> HAR on HDFS for smaller files, etc.).
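>
> For illustration, a rough sketch of what that kind of automated submission
> with intermediate status notifications can look like (the class name and the
> notification hook are placeholders, not our actual code):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.mapreduce.Job;
>
>   public class ResponderDriver {
>     public static void main(String[] args) throws Exception {
>       Job job = new Job(new Configuration(), "mail-responder-crunch");
>       job.setJarByClass(ResponderDriver.class);
>       // ... mapper / reducer / input / output setup omitted ...
>       job.submit();                      // returns immediately
>       sendStatus("job " + job.getJobID() + " submitted");
>       while (!job.isComplete()) {        // intermediate status every minute
>         sendStatus(String.format("map %.0f%%, reduce %.0f%%",
>             job.mapProgress() * 100, job.reduceProgress() * 100));
>         Thread.sleep(60 * 1000);
>       }
>       sendStatus(job.isSuccessful() ? "crunch done" : "crunch failed");
>     }
>     // Placeholder: in our setup this sends the status mail.
>     static void sendStatus(String msg) { System.out.println(msg); }
>   }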
>
> regards
> ranjib
>
> On Mon, Sep 6, 2010 at 10:32 AM, Ricky Ho <rickyphyllis@yahoo.com> wrote:
>
> > Can anyone share their experience doing real-time log processing with
> > Chukwa/Scribe + Hadoop?
> >
> > I am wondering how "real-time" this can be, given that Hadoop is designed
> > for batch rather than stream processing:
> > 1) The startup/teardown time of a Hadoop job typically takes minutes.
> > 2) Data is typically stored in HDFS as large files, and it takes some time
> > to accumulate data to that size.
> >
> > All of these add to the latency. So I am wondering: what are the shortest
> > latencies people are achieving for log processing in real life?
> >
> > To my understanding, the Chukwa/Scribe model accumulates log entries (from
> > many machines) and writes them to HDFS (inside a directory). After the
> > logger switches to a new directory, the old one is ready for Map/Reduce
> > processing, which then produces the result.
> >
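> > For concreteness, here is a toy sketch of the collector behaviour I mean:
> > append records to a file under an "in progress" directory and roll to a
> > new directory every 5 minutes. This is not Chukwa's actual code; the
> > paths, interval and record source are made up.
> >
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.fs.FSDataOutputStream;
> >   import org.apache.hadoop.fs.FileSystem;
> >   import org.apache.hadoop.fs.Path;
> >
> >   public class RollingCollector {
> >     public static void main(String[] args) throws Exception {
> >       FileSystem fs = FileSystem.get(new Configuration());
> >       long interval = 5 * 60 * 1000L;         // directory-switch interval
> >       while (true) {
> >         long bucket = System.currentTimeMillis() / interval;
> >         Path dir = new Path("/logs/in-progress/" + bucket);
> >         FSDataOutputStream out = fs.create(new Path(dir, "collector-1.log"));
> >         while (System.currentTimeMillis() / interval == bucket) {
> >           out.write(nextLogRecord());         // placeholder log source
> >         }
> >         out.close();
> >         // Renaming marks the bucket as closed; MR can pick it up (step d).
> >         fs.rename(dir, new Path("/logs/ready/" + bucket));
> >       }
> >     }
> >     static byte[] nextLogRecord() { return "record\n".getBytes(); }
> >   }
> >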
> > So the latency is ...
> > a) Accumulate enough data to fill an HDFS block
> > b) Write the block to HDFS
> > c) Keep doing (b) until the criterion for switching to a new directory is
> > met
> > d) Start the Map/Reduce processing on the old directory
> > e) Write the processed data to the output directory
> > f) Load the output into a queryable form
> >
> > I think the above can easily add up to 30 minutes or an hour. Is this
> > ballpark in line with the real-life projects you have done?
> >
> > Rgds,
> > Ricky
> >
>
