hadoop-common-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Parallel data stream processing
Date Sat, 10 Oct 2009 08:01:00 GMT
On Fri, Oct 9, 2009 at 11:02 PM, Ricky Ho <rho@adobe.com> wrote:

> ... To my understanding, within a Map/Reduce cycle, the input data set is
> "frozen" (no change is allowed) while the output data set is "created from
> scratch" (it doesn't exist beforehand).  Therefore, the map/reduce model is
> inherently "batch-oriented".  Am I right?

Current implementations are definitely batch oriented.  Keep reading.

> I am thinking whether Hadoop is usable in processing many data streams in
> parallel.


> For example, think about an e-commerce site which captures users' product
> searches in many log files, and wants to run some analytics on the log
> files in real time.

Or consider Yahoo running their ad inventories in real-time.

> One naïve way is to chunkify the log and perform Map/Reduce in small
> batches.  Since the input data file must be frozen, we need to switch
> subsequent writes to a new logfile.

Which is not a big deal.  Moreover, these small chunks can be merged every
so often.

> However, the chunking approach is not good because the cutoff point is
> quite arbitrary.  Imagine if I want to calculate the popularity of a product
> based on the frequency of searches within last 2 hours (a sliding time
> window).  I don't think Hadoop can do this computation.

Subject to a moderate delay of 5-20 minutes, this is no problem at all for
Hadoop.  This is especially true if you are doing straightforward
aggregations that are associative and commutative.
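To make the point concrete (a plain-Java sketch, not Hadoop API code): when
the aggregation is something like a per-product search count, partial
results from different log chunks can be merged in any order and any
grouping and still give the same totals.  That is what lets you compute on
small chunks as they arrive and combine later.

```java
import java.util.HashMap;
import java.util.Map;

public class MergeCounts {
    // Merge two partial search-count aggregates. Because addition is
    // associative and commutative, partials from any set of log chunks
    // can be combined in any order and still yield the same totals.
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> chunk1 = Map.of("camera", 3L, "phone", 1L);
        Map<String, Long> chunk2 = Map.of("camera", 2L);
        // merge(chunk1, chunk2) and merge(chunk2, chunk1) agree.
        System.out.println(merge(chunk1, chunk2).get("camera"));
    }
}
```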

> Of course, if we don't mind a distorted picture, we can use a jumping
> window (1-3 PM, 3-5 PM ...) instead of a sliding window, then maybe OK.  But
> this is still not good, because we have to wait for two hours before getting
> the new batch of result.  (e.g. At 4:59 PM, we only have the result in the
> 1-3 PM batch)

Or just process each 10 minute period into aggregate form.  Then add up the
latest 12 aggregates.  Every day, merge all the small files for the day and
every month merge all the daily files.
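The "add up the latest 12 aggregates" step is just a ring of per-period
totals (a hypothetical sketch in plain Java; a real job would read the
10-minute aggregate files from HDFS):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindow {
    // Keep the latest 12 ten-minute bucket totals; their sum approximates
    // a 2-hour sliding window, refreshed every 10 minutes.
    private final Deque<Long> buckets = new ArrayDeque<>();
    private long windowTotal = 0;

    void addBucket(long count) {
        buckets.addLast(count);
        windowTotal += count;
        if (buckets.size() > 12) {
            // Oldest period falls out of the 2-hour window.
            windowTotal -= buckets.removeFirst();
        }
    }

    long total() { return windowTotal; }

    public static void main(String[] args) {
        SlidingWindow w = new SlidingWindow();
        for (int i = 0; i < 13; i++) w.addBucket(10);
        System.out.println(w.total()); // 13th bucket evicts the 1st
    }
}
```

The same idea nests: daily files are merges of the 10-minute files, monthly
files are merges of the dailies.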

There are very few businesses where a 10 minute delay is a big problem.

> It doesn't seem like Hadoop is good at handling this kind of processing:
>  "Parallel processing of multiple real time data stream processing".  Anyone
> disagree ?

It isn't entirely natural, but it isn't a problem.

> I'm wondering if a "mapper-only" model would work better.  In this case,
> there is no reducer (ie: no grouping).  Each map task keeps a history (ie:
> sliding window) of data that it has seen and then writes the result to the
> output file.

This doesn't scale at all well.

Take a look at the Chukwa project for a well worked example of how to
process logs in near real-time with Hadoop.
