hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ricky Ho <...@adobe.com>
Subject RE: Stream Processing and Hadoop
Date Sat, 07 Nov 2009 05:01:00 GMT
Thanks !

I think this is exactly what I am looking for.
I guess the naïve implementation described in the paper is good enough.  In fact, here is
a simple prototype I've done before doing the stream-base map/reduce.



-----Original Message-----
From: Amandeep Khurana [mailto:amansk@gmail.com] 
Sent: Thursday, November 05, 2009 4:43 PM
To: common-user@hadoop.apache.org
Subject: Re: Stream Processing and Hadoop

There is a paper on this:

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Thu, Nov 5, 2009 at 4:33 PM, Ricky Ho <rho@adobe.com> wrote:

> I think the current form of Hadoop is not designed for stream-based
> processing where data is continuously stream-in and immediate processing
> (low latency) is required.  Please correct me if I am wrong.
> The main reason is because Reduce phase cannot be started until the Map
> phase is complete.  This mandates the data stream to be broken into chunks
> and processing is conducted in a batch-oriented fashion.
> But why can't we just remove the constraint and let Reduce starts before
> Map is complete.  What do we lost ?  Yes, there are something we'll lost ...
> 1) Keys arrived in the same reduce task is sorted.  If we start Reduce
> processing before all the data arrives, we cannot maintain the sort order
> anymore because data hasn't arrived yet.
> 2) If the Map process crashes in the middle of processing an input file, we
> don't know where to resume the processing.  If the Reduce process crashes,
> the result data can be lost as well.
> But most of the stream-processing analytic application doesn't require the
> above.  If my reduce function is commutative and associative, then I can
> perform incremental reduce as the data stream-in.
> Imagine a large social network site that is run on a server farm.  And each
> server has an agent process to track user behavior (what items is being
> searched, what photo is uploaded ... etc) across all the servers.
> Lets say the social site want to analyze these user activity which comes in
> as data streams from many servers.  So I want each server running a Map
> process that emit the user key (or product key) to a group of reducers which
> compute the analytics.
> Isn't this kind of processing can be run in Map/Reduce without the need for
> the Reduce to wait for the Map to be finished ?
> Does it make sense ?  Am I missing something important ?
> Rgds,
> Ricky

View raw message