hadoop-common-user mailing list archives

From Jeff Zhang <zjf...@gmail.com>
Subject Re: Parallel data stream processing
Date Sat, 10 Oct 2009 08:50:33 GMT
I suggest you use Pig to handle your problem.  Pig is a sub-project of
Hadoop.
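
For example, through the PigServer Java API (a rough sketch; the log path
and the tab-separated (timestamp, user, product) layout are my
assumptions):

  import java.io.IOException;
  import org.apache.pig.ExecType;
  import org.apache.pig.PigServer;

  public class ProductSearchCount {
    public static void main(String[] args) throws IOException {
      // Run Pig Latin queries on the Hadoop cluster (map/reduce mode).
      PigServer pig = new PigServer(ExecType.MAPREDUCE);
      pig.registerQuery("logs = LOAD '/logs/search' "
          + "AS (ts:long, user:chararray, product:chararray);");
      pig.registerQuery("grouped = GROUP logs BY product;");
      pig.registerQuery(
          "counts = FOREACH grouped GENERATE group, COUNT(logs);");
      // Write the (product, frequency) pairs back to HDFS.
      pig.store("counts", "/output/search_counts");
    }
  }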

And you do not need to worry about the boundary problem; Hadoop actually
handles that for you.

The InputFormat splits the data, and the RecordReader guarantees record
boundaries.
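
For a plain Java job, TextInputFormat already does this: its
LineRecordReader skips the first partial line of each split and reads past
the end of the split to finish the last line, so no record is ever cut in
half.  A minimal sketch with the 0.20 API (the paths and the field layout
are placeholders):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class SearchLogJob {
    // map() is handed one complete log line at a time, no matter where
    // HDFS happened to split the file into blocks.
    public static class SearchMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length > 2) {
          context.write(new Text(fields[2]), ONE);  // field 3 = product
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "search-log-count");
      job.setJarByClass(SearchLogJob.class);
      job.setInputFormatClass(TextInputFormat.class);  // LineRecordReader
      job.setMapperClass(SearchMapper.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path("/logs/search"));
      FileOutputFormat.setOutputPath(job, new Path("/output/search_counts"));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }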


Jeff Zhang


On Sat, Oct 10, 2009 at 2:02 PM, Ricky Ho <rho@adobe.com> wrote:

> I'd like to get some Hadoop experts to verify my understanding ...
>
> To my understanding, within a Map/Reduce cycle, the input data set is
> "frozen" (no changes are allowed) while the output data set is "created
> from scratch" (it doesn't exist beforehand).  Therefore, the map/reduce
> model is inherently "batch-oriented".  Am I right?
>
> I am wondering whether Hadoop is usable for processing many data streams
> in parallel.  For example, think of an e-commerce site which captures
> users' product searches in many log files and wants to run some analytics
> on those log files in real time.
>
> One naïve way is to chunk the log and perform Map/Reduce in small
> batches.  Since the input data file must be frozen, we need to switch
> subsequent writes to a new logfile.  However, the chunking approach is
> not good because the cutoff point is quite arbitrary.  Imagine I want to
> calculate the popularity of a product based on the frequency of searches
> within the last 2 hours (a sliding time window).  I don't think Hadoop
> can do this computation.
>
> Of course, if we don't mind a distorted picture, we can use a jumping
> window (1-3 PM, 3-5 PM, ...) instead of a sliding window; that might be
> OK.  But this is still not good, because we have to wait two hours before
> getting the new batch of results.  (E.g., at 4:59 PM we only have the
> results from the 1-3 PM batch.)
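>
> The closest workaround I can think of is to approximate the sliding
> window with small buckets: run a small Map/Reduce job every 10 minutes
> over the newest chunk, then sum the last 12 bucket counts outside Hadoop.
> A hypothetical sketch of the summing side (the per-bucket counts are
> assumed to come from those small jobs):
>
>   import java.util.ArrayDeque;
>   import java.util.Deque;
>
>   // Approximates a 2-hour sliding window with twelve 10-minute buckets.
>   public class SlidingWindowCount {
>     private final Deque<Long> buckets = new ArrayDeque<Long>();
>
>     // Called every 10 minutes with the count from the latest batch job.
>     public long addBucket(long latestCount) {
>       buckets.addLast(latestCount);
>       if (buckets.size() > 12) {
>         buckets.removeFirst();  // drop counts older than 2 hours
>       }
>       long total = 0;
>       for (long count : buckets) {
>         total += count;
>       }
>       return total;  // frequency over (roughly) the last 2 hours
>     }
>   }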
>
> It doesn't seem like Hadoop is good at handling this kind of processing:
> "parallel processing of multiple real-time data streams".  Does anyone
> disagree?  The term "Hadoop streaming" is confusing here, because it
> means something completely different (i.e., using stdin and stdout as the
> input and output of the tasks).
>
> I'm wondering if a "mapper-only" model would work better.  In this case
> there is no reducer (i.e., no grouping).  Each map task keeps a history
> (i.e., a sliding window) of the data it has seen and then writes the
> result to the output file.
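>
> Something like this hypothetical mapper (with job.setNumReduceTasks(0)
> the shuffle and reduce phases are skipped and each mapper writes straight
> to the output; the window logic is only sketched):
>
>   import java.io.IOException;
>   import java.util.ArrayDeque;
>   import java.util.Deque;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.NullWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Mapper;
>
>   public class WindowMapper
>       extends Mapper<LongWritable, Text, Text, NullWritable> {
>     // History of log lines seen by this one task (its sliding window).
>     private final Deque<String> window = new ArrayDeque<String>();
>
>     protected void map(LongWritable offset, Text line, Context context)
>         throws IOException, InterruptedException {
>       window.addLast(line.toString());
>       // ... evict entries older than 2 hours, compute the statistic ...
>       context.write(new Text("window size: " + window.size()),
>           NullWritable.get());
>     }
>   }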
>
> I heard about the "append" mode of HDFS but don't quite get it.  Does it
> simply mean a writer can write to the end of an existing HDFS file?  Or
> does it mean a reader can read while a writer is appending to the same
> HDFS file?  Is this "append" feature helpful in my situation?
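>
> In other words, would the writer side look roughly like this?
> (FileSystem.append() is the call I have in mind; I haven't tried it.)
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class AppendToLog {
>     public static void main(String[] args) throws Exception {
>       FileSystem fs = FileSystem.get(new Configuration());
>       // Open an existing HDFS file for append and add one more record.
>       FSDataOutputStream out =
>           fs.append(new Path("/logs/search/current.log"));
>       out.writeBytes("1255160000\tuser42\twidget\n");
>       out.close();
>     }
>   }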
>
> Rgds,
> Ricky
>
