hadoop-common-user mailing list archives

From Zhiwei Lin <zhiwei...@gmail.com>
Subject Re: Stream data processing
Date Tue, 22 May 2012 13:58:44 GMT
Hi Bobby,

Thank you. Great help.

Zhiwei

On 22 May 2012 14:52, Robert Evans <evans@yahoo-inc.com> wrote:

> If you want the results to come out instantly, Map/Reduce is not the proper
> choice.  Map/Reduce is designed for batch processing.  It can do small
> batches, but the overhead of launching the map/reduce jobs can be very high
> compared to the amount of processing you are doing.  I personally would
> look into using either Storm, S4, or some other real-time stream processing
> framework.  From what you have said it sounds like you probably want to use
> Storm, as it can be used to guarantee that each event is processed once and
> only once.  You can also store your results into HDFS if you want, perhaps
> through HBase, if you need to do further processing on the data.
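For anyone picking up the Storm suggestion, the wiring might look roughly like
the sketch below. It assumes Storm's 0.x Java API (the backtype.storm
packages); LogLineSpout and ParseBolt are hypothetical names, not classes that
ship with Storm, and a real spout would read from whatever transport actually
carries the log stream. The BBBB/bbbb bookkeeping itself is left to a
downstream bolt; a framework-agnostic sketch of that step appears further down,
after the original problem statement.

// Rough sketch only, against Storm's 0.x Java API (backtype.storm packages).
// LogLineSpout and ParseBolt are hypothetical, not part of Storm itself.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class LogStreamTopology {

    // Hypothetical spout: emits one raw log line per tuple.  A real
    // implementation would pull records from the actual log transport.
    public static class LogLineSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // Placeholder record in the format shown in the original message.
            collector.emit(new Values("20:30:51 01/April/2012    BBBB ....."));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // Hypothetical bolt: splits a raw line into its timestamp and event token
    // and emits them; a downstream bolt (or this one) would do the BBBB/bbbb
    // bookkeeping and could write its running counts to HBase.
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String line = tuple.getString(0);
            String[] parts = line.trim().split("\\s+", 4);
            if (parts.length >= 3) {
                collector.emit(new Values(parts[0] + " " + parts[1], parts[2]));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("time", "event"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("log-spout", new LogLineSpout(), 1);
        builder.setBolt("parse-bolt", new ParseBolt(), 1)
               .shuffleGrouping("log-spout");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("log-analysis", new Config(),
                               builder.createTopology());
    }
}

With more than one parse-bolt instance, a key-based grouping (fieldsGrouping)
rather than shuffleGrouping would be needed wherever the BBBB/bbbb matching has
to stay in time order per key.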
>
> --Bobby Evans
>
> On 5/22/12 5:02 AM, "Zhiwei Lin" <zhiwei.uk@gmail.com> wrote:
>
> Hi Robert,
> Thank you.
> How quickly do you have to get the result out once the new data is added?
> If possible, I hope to get the result instantly.
>
> How far back in time do you have to look for BBBB from the occurrence of
> bbbb?
> The time slot is not constant. It depends on the "last" occurrence of BBBB
> before bbbb.  So, in this case I need to look back through the history to
> find the last BBBB.
>
> Do you have to do this for all combinations of values or is it just a small
> subset of values?
> I think this depends on the time of the last occurrence of BBBB in the
> history. If BBBB occurs rarely, then much earlier data has to be taken into
> account.
>
> Definitely, I think HDFS is a good place to store the data I have (the size
> of the daily log is above 1 GB). But I am not sure if Map/Reduce can help to
> handle the stated problem.
>
> Zhiwei
>
>
> On 21 May 2012 22:07, Robert Evans <evans@yahoo-inc.com> wrote:
>
> > Zhiwei,
> >
> > How quickly do you have to get the result out once the new data is added?
> >  How far back in time do you have to look for BBBB from the occurrence of
> > bbbb?  Do you have to do this for all combinations of values or is it
> > just a small subset of values?
> >
> > --Bobby Evans
> >
> > On 5/21/12 3:01 PM, "Zhiwei Lin" <zhiwei.uk@gmail.com> wrote:
> >
> > I have a large volume of streaming log data. Each data record contains a
> > timestamp, which is very important to the analysis.
> > For example, I have data format like this:
> > (1) 20:30:21 01/April/2012    AAAAA.............
> > (2) 20:30:51 01/April/2012    BBBB.............
> > (3) 21:30:21 01/April/2012    bbbb.............
> >
> > Moreover, new data comes every few minutes.
> > I have to calculate the probability of the occurrence of "bbbb" given the
> > occurrence of "BBBB" (where BBBB occurs earlier than bbbb). So, it is
> > really time-dependent.
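One concrete reading of that calculation (an assumed interpretation, since the
thread does not pin it down): treat each bbbb as belonging to the most recent
preceding BBBB, and estimate P(bbbb | an earlier BBBB) as the fraction of BBBB
occurrences that are eventually followed by a bbbb. The plain-Java sketch below
does that bookkeeping over records assumed to be already parsed into
(timestamp, event) pairs and sorted by time, for example when backfilling over
history kept in HDFS; the Record class and event tokens are simplifications,
not the real log format.

// Rough sketch under assumed simplifications: records are already parsed
// into (timestamp, event) pairs and sorted by timestamp; "BBBB" and "bbbb"
// are matched as exact event tokens.

import java.util.Arrays;
import java.util.List;

public class PairProbability {

    // Simplified record: epoch-millis timestamp plus the event token.
    public static class Record {
        final long timestamp;
        final String event;

        Record(long timestamp, String event) {
            this.timestamp = timestamp;
            this.event = event;
        }
    }

    // Fraction of BBBB occurrences later followed by a bbbb, attributing each
    // bbbb to the most recent preceding BBBB.
    public static double estimate(List<Record> records) {
        long bbbbSeen = 0;
        long bbbbFollowed = 0;
        boolean pendingBBBB = false;

        for (Record r : records) {
            if (r.event.equals("BBBB")) {
                bbbbSeen++;
                pendingBBBB = true;
            } else if (r.event.equals("bbbb") && pendingBBBB) {
                bbbbFollowed++;
                pendingBBBB = false;
            }
        }
        return bbbbSeen == 0 ? 0.0 : (double) bbbbFollowed / bbbbSeen;
    }

    public static void main(String[] args) {
        // The three sample records from the message, as (timestamp, event).
        List<Record> records = Arrays.asList(
                new Record(1L, "AAAAA"),
                new Record(2L, "BBBB"),
                new Record(3L, "bbbb"));
        System.out.println("P(bbbb | earlier BBBB) ~= " + estimate(records));
    }
}

In a streaming setting the same two counters and the pendingBBBB flag would
simply live as state inside the processing bolt, updated as each record
arrives.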
> >
> > I wonder if Hadoop is the right platform for this job? Is there any
> > package available for this kind of work?
> >
> > Thank you.
> >
> > Zhiwei
> >
> >
>
>
> --
>
> Best wishes.
>
> Zhiwei
>
>


-- 

Best wishes.

Zhiwei
