Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of zhiwei.uk@gmail.com
 designates 209.85.213.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <CBE018CC.3AB73%evans@yahoo-inc.com>
References: 
 <CADENB100tNfhvNwinBvXfsY_5=BjE7__9zALN88fdzRp7ygJDQ@mail.gmail.com>
	<CBE018CC.3AB73%evans@yahoo-inc.com>
Date: Tue, 22 May 2012 11:02:40 +0100
Message-ID: 
 <CADENB128b4sjAW2G=8B8quxcrZZEVRzDUsP+rxzaZVM_Xvy3og@mail.gmail.com>
Subject: Re: Stream data processing
From: Zhiwei Lin <zhiwei.uk@gmail.com>
To: common-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=20cf3056406b7832c904c09d1e43

--20cf3056406b7832c904c09d1e43
Content-Type: text/plain; charset=ISO-8859-1

Hi Robert,
Thank you.
> How quickly do you have to get the result out once the new data is added?
Ideally, I would like to get the result immediately.

> How far back in time do you have to look for BBBB from the occurrence of
> bbbb?
The time window is not constant. It depends on the most recent occurrence
of BBBB before bbbb, so I need to search back through the history to find
that last BBBB.

> Do you have to do this for all combinations of values or is it just a small
> subset of values?
I think this depends on when BBBB last occurred in the history. If BBBB
occurs only rarely, then the older data also has to be taken into account.

I do think HDFS is a good place to store the data I have (the daily log is
over 1 GB). But I am not sure whether Map/Reduce can help with the problem
described above.
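
A rough single-machine sketch of the lookup described above, in Python, just
to pin down the logic. Everything here is an assumption rather than an
existing package: the function names are hypothetical, the record layout is
guessed from the sample lines in the quoted message, and the "probability"
is taken to be the fraction of BBBB occurrences that are followed by a bbbb
before the next BBBB.

```python
import re
from datetime import datetime


def parse(line):
    """Split one record into (timestamp, tag). Layout assumed from the sample:
    'HH:MM:SS DD/Month/YYYY    TAG....'"""
    clock, date, rest = line.split(None, 2)
    ts = datetime.strptime(f"{clock} {date}", "%H:%M:%S %d/%B/%Y")
    tag = re.match(r"[A-Za-z]+", rest).group(0)  # leading letters only
    return ts, tag


def conditional_estimate(lines, cause="BBBB", effect="bbbb"):
    """Estimate how often `effect` follows `cause` in time order.

    Returns (fraction of causes followed by an effect, list of time gaps).
    """
    records = sorted(parse(l) for l in lines if l.strip())
    last_cause = None  # timestamp of the most recent unmatched cause
    causes = 0
    followed = 0
    gaps = []
    for ts, tag in records:
        if tag == cause:
            causes += 1
            last_cause = ts
        elif tag == effect and last_cause is not None:
            followed += 1
            gaps.append(ts - last_cause)
            last_cause = None  # count each cause at most once
    prob = followed / causes if causes else None
    return prob, gaps
```

Because only the timestamp of the last BBBB is kept, the pass does not need
to re-read the history for each bbbb, as long as the records are processed
in time order. Whether this carries over to Map/Reduce is exactly the open
question.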

Zhiwei


On 21 May 2012 22:07, Robert Evans <evans@yahoo-inc.com> wrote:

> Zhiwei,
>
> How quickly do you have to get the result out once the new data is added?
>  How far back in time do you have to look for BBBB from the occurrence of
> bbbb?  Do you have to do this for all combinations of values or is it just
> a small subset of values?
>
> --Bobby Evans
>
> On 5/21/12 3:01 PM, "Zhiwei Lin" <zhiwei.uk@gmail.com> wrote:
>
> I have a large volume of streaming log data. Each record contains a
> timestamp, which is very important to the analysis.
> For example, I have data format like this:
> (1) 20:30:21 01/April/2012    AAAAA.............
> (2) 20:30:51 01/April/2012    BBBB.............
> (3) 21:30:21 01/April/2012    bbbb.............
>
> Moreover, new data arrives every few minutes.
> I have to calculate the probability that "bbbb" occurs given that "BBBB"
> has occurred (where BBBB occurs earlier than bbbb). So, it is really
> time-dependent.
>
> I wonder whether Hadoop is the right platform for this job. Is there any
> package available for this kind of work?
>
> Thank you.
>
> Zhiwei
>
>


-- 

Best wishes.

Zhiwei

--20cf3056406b7832c904c09d1e43--