Date: Tue, 22 May 2012 11:02:40 +0100
Subject: Re: Stream data processing
From: Zhiwei Lin
To: common-user@hadoop.apache.org

Hi Robert,

Thank you.

> How quickly do you have to get the result out once the new data is added?

If possible, I would like to get the result instantly.

> How far back in time do you have to look for BBBB from the occurrence of
> bbbb?

The time window is not constant. It depends on the last occurrence of BBBB
before bbbb, so I need to search the history for the most recent BBBB.

> Do you have to do this for all combinations of values or is it just
> a small subset of values?

I think this depends on when BBBB last occurred in the history. If BBBB
occurs rarely, then the early data has to be taken into account.

I think HDFS is definitely a good place to store the data I have (the daily
log is above 1 GB). But I am not sure whether Map/Reduce can help with the
stated problem.

Zhiwei

On 21 May 2012 22:07, Robert Evans wrote:

> Zhiwei,
>
> How quickly do you have to get the result out once the new data is added?
> How far back in time do you have to look for BBBB from the occurrence of
> bbbb? Do you have to do this for all combinations of values or is it just
> a small subset of values?
>
> --Bobby Evans
>
> On 5/21/12 3:01 PM, "Zhiwei Lin" wrote:
>
> I have a large volume of stream log data. Each data record contains a time
> stamp, which is very important to the analysis.
> For example, I have data in a format like this:
> (1) 20:30:21 01/April/2012 AAAAA.............
> (2) 20:30:51 01/April/2012 BBBB.............
> (3) 21:30:21 01/April/2012 bbbb.............
>
> Moreover, new data comes every few minutes.
> I have to calculate the probability of the occurrence of "bbbb" given the
> occurrence of "BBBB" (where BBBB occurs earlier than bbbb). So, it is
> really time-dependent.
>
> I wonder if Hadoop is the right platform for this job? Is there any
> package available for this kind of work?
>
> Thank you.
>
> Zhiwei
>

--
Best wishes.

Zhiwei
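
To make the calculation discussed in this thread concrete, here is a minimal single-process Python sketch, not a Hadoop job. It rests on several assumptions that the thread does not confirm: the estimate is taken to be the fraction of BBBB records that are followed by a bbbb record before the next BBBB; records arrive in timestamp order; the "(1)", "(2)", "(3)" prefixes are list numbering rather than part of the data; and the class name `ConditionalCounter` and its methods are hypothetical.

```python
from datetime import datetime


class ConditionalCounter:
    """Streaming estimate of P(bbbb follows | BBBB occurred).

    Assumed definition (not confirmed by the thread): the fraction of
    BBBB records that are followed by a bbbb record before the next BBBB.
    """

    def __init__(self, cause="BBBB", effect="bbbb"):
        self.cause = cause
        self.effect = effect
        self.cause_count = 0         # number of BBBB records seen so far
        self.followed_count = 0      # BBBB records later matched by a bbbb
        self.last_cause_time = None  # timestamp of the latest unmatched BBBB

    def update(self, line):
        """Consume one record such as '20:30:51 01/April/2012 BBBB....'."""
        time_part, date_part, payload = line.split(maxsplit=2)
        stamp = datetime.strptime(
            f"{time_part} {date_part}", "%H:%M:%S %d/%B/%Y"
        )
        if payload.startswith(self.cause):
            self.cause_count += 1
            self.last_cause_time = stamp
        elif (
            payload.startswith(self.effect)
            and self.last_cause_time is not None
            and stamp > self.last_cause_time
        ):
            # Match this bbbb with the most recent BBBB, once only.
            self.followed_count += 1
            self.last_cause_time = None

    def probability(self):
        """Return the current estimate, or None before any BBBB is seen."""
        if self.cause_count == 0:
            return None
        return self.followed_count / self.cause_count


counter = ConditionalCounter()
for record in [
    "20:30:21 01/April/2012 AAAAA.............",
    "20:30:51 01/April/2012 BBBB.............",
    "21:30:21 01/April/2012 bbbb.............",
]:
    counter.update(record)
print(counter.probability())
```

Because the state is only two counters and one timestamp, each new batch of records can be folded in as it arrives without rescanning the history. Whether that matches the intended definition of the probability, and how to distribute it over data stored in HDFS, are open questions in the thread.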