From: Aniket Alhat
To: user@storm.incubator.apache.org
Date: Fri, 7 Feb 2014 10:26:47 +0530
Subject: Re: How to efficiently store the intermediate result of a bolt, and so it can be replayed after the crashes?

I hope this helps:
https://github.com/pict2014/storm-redis

On Feb 7, 2014 12:07 AM, "Cheng-Kang Hsieh (Andy)" <changun@cs.ucla.edu> wrote:

> Sorry, I realized that my question was badly written. Simply put: is
> there a recommended way to store the tuples emitted by a bolt so that
> they can be replayed after a crash without repeating the processing all
> the way up from the source spout? Any advice would be appreciated.
> Thank you!
>
> Best,
> Andy
>
> On Tue, Feb 4, 2014 at 11:58 AM, Cheng-Kang Hsieh (Andy)
> <changun@cs.ucla.edu> wrote:
>
>> Hi all,
>>
>> First of all, thanks to Nathan and all the contributors for putting
>> together such a great framework! I am learning a lot, even just from
>> reading the discussion threads.
>>
>> I am building a topology that contains one spout along with a chain of
>> bolts (e.g. S -> A -> B, where S is the spout and A and B are bolts).
>>
>> When S emits a tuple, the next bolt A buffers the tuple in a DFS,
>> computes some aggregated values once it has received a sufficient
>> amount of data, and then emits the aggregation results to the next
>> bolt B.
>>
>> Here comes my question: is there a recommended way to store the
>> intermediate results emitted by a bolt, so that when a machine crashes,
>> the results can be replayed to the downstream bolts (i.e. bolt B)?
>>
>> One possible solution: don't keep any intermediate results, but rely on
>> Storm's ack framework, so that the raw data is replayed from spout S
>> when a crash happens.
>>
>> However, this approach might not be appropriate in my case. It can take
>> a pretty long time (a couple of hours) before bolt A has received all
>> the required data and emits the aggregated results, so it would be very
>> expensive for the ack framework to keep tracking that many tuples for
>> that long.
>>
>> An alternative solution: *making bolt A also a spout* and keeping the
>> emitted data in a DFS queue. When a result has been acked, bolt A
>> removes it from the queue.
>>
>> I am wondering whether it is reasonable to make a task both a bolt and
>> a spout at the same time, or whether there is a better approach.
>>
>> Thank you!
>>
>> --
>> Cheng-Kang Hsieh
>> UCLA Computer Science PhD Student
>> M: (310) 990-4297
>> A: 3770 Keystone Ave. Apt 402,
>>    Los Angeles, CA 90034
>
> --
> Cheng-Kang Hsieh
> UCLA Computer Science PhD Student
> M: (310) 990-4297
> A: 3770 Keystone Ave. Apt 402,
>    Los Angeles, CA 90034
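
The "keep the emitted data in a queue and remove it once acked" idea from the quoted question can be sketched roughly as follows. This is only an illustration in plain Python, not Storm code and not what the linked storm-redis project does: the class name `DurableResultQueue` and its methods are hypothetical, and a local temporary directory stands in for the DFS queue mentioned in the thread.

```python
import json
import os
import tempfile
import uuid


class DurableResultQueue:
    """Hypothetical helper: keeps each emitted result on disk until it is acked."""

    def __init__(self, directory):
        # Assumption: a local directory stands in for the DFS queue.
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, result_id):
        return os.path.join(self.directory, result_id + ".json")

    def put(self, result):
        """Persist a result before emitting it; returns the id to ack later."""
        result_id = uuid.uuid4().hex
        tmp_path = self._path(result_id) + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(result, f)
        # Write to a temporary name, then rename (atomic on POSIX), so a
        # crash mid-write never leaves a half-written ".json" entry.
        os.replace(tmp_path, self._path(result_id))
        return result_id

    def ack(self, result_id):
        """Downstream confirmed processing, so the stored copy can be dropped."""
        try:
            os.remove(self._path(result_id))
        except FileNotFoundError:
            pass  # already removed; acking twice is harmless

    def pending(self):
        """Results stored but never acked; these are the ones to replay."""
        results = {}
        for name in sorted(os.listdir(self.directory)):
            if name.endswith(".json"):
                with open(os.path.join(self.directory, name)) as f:
                    results[name[: -len(".json")]] = json.load(f)
        return results


# Usage: store two aggregation results, ack one, then simulate a restart.
queue_dir = tempfile.mkdtemp()
queue = DurableResultQueue(queue_dir)
first = queue.put({"window": 1, "sum": 42})
second = queue.put({"window": 2, "sum": 7})
queue.ack(first)

recovered = DurableResultQueue(queue_dir)  # new instance, same directory
to_replay = recovered.pending()            # only the unacked result remains
```

With this shape, only the aggregated results need to survive a crash, not the hours of raw tuples behind them. The trade-off is at-least-once delivery: if a crash happens after bolt B processed a result but before the ack was recorded, B sees that result again, so B's processing would need to tolerate duplicates.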