Mailing-List: contact user-help@storm.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@storm.incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of aniket.alhat@gmail.com
 designates 209.85.217.178 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAJsx-0q-mV++Ei_+NY+GD9-X=tUZxoHtH_zoAZNr8YPY+W-ApQ@mail.gmail.com>
References: 
 <CAJsx-0pRd4ewd3v22FL59XrSDKjbE_K097m0SXsugX7bPOVq+w@mail.gmail.com>
	<CAJsx-0q-mV++Ei_+NY+GD9-X=tUZxoHtH_zoAZNr8YPY+W-ApQ@mail.gmail.com>
Date: Fri, 7 Feb 2014 10:26:47 +0530
Message-ID: 
 <CAGW7ORgUgmHr-kOsRj7O8WAtEb-E5o5qP+jx5mCxjSvcsE31Qw@mail.gmail.com>
Subject: Re: How to efficiently store the intermediate result of a bolt, and
 so it can be replayed after the crashes?
From: Aniket Alhat <aniket.alhat@gmail.com>
To: user@storm.incubator.apache.org
Content-Type: multipart/alternative; boundary=001a11c3ee8e39b32e04f1c9d120

--001a11c3ee8e39b32e04f1c9d120
Content-Type: text/plain; charset=ISO-8859-1

I hope this helps:

https://github.com/pict2014/storm-redis
On Feb 7, 2014 12:07 AM, "Cheng-Kang Hsieh (Andy)" <changun@cs.ucla.edu>
wrote:

> Sorry, I realized that my question was badly written. Simply put: is there
> a recommended way to store the tuples emitted by a bolt so that they can be
> replayed after a crash, without repeating the processing all the way from
> the source spout? Any advice would be appreciated.
> Thank you!
>
> Best,
> Andy
>
>
> On Tue, Feb 4, 2014 at 11:58 AM, Cheng-Kang Hsieh (Andy) <
> changun@cs.ucla.edu> wrote:
>
>> Hi all,
>>
>> First of all, thanks to Nathan and all the contributors for putting
>> together such a great framework! I am learning a lot, even just from
>> reading the discussion threads.
>>
>> I am building a topology that contains one spout followed by a chain of
>> bolts (e.g. S -> A -> B, where S is the spout and A and B are bolts).
>>
>> When S emits a tuple, the next bolt, A, buffers the tuple in a DFS. Once it
>> has received a sufficient amount of data, it computes some aggregated
>> values and emits the aggregation results to the next bolt, B.
>>
>> Here is my question: is there a recommended way to store the
>> intermediate results emitted by a bolt, so that when a machine crashes, the
>> results can be replayed to the downstream bolts (i.e. bolt B)?
>>
>> One possible solution: don't keep any intermediate results, and instead
>> rely on Storm's ack framework, so that the raw data is replayed from
>> spout S when a crash happens.
>>
>> However, this approach might not be appropriate in my case. It can take a
>> long time (a couple of hours) before bolt A has received all the required
>> data and emitted the aggregated results, so it would be very expensive for
>> the ack framework to keep tracking that many tuples for that long.
>>
>> An alternative solution could be *making bolt A also a spout* and keeping
>> the emitted data in a DFS queue. When a result has been acked, bolt A
>> removes it from the queue.
>>
>> I am wondering whether it is reasonable to make a task both a bolt and a
>> spout at the same time, or whether there is a better approach.
>>
>> Thank you!
>>
>> --
>> Cheng-Kang Hsieh
>> UCLA Computer Science PhD Student
>> M: (310) 990-4297
>> A: 3770 Keystone Ave. Apt 402,
>>      Los Angeles, CA 90034
>>
>
>
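
A rough sketch of the "DFS queue" idea from the original question, in case it
is useful. This is plain Python, not Storm API: DurableOutbox, put, pending,
and ack are made-up names for illustration, and a local directory stands in
for the DFS. The assumption is that bolt A calls put() before emitting an
aggregated result, and that whatever re-emits the results to B (for example
a spout reading the queue) calls ack() once the downstream tuple is acked.

```python
import json
import os
import tempfile
import uuid


class DurableOutbox:
    """Hypothetical stand-in for the DFS queue: one JSON file per pending result."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def put(self, result):
        """Persist a result before it is emitted and return its message id."""
        msg_id = uuid.uuid4().hex
        final_path = os.path.join(self.directory, msg_id + ".json")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(result, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename only after the data is on disk, so pending() never
        # picks up a half-written file.
        os.replace(tmp_path, final_path)
        return msg_id

    def pending(self):
        """Return {msg_id: result} for every result that has not been acked."""
        results = {}
        for name in sorted(os.listdir(self.directory)):
            if name.endswith(".json"):
                with open(os.path.join(self.directory, name)) as f:
                    results[name[:-len(".json")]] = json.load(f)
        return results

    def ack(self, msg_id):
        """Drop a result once the downstream bolt has fully processed it."""
        path = os.path.join(self.directory, msg_id + ".json")
        if os.path.exists(path):
            os.remove(path)


def demo():
    directory = tempfile.mkdtemp()
    outbox = DurableOutbox(directory)
    first = outbox.put({"window": 1, "sum": 42})
    second = outbox.put({"window": 2, "sum": 7})
    outbox.ack(first)  # B finished with the first result
    # Simulate a crash: a fresh instance over the same directory
    # still sees the result that was never acked.
    recovered = DurableOutbox(directory)
    return second, recovered.pending()
```

After a restart, everything returned by pending() would be emitted again, so
bolt B has to tolerate seeing the same result more than once.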

--001a11c3ee8e39b32e04f1c9d120--