Subject: Re: How to efficiently store the intermediate result of a bolt, and so it can be replayed after the crashes?
From: Abhishek Bhattacharjee <abhishek.bhattacharjee11@gmail.com>
To: user@storm.incubator.apache.org
Date: Wed, 12 Feb 2014 14:06:44 +0530

Hi Cheng,

If you look at the repo Aniket linked and read its README, you'll find what you are asking
for in the mail above. I'll repost the link here: https://github.com/pict2014/storm-redis .
It does what you are asking for: it uses *Kafka* for replaying and *Redis* for caching the
intermediate state in batches. If you have a good understanding of Storm, you can read the
code and understand how it works. It uses transactional topologies.

Thanks,

On Wed, Feb 12, 2014 at 4:17 AM, Cheng-Kang Hsieh (Andy) <changun@cs.ucla.edu> wrote:

> Hi Adrian,
>
> Thank you so much for the input!
> If I understand how Spout works correctly, wouldn't the tuple be regarded as failed if it
> has not been fully acked before the timeout (30 seconds by default)? From my understanding
> (which can be totally wrong), the Storm-ish way to respond to a failed tuple is to call the
> *fail* method on the root Spout, which, in turn, re-emits the failed tuple into the topology.
>
> It would be nice if there were a *fail* method on an intermediate bolt that is called when
> a downstream bolt fails; that bolt could then re-emit the intermediate results to the
> downstream bolt without restarting the process all the way up from the root spout.
>
> As a use case, say I have 3 components chained together as follows: Spout -> Bolt1 -> Bolt2.
> Bolt1 aggregates the data within every fixed-size time window in a day and computes some
> measurements based on it (e.g. the user's activities for each hour of the day). With the
> current design of Storm, when Bolt2 fails, the Spout has to resend all the data in the
> corresponding time window for Bolt1 to recompute the results. It would be nice if Bolt1
> could cache the results and resend them when Bolt2 fails.
>
> Does that make sense?
> Any input is appreciated!
>
> Best,
> Andy
>
>
> On Tue, Feb 11, 2014 at 5:03 PM, Adrian Mocanu <amocanu@verticalscope.com> wrote:
>
>> You can have acks from bolt to bolt.
>>
>> Spout:
>>
>>     // passing a message ID as the second argument ties the tuple to this UID
>>     _collector.emit(new Values(queue.dequeue()), uniqueID);
>>
>> Then Bolt1 will ack the tuple only after it emits to Bolt2, so that the ack can be tied
>> to the tuple.
>>
>> Bolt1:
>>
>>     // emit first, then ack
>>     _collector.emit(tuple, new Values("stuff"));  // *anchoring* -- read below to see what it means
>>     _collector.ack(tuple);
>>
>> At this point the tuple from the Spout has been acked in Bolt1, but at the same time the
>> newly emitted tuple "stuff" sent to Bolt2 is "anchored" to the tuple from the Spout. What
>> this means is that it still needs to be acked later on; otherwise, on timeout, it will be
>> resent by the spout.
>>
>> Bolt2:
>>
>>     _collector.ack(tuple);
>>
>> Bolt2 needs to ack the tuple received from Bolt1, which sends the last ack the Spout was
>> waiting for. If at this point Bolt2 emits a tuple, then there must be a Bolt3 which will
>> receive and ack it. If the tuple is not acked at the last point, the Spout will time it
>> out and resend it.
>>
>> Each time anchoring is done on an emit statement from bolt to bolt, a new node in a
>> "tree" structure is built... well, more like a list in my case, since I never send the
>> same tuple to 2 or more bolts; I have a 1-to-1 relationship.
>>
>> All nodes in the tree need to be acked; only then is the tuple marked as fully processed.
>> If the tuple is not acked, and it was sent with a UID and anchored later on, then it will
>> be kept in memory forever (until acked).
>>
>> Hope this helps.
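A minimal, self-contained sketch of the anchor-then-ack pattern described above, assuming
the Storm 0.9-era backtype.storm API; the class name Bolt1 and the output field name are
illustrative only:

    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Intermediate bolt that anchors its output to the incoming tuple, then acks the input.
    public class Bolt1 extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            // Anchored emit: the new tuple becomes another node in the spout tuple's tree,
            // so the downstream bolt must ack it before the spout tuple counts as complete.
            collector.emit(input, new Values("stuff"));
            // Ack the incoming tuple only after the anchored emit.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("stuff"));
        }
    }

The matching spout emit would pass a message ID as the second argument, e.g.
collector.emit(new Values(msg), msgId), so that Storm tracks the ack tree for that tuple.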
>>
>> *From:* Tom Brown [mailto:tombrown52@gmail.com]
>> *Sent:* February-11-14 4:57 PM
>> *To:* user@storm.incubator.apache.org
>> *Subject:* Re: How to efficiently store the intermediate result of a bolt, and so it can
>> be replayed after the crashes?
>>
>> We use 2 Storm topologies, with Kafka in between: Kafka --> TopologyA --> Kafka -->
>> TopologyB --> Final output
>>
>> This allows the two halves of the computation to be scaled and maintained independently.
>>
>> --Tom
>>
>> On Tue, Feb 11, 2014 at 2:36 PM, Cheng-Kang Hsieh (Andy) <changun@cs.ucla.edu> wrote:
>>
>> Hi Aniket & Adrian,
>>
>> Thank you guys so much for the kind reply! Although the replies don't directly solve my
>> problem, it has been very rewarding following the code of storm-redis and Trident.
>>
>> I guess storing the intermediate data in an external db (like Cassandra, as suggested by
>> Adrian) would work, but what if the Bolt that is supposed to receive the intermediate data
>> fails? In this case, the emitter is also a Bolt and does not have the nice ACK mechanism
>> to rely on, so the emitting Bolt might never know when it should resend the data to the
>> receiving Bolt.
>>
>> In other frameworks like Samza or Spark Streaming, all emitted data, whether from a Spout
>> or a Bolt, is treated the same way and so benefits from the same fault-tolerance mechanism
>> (they are not as easy to use as Storm, though). For example, in Samza, all the output data
>> of a component is pushed to a Kafka queue with the receiving components as listeners (see
>> here).
>>
>> Conceptually, maybe a more general solution for Storm is to make a Bolt also a Spout,
>> which can receive ACKs from the receiving Bolts; however, that seems to violate Storm's
>> assumptions?
>>
>> Again, I appreciate any advice or suggestion. Thank you!
>>
>> Best,
>> Andy
>>
>> On Fri, Feb 7, 2014 at 9:37 AM, Adrian Mocanu <amocanu@verticalscope.com> wrote:
>>
>> Hi Andy,
>>
>> I think you can use Trident to persist the results at any point in your stream processing.
>> I believe the way you do that is by using STREAM.persistentAggregate(...)
>>
>> Here's an example from https://github.com/nathanmarz/storm/wiki/Trident-tutorial
>>
>>     TridentTopology topology = new TridentTopology();
>>     TridentState wordCounts =
>>         topology.newStream("spout1", spout)
>>             .each(new Fields("sentence"), new Split(), new Fields("word"))
>>             .groupBy(new Fields("word"))
>>             .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
>>             .parallelismHint(6);
>>
>> In this case the counts (replace counts with whatever operations you are doing) are
>> stored in a memory map, but you can make another class that saves this intermediate
>> result to a db... at least that's my understanding... I am currently also learning
>> these things.
>>
>> I'm currently working on a similar problem and I'm attempting to store into Cassandra.
>> Feel free to watch my conversation threads (with Svend and Taylor Goetz).
>>
>> -A
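A rough, hedged sketch of the "another class that saves this intermediate result to a db"
idea: persistentAggregate can be pointed at a custom state by implementing Trident's
IBackingMap plus a StateFactory. This assumes a 0.9-era Trident API; the class names are
illustrative, and the in-memory HashMap merely stands in for real database reads/writes:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import backtype.storm.task.IMetricsContext;
    import storm.trident.state.State;
    import storm.trident.state.StateFactory;
    import storm.trident.state.map.IBackingMap;
    import storm.trident.state.map.NonTransactionalMap;

    // Backing map that Trident reads and writes in batches; swap the HashMap for real DB calls.
    public class DbBackingMap implements IBackingMap<Long> {
        private final Map<List<Object>, Long> db = new HashMap<List<Object>, Long>(); // placeholder store

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            List<Long> result = new ArrayList<Long>(keys.size());
            for (List<Object> key : keys) {
                result.add(db.get(key)); // null means "no value yet" for this key
            }
            return result;
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            for (int i = 0; i < keys.size(); i++) {
                db.put(keys.get(i), vals.get(i));
            }
        }

        // Factory handed to persistentAggregate(...) in place of MemoryMapState.Factory().
        public static class Factory implements StateFactory {
            @Override
            public State makeState(Map conf, IMetricsContext metrics, int partitionIndex, int numPartitions) {
                return NonTransactionalMap.build(new DbBackingMap());
            }
        }
    }

It would be plugged in as .persistentAggregate(new DbBackingMap.Factory(), new Count(),
new Fields("count")); for exactly-once behaviour across replays, the transactional or
opaque map wrappers would replace NonTransactionalMap.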
>>
>> *From:* Aniket Alhat [mailto:aniket.alhat@gmail.com]
>> *Sent:* February-06-14 11:57 PM
>> *To:* user@storm.incubator.apache.org
>> *Subject:* Re: How to efficiently store the intermediate result of a bolt, and so it can
>> be replayed after the crashes?
>>
>> I hope this helps
>> https://github.com/pict2014/storm-redis
>>
>> On Feb 7, 2014 12:07 AM, "Cheng-Kang Hsieh (Andy)" <changun@cs.ucla.edu> wrote:
>>
>> Sorry, I realized that question was badly written. Simply put: is there a recommended way
>> to store the tuples emitted by a BOLT so that they can be replayed after a crash without
>> repeating the process all the way up from the source spout? Any advice would be
>> appreciated. Thank you!
>>
>> Best,
>> Andy
>>
>> On Tue, Feb 4, 2014 at 11:58 AM, Cheng-Kang Hsieh (Andy) <changun@cs.ucla.edu> wrote:
>>
>> Hi all,
>>
>> First of all, thanks to Nathan and all the contributors for putting out such a great
>> framework! I am learning a lot, even just reading the discussion threads.
>>
>> I am building a topology that contains one spout along with a chain of bolts (e.g.
>> S -> A -> B, where S is the spout and A, B are bolts).
>>
>> When S emits a tuple, the next bolt A will buffer the tuple in a DFS, compute some
>> aggregated values once it has received a sufficient amount of data, and then emit the
>> aggregation results to the next bolt B.
>>
>> Here comes my question: is there a recommended way to store the intermediate results
>> emitted by a bolt, so that when a machine crashes the results can be replayed to the
>> downstream bolts (i.e. bolt B)?
>>
>> One possible solution: don't keep any intermediate results, and resort to Storm's ack
>> framework, so that the raw data is replayed from spout S when a crash happens.
>>
>> However, this approach might not be appropriate in my case, as it might take a pretty
>> long time (a couple of hours) before bolt A has received all the required data and emits
>> the aggregated results, so it would be very expensive for the ack framework to keep
>> tracking that many tuples for that long.
>>
>> An alternative solution could be *making bolt A also a spout* and keeping the emitted
>> data in a DFS queue. When a result has been acked, bolt A removes it from the queue.
>>
>> I am wondering if it is reasonable to make a task both a bolt and a spout at the same
>> time, or if there is a better approach.
>>
>> Thank you!
>>
>> --
>> Cheng-Kang Hsieh
>> UCLA Computer Science PhD Student
>> M: (310) 990-4297
>> A: 3770 Keystone Ave. Apt 402,
>>    Los Angeles, CA 90034
>
> --
> Cheng-Kang Hsieh
> UCLA Computer Science PhD Student
> M: (310) 990-4297
> A: 3770 Keystone Ave. Apt 402,
>    Los Angeles, CA 90034

--
*Abhishek Bhattacharjee*
*Pune Institute of Computer Technology*
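To make Tom's two-topology suggestion (and the Kafka-for-replay idea behind storm-redis)
concrete, here is a rough sketch of the second topology consuming Bolt A's aggregated
results from an intermediate Kafka topic. It assumes the storm-kafka (0.8-plus) spout;
the ZooKeeper address, topic name, and the BoltB placeholder are illustrative only:

    import java.util.UUID;

    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class TopologyBWiring {

        // Placeholder downstream bolt; real processing of the aggregated results goes here.
        public static class BoltB extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println("got intermediate result: " + tuple.getString(0));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // no further output
            }
        }

        public static TopologyBuilder build() {
            // TopologyA writes Bolt A's aggregated results to the intermediate topic;
            // TopologyB replays them from Kafka instead of recomputing from the original spout.
            BrokerHosts hosts = new ZkHosts("zkhost:2181");                  // placeholder ZooKeeper address
            SpoutConfig cfg = new SpoutConfig(hosts, "intermediate-results", // placeholder topic name
                                              "/kafka-spout", UUID.randomUUID().toString());
            cfg.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("intermediate", new KafkaSpout(cfg));
            builder.setBolt("bolt-b", new BoltB()).shuffleGrouping("intermediate");
            return builder;
        }
    }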