From: "Cheng-Kang Hsieh (Andy)"
Date: Wed, 12 Feb 2014 10:51:29 -0500
To: user@storm.incubator.apache.org
Subject: Re: How to efficiently store the intermediate result of a bolt, and so it can be replayed after the crashes?

Hi Adrian,

Yes, that is my understanding too. Sometimes I wonder whether Storm is
really a good fit for this kind of computation (i.e. aggregating the data
over a certain time window when the aggregation operation is not additive).

Abhishek, thank you for pointing that out! I get the point of storm-redis,
but I hadn't thought about extending the same idea to a larger topology.

My idea now (similar to what Tom suggested) is to have the Bolt in the
middle output its data to a Kafka queue, and to make a spout that listens
to this queue, emits the data to the downstream bolts, and handles
re-emission upon failure. Ideally, I want to make a TopologyBuilder that
constructs this automatically, so that, on the surface, it looks the same
as an ordinary topology.
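Very roughly, the wiring would be something like this (just a sketch, not
working code: SourceSpout, AggregatingBolt, and DownstreamBolt are
placeholder names, and KafkaWriterBolt / KafkaReaderSpout stand for
whatever Kafka producer bolt and consumer spout we actually end up using,
e.g. something along the lines of storm-kafka):

    TopologyBuilder builder = new TopologyBuilder();

    // upstream half: source -> aggregating bolt -> bolt that publishes to a Kafka topic
    builder.setSpout("source", new SourceSpout(), 1);
    builder.setBolt("bolt1", new AggregatingBolt(), 4).shuffleGrouping("source");
    builder.setBolt("kafka-writer", new KafkaWriterBolt("intermediate-topic"), 4)
           .shuffleGrouping("bolt1");

    // downstream half: a spout that consumes (and can replay) the intermediate topic
    builder.setSpout("intermediate", new KafkaReaderSpout("intermediate-topic"), 2);
    builder.setBolt("bolt2", new DownstreamBolt(), 4).shuffleGrouping("intermediate");

    StormTopology topology = builder.createTopology();

That way a failure below "intermediate" is replayed from the Kafka topic
rather than all the way from "source".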
Does it make sense?

Thank you all so much for the kind replies. This community is great!

Best,
Andy


On Wed, Feb 12, 2014 at 10:22 AM, Adrian Mocanu wrote:
> Hi
>
> You can fail a tuple from any intermediate bolt, but AFAIK you can't make
> it not resend from the spout, so your precomputed cached result is
> useless. I know that in Spark you can save your RDD to a db/cache, but
> that wouldn't work in Storm.
>
> If I'm wrong, someone correct me.
>
> A
>
> From: Abhishek Bhattacharjee [mailto:abhishek.bhattacharjee11@gmail.com]
> Sent: February-12-14 3:37 AM
> To: user@storm.incubator.apache.org
> Subject: Re: How to efficiently store the intermediate result of a bolt,
> and so it can be replayed after the crashes?
>
> Hi Cheng,
>
> If you look at the repo Aniket posted the link to and read its README,
> you'll find what you are asking for in the mail above. I'll repost the
> link here: https://github.com/pict2014/storm-redis . This does what you
> are asking for: it uses *kafka* for replaying and *redis* for caching the
> intermediate state in batches. If you have a good understanding of Storm,
> you can read the code and see how it works. It uses transactional
> topologies.
>
> Thanks,
>
> On Wed, Feb 12, 2014 at 4:17 AM, Cheng-Kang Hsieh (Andy)
> <changun@cs.ucla.edu> wrote:
>
> Hi Adrian,
>
> Thank you so much for the input! If I understand how a Spout works
> correctly, wouldn't the tuple be regarded as failed if it has not been
> fully acked before the timeout (which, by default, is 30 secs)? From my
> understanding (which can be totally wrong), the Storm-ish way to respond
> to a failed tuple is to call the *fail* method in the root Spout, which,
> in turn, re-emits the failed tuple to the topology.
>
> It would be nice if there were a *fail* method in the intermediate bolt
> that is called when the downstream bolts fail; then this bolt could just
> re-emit its intermediate results to the downstream bolt without
> restarting the process all the way up from the root spout.
>
> As a use case, say I have 3 components chained together as follows:
> Spout -> Bolt1 -> Bolt2. Bolt1 aggregates the data within every
> fixed-size time window in a day and computes some measurements based on
> it (e.g. the user's activity in each hour of the day). With the current
> design of Storm, when Bolt2 fails, the Spout has to resend all the data
> in the corresponding time window for Bolt1 to recompute the results. It
> would be nice if Bolt1 could cache the results and resend them when
> Bolt2 fails.
>
> Does it make sense? Any input is appreciated!
>
> Best,
> Andy
>
> On Tue, Feb 11, 2014 at 5:03 PM, Adrian Mocanu wrote:
>
> You can have acks from bolt to bolt.
>
> Spout:
>
>     // emit with a message id so Storm ties the tuple to this UID
>     _collector.emit(new Values(queue.dequeue()), uniqueID);
>
> Then Bolt1 will ack the tuple only after it emits to Bolt2, so that the
> ack can be tied to the tuple.
>
> Bolt1:
>
>     // emit first (anchored - read below to see what this means), then ack
>     _collector.emit(tuple, new Values("stuff"));
>     _collector.ack(tuple);
>
> At this point the tuple from the Spout has been acked in Bolt1, but at
> the same time the newly emitted tuple "stuff" sent to Bolt2 is "anchored"
> to the tuple from the Spout. What this means is that it still needs to be
> acked later on, otherwise the spout will resend it on timeout.
>
> Bolt2:
>
>     _collector.ack(tuple);
>
> Bolt2 needs to ack the tuple received from Bolt1, which sends in the last
> ack the Spout was waiting for. If at this point Bolt2 emits an anchored
> tuple, then there must be a Bolt3 which will receive it and ack it. If
> the tuple is not acked at the last point, the Spout will time it out and
> resend it.
>
> Each time anchoring is done on an emit from bolt to bolt, a new node in a
> "tree" structure is built - well, more like a list in my case, since I
> never send the same tuple to two or more bolts; I have a 1-to-1
> relationship. All nodes in the tree need to be acked, and only then is
> the tuple marked as fully processed. If a tuple is emitted with a UID and
> anchored later on but never acked, it will be kept in memory until it is
> acked (or the topology's message timeout kicks in).
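> If it helps, here is Bolt1 written out as a full class (a rough sketch
> from memory, untested; the field names and the pass-through value are
> made up, the anchor-then-ack pattern is the only part that matters):
>
>     import backtype.storm.task.OutputCollector;
>     import backtype.storm.task.TopologyContext;
>     import backtype.storm.topology.OutputFieldsDeclarer;
>     import backtype.storm.topology.base.BaseRichBolt;
>     import backtype.storm.tuple.Fields;
>     import backtype.storm.tuple.Tuple;
>     import backtype.storm.tuple.Values;
>     import java.util.Map;
>
>     public class Bolt1 extends BaseRichBolt {
>         private OutputCollector _collector;
>
>         public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>             _collector = collector;
>         }
>
>         public void execute(Tuple tuple) {
>             // anchored emit: the new tuple becomes a node in the input tuple's tree
>             _collector.emit(tuple, new Values("stuff"));
>             // ack only after the anchored emit, so tracking is handed on to Bolt2
>             _collector.ack(tuple);
>         }
>
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             declarer.declare(new Fields("value"));
>         }
>     }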
> Hope this helps.
>
> From: Tom Brown [mailto:tombrown52@gmail.com]
> Sent: February-11-14 4:57 PM
> To: user@storm.incubator.apache.org
> Subject: Re: How to efficiently store the intermediate result of a bolt,
> and so it can be replayed after the crashes?
>
> We use 2 Storm topologies, with Kafka in between:
>
>     Kafka --> TopologyA --> Kafka --> TopologyB --> Final output
>
> This allows the two halves of the computation to be scaled and maintained
> independently.
>
> --Tom
>
> On Tue, Feb 11, 2014 at 2:36 PM, Cheng-Kang Hsieh (Andy)
> <changun@cs.ucla.edu> wrote:
>
> Hi Aniket & Adrian,
>
> Thank you both so much for the kind replies! Although the replies don't
> directly solve my problem, it has been very rewarding to follow the code
> of storm-redis and Trident.
>
> I guess storing the intermediate data in an external db (like Cassandra,
> as suggested by Adrian) would work, but what if the Bolt that is supposed
> to receive the intermediate data fails? In this case the emitter is also
> a Bolt, and does not have the nice ACK mechanism to rely on, so the
> emitting Bolt might never know when it should resend the data to the
> receiving Bolt.
>
> In other frameworks like Samza or Spark Streaming, all the emitted data,
> whether from a Spout or a Bolt, is treated the same way and so benefits
> from the same fault-tolerance mechanism (they are not as easy to use as
> Storm, though). For example, in Samza all the output of a component is
> pushed to a Kafka queue with the receiving components as the listeners
> (see here).
>
> Conceptually, maybe a more general solution for Storm is to make a Bolt
> also a Spout which can receive ACKs from the receiving Bolts; however,
> that seems to violate the assumptions of Storm?
>
> Again, I appreciate any advice or suggestion. Thank you!
>
> Best,
> Andy
>
> On Fri, Feb 7, 2014 at 9:37 AM, Adrian Mocanu wrote:
>
> Hi Andy,
>
> I think you can use Trident to persist the results at any point in your
> stream processing. I believe the way you do that is by using
> stream.persistentAggregate(...).
>
> Here's an example from
> https://github.com/nathanmarz/storm/wiki/Trident-tutorial :
>
>     TridentTopology topology = new TridentTopology();
>     TridentState wordCounts =
>         topology.newStream("spout1", spout)
>             .each(new Fields("sentence"), new Split(), new Fields("word"))
>             .groupBy(new Fields("word"))
>             .persistentAggregate(new MemoryMapState.Factory(), new Count(),
>                                  new Fields("count"))
>             .parallelismHint(6);
>
> In this case the counts (replace counts with whatever operation you are
> doing) are stored in a memory map, but you can make another class that
> saves this intermediate result to a db... at least that's my
> understanding; I am currently also learning these things.
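> For example (only a sketch based on my reading of the Trident state docs
> - I haven't run it, the db calls are placeholders, and a non-transactional
> map gives up Trident's exactly-once bookkeeping; see the docs for the
> transactional/opaque variants), a db-backed replacement for
> MemoryMapState might look roughly like this:
>
>     import backtype.storm.task.IMetricsContext;
>     import storm.trident.state.State;
>     import storm.trident.state.StateFactory;
>     import storm.trident.state.map.IBackingMap;
>     import storm.trident.state.map.NonTransactionalMap;
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.Map;
>
>     public class DbBackingMap implements IBackingMap<Long> {
>         public List<Long> multiGet(List<List<Object>> keys) {
>             List<Long> vals = new ArrayList<Long>();
>             for (List<Object> key : keys) {
>                 vals.add(null); // placeholder: read the current value for this key from your db (null if absent)
>             }
>             return vals;
>         }
>
>         public void multiPut(List<List<Object>> keys, List<Long> vals) {
>             // placeholder: write each (key, value) pair back to your db
>         }
>
>         public static class Factory implements StateFactory {
>             public State makeState(Map conf, IMetricsContext metrics,
>                                    int partitionIndex, int numPartitions) {
>                 return NonTransactionalMap.build(new DbBackingMap());
>             }
>         }
>     }
>
> and then you would pass new DbBackingMap.Factory() to persistentAggregate
> instead of new MemoryMapState.Factory().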
> I'm currently working on a similar problem and I'm attempting to store
> into Cassandra. Feel free to watch my conversation threads (with Svend
> and Taylor Goetz).
>
> -A
>
> From: Aniket Alhat [mailto:aniket.alhat@gmail.com]
> Sent: February-06-14 11:57 PM
> To: user@storm.incubator.apache.org
> Subject: Re: How to efficiently store the intermediate result of a bolt,
> and so it can be replayed after the crashes?
>
> I hope this helps:
> https://github.com/pict2014/storm-redis
>
> On Feb 7, 2014 12:07 AM, "Cheng-Kang Hsieh (Andy)" wrote:
>
> Sorry, I realized that question was badly written. Simply put, my
> question is: is there a recommended way to store the tuples emitted by a
> BOLT so that the tuples can be replayed after a crash without repeating
> the process all the way up from the source spout? Any advice would be
> appreciated. Thank you!
>
> Best,
> Andy
>
> On Tue, Feb 4, 2014 at 11:58 AM, Cheng-Kang Hsieh (Andy)
> <changun@cs.ucla.edu> wrote:
>
> Hi all,
>
> First of all, thanks to Nathan and all the contributors for putting out
> such a great framework! I am learning a lot even just from reading the
> discussion threads.
>
> I am building a topology that contains one spout along with a chain of
> bolts (e.g. S -> A -> B, where S is the spout and A, B are bolts).
>
> When S emits a tuple, the next bolt A buffers the tuple in a DFS,
> computes some aggregated values once it has received a sufficient amount
> of data, and then emits the aggregation results to the next bolt B.
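> To make it concrete, A currently looks roughly like this (simplified: it
> buffers in memory here instead of the DFS, uses a made-up threshold for
> "sufficient amount of data", and takes a plain average as the aggregate):
>
>     import backtype.storm.task.OutputCollector;
>     import backtype.storm.task.TopologyContext;
>     import backtype.storm.topology.OutputFieldsDeclarer;
>     import backtype.storm.topology.base.BaseRichBolt;
>     import backtype.storm.tuple.Fields;
>     import backtype.storm.tuple.Tuple;
>     import backtype.storm.tuple.Values;
>     import java.util.ArrayList;
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     public class BoltA extends BaseRichBolt {
>         private OutputCollector collector;
>         private Map<String, List<Double>> buffer; // window key -> buffered values
>
>         public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>             this.collector = collector;
>             this.buffer = new HashMap<String, List<Double>>();
>         }
>
>         public void execute(Tuple input) {
>             String window = input.getStringByField("user") + "/" + input.getLongByField("hour");
>             List<Double> values = buffer.get(window);
>             if (values == null) {
>                 values = new ArrayList<Double>();
>                 buffer.put(window, values);
>             }
>             values.add(input.getDoubleByField("value"));
>             if (values.size() >= 1000) { // "sufficient amount of data" for this window
>                 collector.emit(input, new Values(window, average(values)));
>                 buffer.remove(window);
>             }
>             // the dilemma: once acked here, S will not replay this tuple if A
>             // crashes with the window still buffered; holding the ack until the
>             // window is emitted means tracking tuples for hours (see below)
>             collector.ack(input);
>         }
>
>         private double average(List<Double> xs) {
>             double sum = 0;
>             for (double x : xs) sum += x;
>             return sum / xs.size();
>         }
>
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             declarer.declare(new Fields("window", "aggregate"));
>         }
>     }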
> Here comes my question: is there a recommended way to store the
> intermediate results emitted by a bolt, so that when a machine crashes
> the results can be replayed to the downstream bolts (i.e. bolt B)?
>
> One possible solution could be: don't keep any intermediate results, but
> resort to Storm's ack framework, so that the raw data is replayed from
> spout S when a crash happens.
>
> However, this approach might not be appropriate in my case, as it might
> take a pretty long time (like a couple of hours) before bolt A has
> received all the required data and emitted the aggregated results, so it
> would be very expensive for the ack framework to keep tracking that many
> tuples for that long.
>
> An alternative solution could be: *making bolt A also a spout* and
> keeping the emitted data in a DFS queue. When a result has been acked,
> bolt A removes it from the queue.
>
> I am wondering whether it is reasonable to make a task both a bolt and a
> spout at the same time, or whether there is a better approach.
>
> Thank you!
>
> --
> Cheng-Kang Hsieh
> UCLA Computer Science PhD Student
> M: (310) 990-4297
> A: 3770 Keystone Ave. Apt 402,
>    Los Angeles, CA 90034
>
> --
> *Abhishek Bhattacharjee*
> *Pune Institute of Computer Technology*

--
Cheng-Kang Hsieh
UCLA Computer Science PhD Student
M: (310) 990-4297
A: 3770 Keystone Ave. Apt 402,
   Los Angeles, CA 90034