storm-user mailing list archives

From Svend Vanderveken <svend.vanderve...@gmail.com>
Subject Re: Svend's blog - several questions
Date Wed, 05 Feb 2014 19:55:54 GMT
On Wed, Feb 5, 2014 at 6:22 PM, Adrian Mocanu <amocanu@verticalscope.com> wrote:

>  I've read Svend's blog [
> http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/]
> multiple times and I have a few questions.
>
>
>

So you are my reader! Great :D
(you can post your questions on the blog itself, I'm more likely to spot them
there)



>
>
> "Because we did a groupBy on one tuple field, each List contains here one
> single String: the correlationId. Note that the list we return must have
> exactly the same size as the list of keys, so that Storm knows what period
> corresponds to what key. So for any key that does not exist in DB, we
> simply put a null in the resulting list."
>
>
>
> Q1: Do the db keys come only from groupBy?
>

Yes, the key values arriving in the multiget are the values of the field by
which we are grouping: do groupBy(new Fields("color")) and you get things
like "blue", "green", "flowery romantic red"...





>  Q2: Can you groupBy multiple keys, like .groupBy("name").groupBy("id")?
>

Yes, the syntax is like this:

groupBy(new Fields("name", "id"))

That's the reason the keys in the multiget are List<Object> and not simply
Object. We receive them in the order they are specified in the topology
definition.
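To make the composite key idea concrete, here is a minimal plain-Java sketch with no Storm dependency. The class name, the groupBy helper and the sample rows are all hypothetical; it only illustrates why a key built from several grouping fields is naturally a List<Object> whose values follow the order of the fields, and it is not Trident's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompositeKeySketch {

    // Hypothetical helper: groups rows by the values found at the given
    // column indexes, in that order. Each key is a List<Object> holding
    // one value per grouping field.
    static Map<List<Object>, List<Object[]>> groupBy(List<Object[]> rows, int... fieldIndexes) {
        Map<List<Object>, List<Object[]>> groups = new LinkedHashMap<>();
        for (Object[] row : rows) {
            List<Object> key = new ArrayList<>();
            for (int index : fieldIndexes) {
                key.add(row[index]);
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up rows of the form (name, id, event).
        List<Object[]> rows = Arrays.asList(
                new Object[] {"alice", 1, "login"},
                new Object[] {"alice", 1, "logout"},
                new Object[] {"bob", 2, "login"});

        // Grouping by (name, id) yields two composite keys.
        System.out.println(groupBy(rows, 0, 1).keySet()); // [[alice, 1], [bob, 2]]
    }
}
```

With a single grouping field the key is still a list, just with one element, which matches the one-String-per-List remark quoted from the blog.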



>  Q3: When we add null we keep the size of the results list the same as
> the keys list, but I don't understand how we make sure that key(3) points
> to the correct result(3).
>
>  After all we're adding nulls at the end of the result list, not
> intermittently. i.e. if key(1) does not have an entry in db, and key size
> is 5, we add null to the last position in results, not to results(1). This
> doesn't preserve consistency/order, so key(1) now gives result(1) which is
> not null as it should be. Is the code incorrect ... or is the explanation
> on Svend's blog incorrect?
>


The order should indeed be respected, so if the strategy for handling a DB
error in a multi-get is to put nulls, then they should indeed be at the
index corresponding to the problematic key. Is there a part of my toy
project code that is padding nulls at the end? If so that's indeed a bug,
please let me know where (or better, fork and send me a pull request).

Note that I'm not particularly recommending putting nulls in case of
unrecoverable errors in a multi-get, that's actually a simplistic way of
handling the error. The contract with Storm is either to fail or to
return a list of the correct size in the correct order. The data itself and
its semantics are up to the topology implementation, i.e. up to us.
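The ordering contract described above can be sketched in plain Java as follows. This is only an illustration under assumptions: FAKE_DB is a hypothetical in-memory stand-in for the real store, the class and value names are made up, and the code does not use Storm's actual state interfaces. The point is that each result is written at the index of its own key, so a missing key gives a null in the middle of the list, never padding at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiGetOrderSketch {

    // Hypothetical stand-in for the database: composite key -> stored value.
    static final Map<List<Object>, String> FAKE_DB = new HashMap<>();

    // Returns exactly one result per key, at the same index as that key.
    // A key that is absent from the store yields null at its own position.
    static List<String> multiGet(List<List<Object>> keys) {
        List<String> results = new ArrayList<>(keys.size());
        for (List<Object> key : keys) {
            results.add(FAKE_DB.get(key)); // null when the key is absent
        }
        return results;
    }

    public static void main(String[] args) {
        FAKE_DB.put(Arrays.<Object>asList("blue"), "period-blue");
        FAKE_DB.put(Arrays.<Object>asList("red"), "period-red");

        List<List<Object>> keys = Arrays.asList(
                Arrays.<Object>asList("blue"),
                Arrays.<Object>asList("green"), // not in the store
                Arrays.<Object>asList("red"));

        System.out.println(multiGet(keys)); // [period-blue, null, period-red]
    }
}
```

Building the result list by iterating over the keys, rather than over whatever rows the store returns, is one simple way to keep sizes and positions aligned.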



>
>
>
>
> Moving on,
>
> "Once this is loaded Storm will present the tuples having the same
> correlation ID
>
> one by one to our reducer, the PeriodBuilder"
>
>
>
> Q4: Does Trident/Storm call the reducer after calling multiGet and before
> calling multiPut?
>

yes


>  Q5: What params (and their types) are passed to the reducer and what
> parameters should it emit so they can go into multiGet?
>

The reducer is called iteratively: each call receives the current state and
one grouped tuple, and returns the new state. It starts with the state found
in the DB (returned by the multiget) and the first grouped tuple, then the
second, then the third... until the last tuple. The return value of the last
call of the reducer is what is provided to the multiput, for the same key as
the multiget.

"reduce" is actually a very common pattern in functional programming, which
we Java programmers are sometimes less aware of. Look up some general doc
on "reduce"; the Storm approach to it is very traditional, i.e. Storm has
defined the "reduce" primitive exactly the way many other tools define that
primitive.
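The iteration described above amounts to a classic fold. Here is a minimal plain-Java sketch of it, assuming a hypothetical reduceGroup helper and a made-up counting reducer; it does not use Trident's real reducer interface. The initial state plays the role of the value returned by the multiget for that key (null if nothing was stored), and the returned value is what would be handed to the multiput.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

public class ReduceSketch {

    // Folds the grouped tuples into the state, one reducer call per tuple.
    // initialState is what the multiget returned for this key (null if absent).
    static <S, T> S reduceGroup(S initialState, List<T> tuples, BiFunction<S, T, S> reducer) {
        S state = initialState;
        for (T tuple : tuples) {
            state = reducer.apply(state, tuple);
        }
        return state; // the final value is what would go to the multiput
    }

    // Made-up reducer: a running count that treats a null state as
    // "nothing stored yet for this key".
    static Integer countEvents(Integer state, String event) {
        return (state == null ? 0 : state) + 1;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("open", "click", "close");

        Integer fresh = reduceGroup(null, events, ReduceSketch::countEvents);
        Integer updated = reduceGroup(10, events, ReduceSketch::countEvents);

        System.out.println(fresh);   // 3
        System.out.println(updated); // 13
    }
}
```

The null check inside the reducer is what lets the same code handle both the first insert and later updates.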


>
>
> Q6: The first time the program is run the database is empty and multiGet
> will return nothing.
>
> Does the reducer need to take care and make sure to insert for the first
> time as opposed to update value? I do see that reducer (TimelineUpdater)
> checks for nulls and I'm guessing this is the reason why it does so.
>
>
>
>
>

Exactly.

That's also why returning null in case of error in the multiget is
questionable and probably not what you would systematically do: it is
equivalent to saying: there's garbage in persistence for that key, so let's
just consider there's nothing. The proper thing to do depends on the task
at hand, but such an error in multiget is often a symptom that we stored
garbage in persistence in the past due to some other problem, and it's too
late to correct it now.

Last thing: most of the time we do not implement multiget/multiput, we just
take an existing implementation for Cassandra or Memcached or anything and
they do what's right for that backend.



>  Q7:
>
> Can someone explain what these mean:
>
> .each  (I've seen this used even consecutively: .each(..).each(..) )
>
> .newStream
>
> .newValuesStream
>
> .persistAggregate
>


I think they are all detailed here:
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview


>
>
> I am unable to find javadocs with documentation for the method signatures.
>
> These java docs don't help much:
> http://nathanmarz.github.io/storm/doc/storm/trident/Stream.html
>
>
>
>
>
> Q8:
>
> Storm has ack/fail; does Trident handle that automatically?
>


Yes, although you can also explicitly trigger an error. Look up my next blog
post: error handling in Storm Trident.



>
>
>
>
> Q9: Has anyone tried Spark? http://spark.incubator.apache.org/streaming/
>
> I'm wondering if anyone has tried it because I'm thinking of ditching
> storm and moving to that.
>
> It seems much much much better documented.
>
>
>

Spark looks cool, but no, I've not played with it yet. Go ahead, and keep us
posted on what you find out!



>
>
> Lots of questions I know. Thanks for reading!
>


and you :D


>
>
> -Adrian
>
>
>


Svend
