storm-user mailing list archives

From Adrian Mocanu <amoc...@verticalscope.com>
Subject RE: Svend's blog - several questions
Date Fri, 07 Feb 2014 01:08:24 GMT
Thanks for pointing it out!

That test file is extremely helpful!



Hi all,

I'm using storm-cassandra (mentioned in this thread). I keep getting an NPE:



10241 [Thread-21] ERROR com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor - com.netflix.astyanax.connectionpool.exceptions.UnknownException: UnknownException: [host=10.10.6.80(10.10.6.80):9160, latency=23(24), attempts=1]java.lang.NullPointerException
com.netflix.astyanax.connectionpool.exceptions.UnknownException: UnknownException: [host=10.10.6.80(10.10.6.80):9160, latency=23(24), attempts=1]java.lang.NullPointerException



The code I use is the following (for a local run, not a cluster):



val options = new chat.CassandraMapState.Options[TransactionalValue[_]]();
options.columnFamily = "transactional";
val clusterContext = chat.AstyanaxUtil.newClusterContext("10.10.6.80:9160");
chat.AstyanaxUtil.createColumnFamily(clusterContext, "test", "transactional", "UTF8Type", "UTF8Type", "UTF8Type");



val cassandraStateFactory: StateFactory = CassandraMapState.transactional(options)

val spout = new FixedBatchSpout(new Fields("sentence"), 3,
  new Values("the cow jumped over the moon"),
  new Values("the man went to the store and bought some candy"),
  new Values("four score and seven years ago"),
  new Values("how many apples can you eat"))
spout.setCycle(true)

val wordCounts: TridentState = tridentBuilder.newStream("spout1", spout)
  .each(new Fields("sentence"), new Split(), new Fields("word"))
  .groupBy(new Fields("word"))
  .persistentAggregate(cassandraStateFactory, new Count(), new Fields("count"))

val cluster = new LocalCluster();
val config = new Config();
val clientConfig = new util.HashMap[String, Object]();
clientConfig.put(StormCassandraConstants.CASSANDRA_HOST, "10.10.6.80:9160");
clientConfig.put(StormCassandraConstants.CASSANDRA_STATE_KEYSPACE, "test");
config.put("cassandra.config", clientConfig);  // must match the KEY from Options.clientConfigKey
config.setMaxSpoutPending(100);
config.setMaxSpoutPending(25);
cluster.submitTopology("test", config, tridentBuilder.build());



This code creates keyspace test and column family transactional, along with a few other things.

So the connection to Cassandra is fine.

The code that throws the exception is in the multiGet method:

RowSliceQuery<Composite, String> query = this.keyspace.prepareQuery(cf).getKeySlice(keyNames);
Rows<Composite, String> result = null;
try {
    result = query.execute().getResult();  // <-- this gives the NPE
} catch (ConnectionException e) {
    // TODO throw a specific error.
    throw new RuntimeException(e);         // <-- caught here
}



The keys passed to multiGet are: [[moon], [bought], [the], [some], [score], [cow], [went], [and], [to], [seven], [over], [store], [years], [jumped], [candy], [four], [ago], [man]]



Is this exception thrown because of a schema mismatch? If so, what column names/schema am I supposed to use to run this example code?



Thanks so much.

I'm getting close!!



-A
From: P. Taylor Goetz [mailto:ptgoetz@gmail.com]
Sent: February-06-14 2:10 PM
To: user@storm.incubator.apache.org
Subject: Re: Svend's blog - several questions

It's in maven central:


<dependency>
    <groupId>com.hmsonline</groupId>
    <artifactId>storm-cassandra</artifactId>
    <version>0.4.0-rc4</version>
</dependency>

- Taylor


On Feb 6, 2014, at 2:05 PM, Adrian Mocanu <amocanu@verticalscope.com> wrote:


Hi Taylor,
I will give this a try. What is the maven repository for it?

I've found several ones:
"com.hmsonline" % "hms-cassandra-rest" % "1.0.0"
"com.github.ptgoetz" % "storm-cassandra" % "0.1.2"

And now
"com.hmsonline" % "storm-cassandra" % "0.4.0-rc4"
from
http://mvnrepository.com/artifact/com.hmsonline/storm-cassandra

-A
From: P. Taylor Goetz [mailto:ptgoetz@gmail.com]
Sent: February-06-14 11:31 AM
To: user@storm.incubator.apache.org
Subject: Re: Svend's blog - several questions

Thanks Svend. Good explanation.

Adrian,

The storm-cassandra documentation could be better in terms of explaining how to use the MapState implementation, but there's a unit test that demonstrates basic usage:

https://github.com/hmsonline/storm-cassandra/blob/master/src/test/java/com/hmsonline/storm/cassandra/bolt/CassandraMapStateTest.java

Basically, you just need to point it to a keyspace + column family where the state data will
be stored.

- Taylor

On Feb 6, 2014, at 3:25 AM, Svend Vanderveken <svend.vanderveken@gmail.com> wrote:




The logic of a map state is to keep a "state" somewhere. You can think of a Storm state as a big map of key-value pairs: the keys come from the groupBy and the values are the result of the aggregations. Conceptually, when your topology is talking to a State, you can imagine it's actually talking to a big HashMap (only there's a DB behind it for persistence, plus opaque logic for error handling).
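Svend's "big HashMap" analogy can be sketched in plain Java. This is illustrative only (no Storm dependency; `MapStateSketch` and its method names merely mirror the multiGet/multiPut contract, they are not the actual Trident API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: models the multiGet/multiPut contract of an
// IBackingMap-style state as a plain in-memory map.
class MapStateSketch<K, V> {
    private final Map<K, V> backing = new HashMap<>();

    // Returns exactly one value per key, in the same order; null where the
    // key is absent (e.g. on the first run against an empty DB).
    public List<V> multiGet(List<K> keys) {
        List<V> result = new ArrayList<>();
        for (K key : keys) {
            result.add(backing.get(key)); // null if missing
        }
        return result;
    }

    // Stores one value per key, positionally aligned with the key list.
    public void multiPut(List<K> keys, List<V> vals) {
        for (int i = 0; i < keys.size(); i++) {
            backing.put(keys.get(i), vals.get(i));
        }
    }
}
```

In a real backend the HashMap is replaced by Cassandra reads and writes, but the contract the topology sees is the same.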

Most of the time, I try not to have any other part of my product depend on the location or structure of the data stored in the DB, so I do not really need to be super specific about the storage structure: that is up to the IBackingMap implementation I am delegating to. Read and write access to the DB is done via the Storm primitives, not by accessing the DB directly. Don't forget there's also the stateQuery primitive you can use to read your stored state from another place.

There are ways to configure column families and column names; have a look at the super clear storm-cassandra doc to see how to do that with this implementation: https://github.com/hmsonline/storm-cassandra

My blog post from last year does indeed illustrate a full implementation, including an in-house IBackingMap implementation. I think that approach is sometimes needed when we want fine-grained control over things. I should have made it clearer that this is not necessarily the default approach to take.


I hope this makes sense now.

S

On Wed, Feb 5, 2014 at 11:15 PM, Adrian Mocanu <amocanu@verticalscope.com> wrote:
Thank you Svend and Adam.

Svend I'm your reader and that tutorial is very useful. I've been spending lots of time looking
at the code and that blog post.

BTW I initially thought you were adding the nulls incorrectly in Q3 below, but now I see you're
doing it correctly.

I have a follow up question:
Why do you say that "we do not implement multiget/multiput, we just take an existing implementation for Cassandra or Memcached or anything and they do what's right for that backend"?
I thought that I had to rewrite an IBackingMap implementation to correspond to the tuples and schema I have in my database. I use Cassandra.
I started with com.hmsonline.storm.cassandra.trident.CassandraState or trident.cassandra.CassandraState (they both implement IBackingMap) and I replaced multiGet and multiPut to match my db schema (well, I'm trying to).

You are saying I can use CassandraState as it is? :D
If so, how would it even know what table my data should go into? It allows you to set the column family and a few other things for where state will be saved (keyspace, column family, replication, rowKey). By "state" I think it means something like a txID (transaction ID). Do you by any chance know what this state that CassandraState is saving is?
So as you can tell, I have no idea how to use CassandraState.

Thanks again!
-A
From: Svend Vanderveken [mailto:svend.vanderveken@gmail.com]
Sent: February-05-14 2:56 PM

To: user@storm.incubator.apache.org
Subject: Re: Svend's blog - several questions




On Wed, Feb 5, 2014 at 6:22 PM, Adrian Mocanu <amocanu@verticalscope.com> wrote:
I've read Svend's blog [http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/]
multiple times and I have a few questions.


So you are my reader! Great :D
(you can post your questions on the blog itself, I'm more likely to spot it there)



"Because we did a groupBy on one tuple field, each List contains here one single
String: the correlationId. Note that the list we return must have exactly the same
size as the list of keys, so that Storm knows what period corresponds to what key.
So for any key that does not exist in DB, we simply put a null in the resulting list."

Q1: Do the db keys come only from groupBy?

Yes, the key values arriving in the multiGet are the field values by which we are grouping: do groupBy(new Fields("color")) and you get things like "blue", "green", "flowerly romantic red"...




Q2: Can you groupBy multiple keys, like .groupBy("name").groupBy("id")?

yes, the syntax is like this:

groupBy (new Fields("name", "id"))

That's the reason the keys in the multiGet are List<Object> and not simply Object. We receive them in the order they are specified in the topology definition.
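To make the key shape concrete, a minimal sketch (illustrative only; the field values "alice" and 42 are hypothetical, not from the thread): with groupBy(new Fields("name", "id")), each key that reaches multiGet is a List<Object> whose elements follow the declared field order.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: builds the List<Object> key Trident would pass to
// multiGet for a tuple grouped by several fields, in declaration order.
class GroupKeys {
    public static List<Object> keyFor(Object... fieldValues) {
        return Arrays.asList(fieldValues); // e.g. ["alice", 42]
    }
}
```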


Q3: When we add null we keep the size of the results list the same as the keys list, but I don't understand how we make sure that key(3) points to the correct result(3). After all, we're adding nulls at the end of the results list, not intermittently. I.e.: if key(1) does not have an entry in the db, and the key list size is 5, we add null to the last position in results, not to results(1). This doesn't preserve consistency/order, so key(1) now gives result(1), which is not null as it should be. Is the code incorrect, or is the explanation on Svend's blog incorrect?


The order should indeed be respected: if the strategy for handling a DB error in a multi-get is to put nulls, then they should indeed be at the index corresponding to the problematic key. Is there a part of my toy project code that is padding nulls at the end? If so, that's indeed a bug; please let me know where (or better, fork and send me a pull request).

Note that I'm not particularly recommending putting nulls in case of unrecoverable errors in a multi-get; that's actually a simplistic way of handling the error. The contract with Storm is either to fail or to return a list of the correct size in the correct order. The data itself and its semantics are up to the topology implementation, i.e. up to us.




Moving on,
"Once this is loaded Storm will present the tuples having the same correlation ID
one by one to our reducer, the PeriodBuilder"

Q4: Does Trident/Storm call the reducer after calling multiGet and before calling multiPut?

yes

Q5: What params (and their types) are passed to the reducer and what parameters should it
emit so they can go into multiGet?

the reducer is called iteratively: it starts with the state found in the DB (returned by the multiGet) and the first grouped tuple, then the second, then the third... until the last tuple. The return value of the last call of the reducer is what is provided to the multiPut, for the same key as the multiGet.

"reduce" is actually a very common pattern in functional programming, which we Java programmers are sometimes less aware of. Look up some general doc on "reduce"; the Storm approach to it is very traditional, i.e. Storm has defined the "reduce" primitive exactly the way many other tools define it.


Q6: The first time the program is run, the database is empty and multiGet will return nothing. Does the reducer need to take care to insert for the first time, as opposed to updating a value? I do see that the reducer (TimelineUpdater) checks for nulls, and I'm guessing this is why it does so.



Exactly.

That's also why returning null in case of an error in the multiget is questionable and probably not what you would systematically do: it is equivalent to saying "there's garbage in persistence for that key, so let's just consider there's nothing". The actually proper thing to do depends on the task at hand, but such an error in multiget is often a symptom that we stored garbage in persistence in the past due to some other bug, and it's too late to correct it now.

Last thing: most of the time we do not implement multiget/multiput ourselves; we just take an existing implementation for Cassandra or Memcached or whatever, and it does what's right for that backend.


Q7:
Can someone explain what these mean:
.each (I've seen this used even consecutively: .each(..).each(..))
.newStream
.newValuesStream
.persistentAggregate


I think they are all detailed here: https://github.com/nathanmarz/storm/wiki/Trident-API-Overview


I am unable to find javadocs with documentation for the method signatures.
These javadocs don't help much: http://nathanmarz.github.io/storm/doc/storm/trident/Stream.html


Q8:
Storm has ack/fail; does Trident handle that automatically?


Yes, although you can also explicitly trigger errors. Look out for my next blog post: error handling in Storm Trident.




Q9: Has anyone tried Spark? http://spark.incubator.apache.org/streaming/
I'm wondering if anyone has tried it because I'm thinking of ditching storm and moving to
that.
It seems much much much better documented.


Spark looks cool. I've not played with it yet, no. Go ahead, keep us posted on what you find out!



Lots of questions I know. Thanks for reading!


and you :D


-Adrian



Svend

