storm-user mailing list archives

From Laurent Thoulon <laurent.thou...@ldmobile.net>
Subject hmsonline/storm-cassandra
Date Fri, 03 Jan 2014 09:27:14 GMT
Hi, 

We've been using Cassandra in our topologies for some time now. When we started, there was
no CassandraState that suited our needs, so we basically reinvented the wheel based on an
old CassandraState that used Hector.
What we implemented lets the CassandraMapState use dynamic column family names, rowkeys
and column names, and adds the ability to use Composites. By dynamic I mean they can be fetched
from the tuple.
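
To make "dynamic" concrete, here is a minimal sketch in plain Java, with no Storm or Hector dependency. The class `TupleAddressing` and its methods are hypothetical names used for illustration only (they are not taken from the Gist), the tuple is modelled as a plain Map, and a ':'-joined string stands in for a real Composite.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: derives the rowkey and column name from tuple values
// instead of reading them from fixed options.
class TupleAddressing {

    // Picks the values of the given fields, in order, from the tuple.
    static List<Object> select(Map<String, Object> tuple, List<String> fields) {
        List<Object> parts = new ArrayList<Object>();
        for (String field : fields) {
            if (!tuple.containsKey(field)) {
                throw new IllegalArgumentException("Missing field: " + field);
            }
            parts.add(tuple.get(field));
        }
        return parts;
    }

    // Joins the parts with ':' as a readable stand-in for a real Composite.
    static String composite(List<Object> parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                sb.append(':');
            }
            sb.append(parts.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up tuple values, for illustration only.
        Map<String, Object> tuple = new LinkedHashMap<String, Object>();
        tuple.put("a", "acme");
        tuple.put("c", 42);
        tuple.put("r", "fr");
        tuple.put("e", "click");
        tuple.put("timestamp", 1388741234L);

        String rowKey = composite(select(tuple, Arrays.asList("a", "c", "r", "e")));
        String column = composite(select(tuple, Arrays.asList("timestamp")));
        System.out.println(rowKey + " / " + column); // prints "acme:42:fr:click / 1388741234"
    }
}
```

The point of the sketch is only that the field lists, not the values, are configured up front; the actual key and column are computed per tuple.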
It works nicely, but we've been seeing some performance issues when scaling and we think
they may be coming from Hector's batch mutations.

I'm not going to go through all our thoughts, but we also decided to rebuild our topologies
to make them smaller, with fewer goals each, so we can pinpoint the bottlenecks more
easily.

Just so everything is said, we're using Trident. 

Now we're considering using Astyanax, so we thought it might be a good idea to try
hmsonline/storm-cassandra as it's part of Storm's contrib. We've successfully implemented
a basic use case but we're now facing some more complex ones. Our main problem is that the
CassandraMapState seems to restrict us to a particular schema for the CFs: keys are composites,
and the column name, column family and TTL are fixed in the options. These are the same
kind of reasons that led us to refactor the CassandraMapState in the first place. We're actually
surprised no one seems to have had the same needs, and we're wondering whether there is a better
approach to what we want to do that we did not think of.

We have two kinds of topologies we're building: 
- Topologies that store counters in an opaque way in various column families (for various
granularities) using rowkeys that can be composite or not and dynamic column names (timestamps
or composites made of ids and timestamps depending on the current tuple)
- Topologies that store, in a non-transactional way, a hashmap of <column name, column
value> under a rowkey that depends on the tuple.
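
For readers less familiar with the first kind, here is a minimal sketch of what storing a counter "in an opaque way" implies, assuming Trident's usual opaque-transactional semantics: each cell keeps the last batch id together with the previous and current values, so that a replayed batch never counts twice. `OpaqueCounter` is a hypothetical, simplified class written for this illustration; it is neither Trident's own class nor the code in the Gist.

```java
// Hypothetical, simplified model of an opaque-transactional counter cell.
final class OpaqueCounter {
    final long txid; // id of the last batch applied to this cell
    final long prev; // value before that batch
    final long curr; // value after that batch

    OpaqueCounter(long txid, long prev, long curr) {
        this.txid = txid;
        this.prev = prev;
        this.curr = curr;
    }

    // Applies a batch's partial count to the stored cell (null if the cell is absent).
    static OpaqueCounter apply(OpaqueCounter stored, long batchTxid, long delta) {
        if (stored == null) {
            return new OpaqueCounter(batchTxid, 0L, delta);
        }
        // Replay of the same batch: restart from prev so nothing is counted twice.
        // New batch: the current value becomes the new baseline.
        long base = (stored.txid == batchTxid) ? stored.prev : stored.curr;
        return new OpaqueCounter(batchTxid, base, base + delta);
    }

    public static void main(String[] args) {
        OpaqueCounter cell = apply(null, 1L, 5L); // batch 1 adds 5
        cell = apply(cell, 2L, 3L);               // batch 2 adds 3
        cell = apply(cell, 2L, 4L);               // batch 2 replayed with different content
        System.out.println(cell.prev + " -> " + cell.curr); // prints "5 -> 9"
    }
}
```

The second kind of topology needs none of this bookkeeping: a non-transactional write simply overwrites the columns, which is why the two cases put different constraints on the state implementation.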

Does anyone have the same needs?
Would you have any advice on how to achieve our goals in the most efficient way?
Should we just use our own CassandraState and move it to Astyanax?
We'd be glad to talk about this and share our knowledge with the community. 

If you'd like to see what we've done with our homebrewed CassandraState, I created this Gist:

https://gist.github.com/Crystark/aca10845fb31f75e9b41 

Here's what a partitionPersist looks like: 

.partitionPersist(
    getCassandraState(),
    new Fields("timestamp", "e", "a", "c", "r", "count"),
    new CassandraMultiputUpdater(CfStats.CF, new Fields("a", "c", "r", "e"),
        new Fields("timestamp"), new Fields("count"), CfStats.TTL)
)

And what a stateQuery looks like: 

.stateQuery(
    topology.newStaticState(getCassandraState()),
    new Fields("a", "c"),
    new CassandraMapGet(CfUser.CF, new Fields("a", "c")),
    // config in getCassandraState sets a limit of 1 and a range on columns for CfUser.CF
    new Fields("mapWithOneResult")
)

Here are the versions we're running:
- Java 6 
- Kafka 0.7 
- Storm 0.9.0-wip16 
- Cassandra 1.2.4 
We're considering upgrading these to Java 7, Kafka 0.8, Storm 0.9 and Cassandra 2 respectively.

Thanks 
Regards 
Laurent 


