Agree w/ Svend.  That's a good use case for Cassandra on the output side.

I’d also suggest that you tackle a use case that uses a ColumnFamily as input
(perhaps fed from a Kafka queue of row/partition keys).

Then, use Svend’s suggestion to route those keys to the machines that host the data; a rough sketch follows.
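A rough sketch of that input side (assuming the storm-kafka module's 0.9.x-era API; the topic name, ZooKeeper address, and parallelism are placeholders):

import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class RowKeyIngest {
    public static void main(String[] args) {
        // Kafka topic carrying one Cassandra row/partition key per message;
        // broker metadata is discovered via ZooKeeper (address is a placeholder).
        SpoutConfig conf = new SpoutConfig(
                new ZkHosts("zookeeper:2181"), "row-keys", "/kafka-storm", "row-key-spout");
        conf.scheme = new SchemeAsMultiScheme(new StringScheme()); // emit keys as plain strings

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("keys", new KafkaSpout(conf), 4);
        // A downstream bolt would read each row from the local Cassandra node;
        // the routing is what Svend's partitioning idea takes care of.
    }
}

From there, a partition-aware grouping (per Svend's idea) would send each key to the worker sitting next to the replica that owns it.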

-brian

---

Brian O'Neill

Chief Architect

Health Market Science

The Science of Better Results

2700 Horizon Drive  King of Prussia, PA  19406

M: 215.588.6024 @boneill42    

healthmarketscience.com


From: Svend Vanderveken <svend.vanderveken@gmail.com>
Reply-To: <user@storm.incubator.apache.org>
Date: Friday, January 10, 2014 at 6:39 AM
To: <user@storm.incubator.apache.org>
Subject: Re: Storm research suggestions

Hi, 

Cool, I hope this thing can get started, then. Based on the comments from Brian, Adam, Klausen, and Michael, which I was happy to read, I feel I would not be the only one willing to share ideas and/or code about this :D

I guess a starting point would be to dig into the details of the available strategies for partitioning data in Cassandra: 

http://www.datastax.com/docs/1.0/cluster_architecture/partitioning


Then imagine you have a bunch of Storm tuples coming in in real time, each including, say, a geo-localization, and you want to regroup all the events happening in the same "locationId" (e.g. postal code, rounded latitude/longitude, whatever...) in order to keep some counters for each such group. Storm is going to partition the processing of all those tuples across its cluster, so the idea is to tell Storm to do so in the same fashion as Cassandra partitions the storage (the counters for each locationId). Hmmm, maybe it's as simple as adding a tuple field that contains the result of the Cassandra partitioner (by querying it, or by including that logic in the topology) and doing a partitionBy() on that; a rough sketch follows. Actually, as I understand it, groupBy() + persistentAggregate is built on top of a simple partitionBy() + partitionAggregate(), so code-wise this whole thing might not be huge. Go ahead, that sounds cool!
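Something like this, roughly (Java, against the 0.9.x-era Trident API; the Guava murmur3_128 call is only a stand-in for the cluster's real partitioner, and locationSpout is a hypothetical spout emitting a "locationId" field):

import java.nio.charset.StandardCharsets;

import com.google.common.hash.Hashing;

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.spout.IBatchSpout;
import storm.trident.tuple.TridentTuple;

public class TokenAlignedTopology {

    // Adds a "token" field that mirrors (approximately) what the Cassandra
    // partitioner would compute for the locationId.
    public static class CassandraToken extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String locationId = tuple.getString(0);
            // Stand-in hash: a real topology should reproduce the exact token
            // of the cluster's configured partitioner (e.g. Murmur3Partitioner).
            long token = Hashing.murmur3_128()
                    .hashString(locationId, StandardCharsets.UTF_8).asLong();
            collector.emit(new Values(token));
        }
    }

    public static TridentTopology build(IBatchSpout locationSpout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("locations", locationSpout)
                .each(new Fields("locationId"), new CassandraToken(), new Fields("token"))
                // Partition the processing the way Cassandra partitions the
                // storage, so each group's reduction can run near its replica.
                .partitionBy(new Fields("token"));
        return topology;
    }
}

Note that partitionBy() on its own only makes the routing deterministic in the token; actually pinning each token range to the worker next to its replica needs a custom grouping on top (see the sketch further down the thread).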

Cheers, 

Svend






On Thu, Jan 9, 2014 at 9:53 PM, Tobias Pazer <tobiaspazer@gmail.com> wrote:
This is exactly what I was looking for, as I'm reading a lot about Hadoop at the same time. I haven't got any experience with partitioning alignment so far, so I would appreciate any suggestions on how to approach the topic efficiently. But this shouldn't be a problem, as I still have until October...

Now I just have to convince my academic advisor.

Thanks so far, I think this topic is definitely worth looking into.




2014/1/9 Michael Oczkowski <Michael.Oczkowski@seeq.com>

+1 for this idea.  I heard DataStax was investigating Storm integration (like they do with Hadoop), but as far as I know this isn’t going to happen.  The need for push-down analytics is great and very general, and any nice solution would help many people!

Also, to Brian’s point, it would be great to use Storm in lieu of Hadoop if it’s performant.

From: supercargo@gmail.com [mailto:supercargo@gmail.com] On Behalf Of Adam Lewis
Sent: Thursday, January 9, 2014 9:11 AM
To: user


Subject: Re: Storm research suggestions

I love it; even if it is a premature optimization, the beauty of academic work is that this should be measurable, and it's still an interesting finding either way.  I don't have the large-scale production experience with Storm that others here have (yet), but it sounds like it would really help performance, since you're going after network transfer.  And as you say, Svend, all the ingredients are already built into Trident.

Adam

On Thu, Jan 9, 2014 at 10:56 AM, Brian O'Neill <bone@alumni.brown.edu> wrote:

+1, love the idea.  I’ve wanted to play with partitioning alignment myself (with C*), but I’ve been too busy with the day job. =)

Tobias, if you need some support — don’t hesitate to reach out.

If you are able to align the partitioning, and we can add “in-place” computation within Storm, it would be great to see a speed comparison between Hadoop and Storm. (If comparable, it may drive people to abandon their Hadoop infrastructure for batch processing and run everything on Storm.)

-brian


From: Svend Vanderveken <svend.vanderveken@gmail.com>
Reply-To: <user@storm.incubator.apache.org>
Date: Thursday, January 9, 2014 at 10:46 AM
To: <user@storm.incubator.apache.org>
Subject: Re: Storm research suggestions

Hey Tobias,

Nice project, I would have loved to play with something like Storm back in my university days :)

Here's a topic that's been on my mind for a while (using the Trident API of Storm):

* one core idea of distributed map-reduce à la Hadoop was to perform as much processing as possible close to the data: you execute the "map" locally on each node where the data sits, do a first reduce there, then let only the result travel through the network for one last central reduce, so you get a result without having the whole DB travel the network every time

* Storm's groupBy + persistentAggregate + reducer/combiner gives us similar semantics: we map incoming tuples, then reduce them together with the other tuples in the same group and with the previously reduced value stored in the DB at regular intervals

* for each group, the operation above always happens on the same Storm task (i.e. the same "place" in the cluster) and stores its ongoing state in the same place in the DB, using the group value as the primary key (see the sketch just below)
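Roughly this shape (the in-memory state from Trident's testing utilities stands in for a Cassandra-backed state factory; the spout and the field names are assumptions):

import backtype.storm.tuple.Fields;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.spout.IBatchSpout;
import storm.trident.testing.MemoryMapState;

public class LocationCounters {
    public static TridentTopology build(IBatchSpout eventSpout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("events", eventSpout)
                // every tuple with the same locationId lands on the same task...
                .groupBy(new Fields("locationId"))
                // ...which keeps one running counter per group; a Cassandra-backed
                // StateFactory here would give the persistence discussed above
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"));
        return topology;
    }
}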
I believe it might be worth investigating whether the following pattern makes sense:

* install a distributed state store (e.g. Cassandra) on the same nodes as the Storm workers

* try to align the Storm partitioning triggered by the groupBy with the Cassandra partitioning, so that under usual happy circumstances (no crash) the Storm reduction happens on the node where Cassandra stores that particular primary key, avoiding the network round trip for the persistence (a sketch of such a grouping follows this list)
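One possible shape for that alignment, as a sketch only: a custom stream grouping that hashes the group key the way the state store does and looks the token up in a token-range-to-task map. The buildTokenToTaskMap helper is hypothetical; a real one would be derived from the Cassandra ring metadata plus the knowledge of which worker runs next to which node:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

import com.google.common.hash.Hashing;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

// Routes each tuple to the task co-located with the replica that owns its key.
public class ReplicaAlignedGrouping implements CustomStreamGrouping {

    private NavigableMap<Long, Integer> tokenToTask;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.tokenToTask = buildTokenToTaskMap(targetTasks);
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Hash the group key the way the state store partitions rows
        // (murmur3 here as a stand-in for the cluster's actual partitioner).
        long token = Hashing.murmur3_128()
                .hashString((String) values.get(0), StandardCharsets.UTF_8).asLong();
        Map.Entry<Long, Integer> owner = tokenToTask.floorEntry(token);
        if (owner == null) {
            owner = tokenToTask.lastEntry(); // wrap around the ring
        }
        return Arrays.asList(owner.getValue());
    }

    // Hypothetical helper: a real implementation would read the ring layout
    // (e.g. from the Cassandra client's cluster metadata) and pair each token
    // range with the task running next to its replica. Here tasks are just
    // spread evenly over the token space as a stand-in.
    private NavigableMap<Long, Integer> buildTokenToTaskMap(List<Integer> targetTasks) {
        NavigableMap<Long, Integer> map = new TreeMap<Long, Integer>();
        long rangeWidth = (Long.MAX_VALUE / targetTasks.size()) * 2;
        long start = Long.MIN_VALUE;
        for (Integer task : targetTasks) {
            map.put(start, task);
            start += rangeWidth;
        }
        return map;
    }
}

In plain Storm this plugs in via customGrouping(...) on the bolt declarer; Trident exposes similar hooks for custom partitioning.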
What do you think? Premature optimization? Does not make sense? Great idea? Let me know :)

S

On Thu, Jan 9, 2014 at 3:00 PM, Tobias Pazer <tobiaspazer@gmail.com> wrote:

Hi all,

I have recently started writing my master's thesis with a focus on Storm, as we are planning to implement the lambda architecture at our university.

As it's still not very clear to me where exactly it's worth diving in, I was hoping some of you might have suggestions.

I was thinking about a benchmark or something similar to systematically evaluate and improve the configuration of Storm, but I'm not sure whether that's even worth the time.

I think the more experienced of you definitely have further ideas!

Thanks and regards
Tobias