storm-user mailing list archives

From Tobias Pazer <tobiaspa...@gmail.com>
Subject Re: Strom research suggestions
Date Thu, 09 Jan 2014 20:53:43 GMT
This is exactly what I was looking for, as I am reading a lot about Hadoop
at the same time. Haven't got any experience with partitioning alignment so
far, so I would appreciate any suggestions on how to approach this topic
efficiently. But this shouldn't be a problem as I still have until
October...

Now I just have to convince my academic advisor.

Thanks so far. I think this topic is definitely worth looking into.




2014/1/9 Michael Oczkowski <Michael.Oczkowski@seeq.com>

>  +1 for this idea.  I heard DataStax was investigating Storm integration
> (like they do with Hadoop) but so far as I know this isn’t going to
> happen.  The need for push-down analytics is great and a very general
> problem and any nice solution would help many people!
>
>
>
> Also to Brian’s point it would be great to use Storm in lieu of Hadoop if
> it’s performant.
>
>
>
> *From:* supercargo@gmail.com [mailto:supercargo@gmail.com] *On Behalf Of *Adam
> Lewis
> *Sent:* Thursday, January 9, 2014 9:11 AM
> *To:* user
>
> *Subject:* Re: Strom research suggestions
>
>
>
> I love it; even if it is a premature optimization the beauty of academic
> work is that this should be measurable and is still an interesting finding
> either way.  I don't have the large scale production experience with Storm
> that others here have (yet), but it sounds like it would really help
> performance since you're going after network transfer.  And as you say,
> Svend, all the ingredients are already built in to Trident.
>
>
>
> Adam
>
>
>
> On Thu, Jan 9, 2014 at 10:56 AM, Brian O'Neill <bone@alumni.brown.edu>
> wrote:
>
>
>
> +1, love the idea.  I’ve wanted to play with partitioning alignment myself
> (with C*), but I’ve been too busy with the day job. =)
>
>
>
> Tobias, if you need some support — don’t hesitate to reach out.
>
>
>
> If you are able to align the partitioning, and we can add “in-place”
> computation within Storm, it would be great to see a speed comparison
> between Hadoop and Storm.   (If comparable, it may drive people to abandon
> their Hadoop infrastructure for batch processing, and run everything on
> Storm)
>
>
>
> -brian
>
>
>
> ---
>
> Brian O'Neill
>
> Chief Architect
>
> *Health Market Science*
>
> *The Science of Better Results*
>
> 2700 Horizon Drive • King of Prussia, PA • 19406
>
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
>
> healthmarketscience.com
>
>
>
>
>
>
>
>
> *From: *Svend Vanderveken <svend.vanderveken@gmail.com>
> *Reply-To: *<user@storm.incubator.apache.org>
> *Date: *Thursday, January 9, 2014 at 10:46 AM
> *To: *<user@storm.incubator.apache.org>
> *Subject: *Re: Strom research suggestions
>
>
>
> Hey Tobias,
>
>
>
>
>
> Nice project, I would have loved to play with something like Storm back in
> my university days :)
>
>
>
> Here's a topic that's been on my mind for a while (Trident API of Storm):
>
>
>
>
>
> * one core idea of distributed map reduce à la Hadoop was to perform as
> much processing as possible close to the data: you execute the "map"
> locally on each node where the data sits, you do a first reduce there, then
> you let the result travel through the network, you do one last reduce
> centrally and you have a result without having all your DB travel the
> network every time
>
>
>
> * Storm groupBy + persistentAggregate + reducer/combiner give us similar
> semantics: we map incoming tuples, then reduce them with other tuples in
> the same group and with the previously reduced value stored in the DB, at
> regular intervals
>
>
>
> * for each group, the operation above happens always on the same Storm
> Task (i.e. the same "place" in the cluster) and stores its ongoing state in
> the "same place" in DB, using the group value as primary key
>
>
>
> I believe it might be worth investigating if the following pattern would
> make sense:
>
>
>
> * install a distributed state store (e.g. Cassandra) on the same nodes as
> the Storm workers
>
>
>
> * try to align the Storm partitioning triggered by the groupBy with
> Cassandra partitioning, so that under usual happy circumstances (no crash),
> the Storm reduction is happening on the node where Cassandra is storing
> that particular primary key, avoiding the network travel for the
> persistence.
>
>
>
>
>
> What do you think? Premature optimization? Does not make sense? Great
> idea? Let me know :)
>
>
>
>
>
> S
>
>
>
>
>
>
>
> On Thu, Jan 9, 2014 at 3:00 PM, Tobias Pazer <tobiaspazer@gmail.com>
> wrote:
>
> Hi all,
>
> I have recently started writing my master's thesis with a focus on Storm, as
> we are planning to implement the lambda architecture at our university.
>
> As it's still not very clear to me where exactly it's worth diving in,
> I was hoping one of you might have some suggestions.
>
> I was thinking about a benchmark or something else to systematically
> evaluate and improve the configuration of Storm, but I'm not sure if this
> is even worth the time.
>
> I think the more experienced of you definitely have further ideas!
>
> Thanks and regards
> Tobias
>
>
>
>
>
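
The "reduce locally, then merge" semantics that Svend describes in the quoted message can be illustrated with a small standalone sketch. It is plain Python, not Storm or Trident code. The function names (combine, merge), the node names and the sample tuples are all made up for illustration. It only shows why a per-node first reduce sends fewer records over the network than shipping every raw tuple.

```python
from collections import defaultdict

def combine(pairs):
    """First reduce, done on the node that holds the tuples: sum per group."""
    totals = defaultdict(int)
    for group, value in pairs:
        totals[group] += value
    return dict(totals)

def merge(partials, stored=None):
    """Final reduce: fold the per-node partial sums into the stored state."""
    result = dict(stored or {})
    for partial in partials:
        for group, value in partial.items():
            result[group] = result.get(group, 0) + value
    return result

# Hypothetical input: (group, value) tuples as they arrive on two nodes.
node_batches = {
    "node-a": [("storm", 1), ("hadoop", 1), ("storm", 1)],
    "node-b": [("storm", 1), ("cassandra", 1)],
}

# Each node reduces its own batch before anything crosses the network.
partials = [combine(batch) for batch in node_batches.values()]

# Value previously persisted for one group (stands in for the DB state).
stored = {"storm": 10}
totals = merge(partials, stored)

raw_count = sum(len(batch) for batch in node_batches.values())
sent_count = sum(len(partial) for partial in partials)
print(totals)
print(f"records sent: {sent_count} instead of {raw_count}")
```

The saving grows with the number of tuples per group in each batch. With one tuple per group there is nothing to combine locally.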

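One way to get a first feeling for the alignment idea from the quoted thread is a toy simulation. It is only a sketch under strong assumptions. The four node names are hypothetical. Both the stream grouping and the state store are modelled as a simple hash modulo the node count, whereas a real Cassandra cluster places keys on a token ring. Nothing here calls Storm or Cassandra. It compares how many state writes stay on the local node when the grouping reuses the store's placement function versus when it hashes independently.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

def stable_hash(key, salt=""):
    """Deterministic hash, so results do not change between runs."""
    digest = hashlib.md5((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def store_owner(key):
    """Node where the simulated state store keeps the row for this key."""
    return NODES[stable_hash(key, "store") % len(NODES)]

def unaligned_task_node(key):
    """Node of the task picked by a hash that ignores the store layout."""
    return NODES[stable_hash(key, "stream") % len(NODES)]

def aligned_task_node(key):
    """Aligned grouping: route the group to the node that owns its row."""
    return store_owner(key)

def local_write_fraction(keys, task_node):
    """Share of keys whose reducing task sits on the row's owner node."""
    local = sum(1 for key in keys if task_node(key) == store_owner(key))
    return local / len(keys)

keys = [f"user-{i}" for i in range(10_000)]
aligned = local_write_fraction(keys, aligned_task_node)
unaligned = local_write_fraction(keys, unaligned_task_node)

print(f"aligned:   {aligned:.2%} of writes are node-local")
print(f"unaligned: {unaligned:.2%} of writes are node-local")
```

With independent hashing, roughly one write in len(NODES) is expected to be local, so the possible gain grows with cluster size. Whether this translates into measurable throughput is exactly what a benchmark on a real cluster would have to show.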