Subject: Re: Storm research suggestions
From: Svend Vanderveken
To: user@storm.incubator.apache.org
Date: Thu, 9 Jan 2014 16:46:31 +0100

Hey Tobias,

Nice project, I would have loved to play with something like Storm back in my university days :)

Here's a topic that's been on my mind for a while (the Trident API of Storm):

* one core idea of distributed map-reduce à la Hadoop was to perform as much processing as possible close to the data: you execute the "map" locally on each node where the data sits, do a first reduce there, then let only the result travel through the network, do one last reduce centrally, and you have your answer without the whole DB travelling the network every time

* Storm's groupBy + persistentAggregate + reducer/combiner gives us similar semantics: we map incoming tuples, then reduce them with the other tuples in the same group and with the previously reduced value stored in the DB at regular intervals

* for each group, the operation above always happens on the same Storm task (i.e. the same "place" in the cluster) and stores its ongoing state in the same "place" in the DB, using the group value as primary key
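To make that concrete, this is roughly what it looks like with the Trident API. It is essentially the word-count example from the Trident tutorial; the FixedBatchSpout and the field names are placeholders for a real stream, so take it as a sketch:

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class GroupedCountTopology {

    // Splits each sentence into words, one emitted tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology buildTopology() {
        // Toy spout emitting small batches of sentences; a real topology would
        // read from Kafka or similar.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("four score and seven years ago"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();

        // groupBy partitions the stream on "word", so every tuple for a given
        // word lands on the same task; persistentAggregate then combines the
        // tuples of each batch with Count() and merges the result with the
        // value previously stored for that key.
        TridentState wordCounts = topology
                .newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"))
                .parallelismHint(4);

        return topology.build();
    }
}

The groupBy on "word" is what pins all occurrences of a given word to the same task, and MemoryMapState.Factory is just the in-memory stand-in for a real store.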
I believe it might be worth investigating whether the following pattern would make sense:

* install a distributed state store (e.g. Cassandra) on the same nodes as the Storm workers

* try to align the Storm partitioning triggered by the groupBy with the Cassandra partitioning, so that under usual happy circumstances (no crash) the Storm reduction happens on the node where Cassandra stores that particular primary key, avoiding the network round trip for the persistence (I pasted a rough skeleton of the Trident state plumbing this would need at the bottom of this mail)

What do you think? Premature optimization? Doesn't make sense? Great idea? Let me know :)

S


On Thu, Jan 9, 2014 at 3:00 PM, Tobias Pazer <tobiaspazer@gmail.com> wrote:

> Hi all,
>
> I have recently started writing my master's thesis with a focus on Storm,
> as we are planning to implement the lambda architecture at our university.
>
> As it's still not entirely clear to me where exactly it's worth diving in,
> I was hoping one of you might have some suggestions.
>
> I was thinking about a benchmark or something else to systematically
> evaluate and improve the configuration of Storm, but I'm not sure if this
> is even worth the time.
>
> I think the more experienced among you definitely have further ideas!
>
> Thanks and regards
> Tobias
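The skeleton mentioned above: a rough sketch of where a co-located Cassandra store would plug into Trident. The class name CassandraBackingMap is made up and the actual CQL calls are left as comments, so treat it as an illustration of the plumbing rather than working persistence code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.task.IMetricsContext;
import storm.trident.state.State;
import storm.trident.state.StateFactory;
import storm.trident.state.map.IBackingMap;
import storm.trident.state.map.NonTransactionalMap;

// Skeleton of a Cassandra-backed Trident map state. The groupBy values arrive
// as the keys of multiGet/multiPut, so if the Cassandra row key is that same
// group value and Cassandra runs on the worker nodes, the read/write for a key
// can, in the happy case, stay on the local node.
public class CassandraBackingMap implements IBackingMap<Long> {

    @Override
    public List<Long> multiGet(List<List<Object>> keys) {
        List<Long> values = new ArrayList<Long>(keys.size());
        for (List<Object> key : keys) {
            // TODO: SELECT the current aggregate for this key, ideally from the
            // local Cassandra node; null means "no value stored yet".
            values.add(null);
        }
        return values;
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<Long> vals) {
        for (int i = 0; i < keys.size(); i++) {
            // TODO: UPSERT vals.get(i) under keys.get(i); with aligned
            // partitioning this write targets a replica on this very node.
        }
    }

    // Plugs the backing map into persistentAggregate(...).
    public static class Factory implements StateFactory {
        @Override
        public State makeState(Map conf, IMetricsContext metrics,
                               int partitionIndex, int numPartitions) {
            // Non-transactional for brevity; an opaque or transactional map
            // would be the right choice for exactly-once semantics.
            return NonTransactionalMap.build(new CassandraBackingMap());
        }
    }
}

Wiring it in would just mean replacing the MemoryMapState.Factory above with new CassandraBackingMap.Factory() in the persistentAggregate call; the co-location benefit then depends entirely on whether the key that groupBy hashes on ends up owned by the Cassandra replica running on the same worker node.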