From: Robert Metzger
Date: Wed, 18 Nov 2015 11:14:16 +0100
Subject: Re: Creating a representative streaming workload
To: "user@flink.apache.org"

Hey Vasia,

I think a very common workload would be an event stream from web servers of an online shop. Usually, these shops have multiple servers, so events arrive out of order.
I think there are plenty of different use cases that you can build around that data:
- Users perform different actions that a streaming system could track (analysis of click-paths),
- some simple statistics using windows (items sold in the last 10 minutes, ..).
- Maybe fraud detection would be another use case.
- Often, there also needs to be a sink to HDFS or another file system for a long-term archive.

I would love to see such an event generator in flink's contrib module. I think that's something the entire streaming space could use.

On Mon, Nov 16, 2015 at 8:22 PM, Nick Dimiduk wrote:

> All those should apply for streaming too...
>
> On Mon, Nov 16, 2015 at 11:06 AM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com> wrote:
>
>> Hi,
>>
>> thanks Nick and Ovidiu for the links!
>>
>> Just to clarify, we're not looking into creating a generic streaming
>> benchmark. We have quite limited time and resources for this project. What
>> we want is to decide on a set of 3-4 _common_ streaming applications. To
>> give you an idea, for the batch workload, we will pick something like a
>> grep, one relational application, a graph algorithm, and an ML algorithm.
>>
>> Cheers,
>> -Vasia.
>>
>> On 16 November 2015 at 19:25, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.marcu@inria.fr> wrote:
>>
>>> Regarding Flink vs Spark / Storm you can check here:
>>> http://www.sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark
>>>
>>> Best regards,
>>> Ovidiu
>>>
>>> On 16 Nov 2015, at 15:21, Vasiliki Kalavri wrote:
>>>
>>> Hello squirrels,
>>>
>>> with some colleagues and students here at KTH, we have started 2
>>> projects to evaluate (1) performance and (2) behavior in the presence of
>>> memory interference in cloud environments, for Flink and other systems. We
>>> want to provide our students with a workload of representative applications
>>> for testing.
>>>
>>> While for batch applications, it is quite clear to us what classes of
>>> applications are widely used and how to create a workload of different
>>> types of applications, we are not quite sure about the streaming workload.
>>>
>>> That's why, we'd like your opinions! If you're using Flink streaming in
>>> your company or your project, we'd love your input even more :-)
>>>
>>> What kind of applications would you consider as "representative" of a
>>> streaming workload? Have you run any experiments to evaluate Flink versus
>>> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>>>
>>> We will of course be happy to share our results with everyone after we
>>> have completed our study.
>>>
>>> Thanks a lot!
>>> -Vasia.
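
For illustration only, here is a rough sketch of the online-shop event stream suggested at the top of this message: several web servers emit user actions, each event is delivered with its own delay so the stream arrives out of event-time order, and a simple statistic counts purchases per 10-minute window. It is not an existing Flink contrib generator and uses no Flink API. The function names, event fields, action names, and delay ranges are all hypothetical choices made for this sketch.

```python
import random
from collections import Counter

# The "last 10 minutes" window from the message, in milliseconds.
WINDOW_MS = 10 * 60 * 1000


def generate_events(n, servers=3, max_delay_ms=5_000, seed=42):
    """Return n hypothetical shop events sorted by arrival time.

    Every event gets a random delivery delay, so the returned list is
    generally NOT sorted by event time (events arrive out of order).
    """
    rng = random.Random(seed)
    events = []
    event_time = 0
    for _ in range(n):
        # Time at which the user action actually happened.
        event_time += rng.randint(100, 2_000)
        events.append({
            "event_time": event_time,
            "arrival_time": event_time + rng.randint(0, max_delay_ms),
            "server": rng.randrange(servers),
            "user": rng.randrange(100),
            "action": rng.choice(["view", "add_to_cart", "purchase"]),
        })
    # The consumer sees events in the order they arrive, not the order they occurred.
    return sorted(events, key=lambda e: e["arrival_time"])


def purchases_per_window(events, window_ms=WINDOW_MS):
    """Count purchases per tumbling event-time window, keyed by window start.

    Windows are assigned from event_time, so the result does not depend
    on the order in which the events arrived.
    """
    counts = Counter()
    for e in events:
        if e["action"] == "purchase":
            window_start = e["event_time"] // window_ms * window_ms
            counts[window_start] += 1
    return dict(counts)


if __name__ == "__main__":
    stream = generate_events(1_000)
    for start, sold in sorted(purchases_per_window(stream).items()):
        print(f"window starting at {start} ms: {sold} purchases")
```

A real workload would also need the click-path, fraud-detection, and archive-sink parts mentioned above; this sketch only covers the out-of-order source and the windowed count.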