flink-user mailing list archives

From Andra Lungu <lungu.an...@gmail.com>
Subject Re: Creating a representative streaming workload
Date Tue, 24 Nov 2015 13:46:06 GMT

Sorry for the ultra-late reply.

Another real-life streaming scenario is the one I am working on:
- collecting data from telecom cells in real time,
- filtering out certain information, and enriching/correlating events (adding
additional information based on the parameters received),
- all in order to understand what is happening in the network and to ensure
better quality of service.
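[Editor's note: a minimal plain-Java sketch of the filter-then-enrich step described above. The event fields, the signal threshold, and the cell-to-region lookup table are all invented for illustration; a real pipeline would use Flink operators over a live source rather than hard-coded data.]

```java
import java.util.HashMap;
import java.util.Map;

public class EnrichmentSketch {

    // A minimal telecom event: which cell reported it and a signal reading.
    // These fields are assumptions for the sketch.
    static final class CellEvent {
        final String cellId;
        final int signalDbm;
        String region; // filled in by enrichment
        CellEvent(String cellId, int signalDbm) {
            this.cellId = cellId;
            this.signalDbm = signalDbm;
        }
    }

    // Static lookup table standing in for reference data (cell -> region).
    static final Map<String, String> CELL_REGIONS = new HashMap<>();
    static {
        CELL_REGIONS.put("cell-1", "north");
        CELL_REGIONS.put("cell-2", "south");
    }

    // Filtering step: keep only events with a usable signal reading.
    static boolean keep(CellEvent e) {
        return e.signalDbm > -110;
    }

    // Enrichment/correlation step: attach reference data to the event.
    static CellEvent enrich(CellEvent e) {
        e.region = CELL_REGIONS.getOrDefault(e.cellId, "unknown");
        return e;
    }

    public static void main(String[] args) {
        CellEvent e = enrich(new CellEvent("cell-1", -70));
        System.out.println(e.cellId + " -> " + e.region);
    }
}
```

In a Flink job, `keep` and `enrich` would correspond to a `filter` followed by a `map` on the event stream.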

As for Robert's proposal, I'd like to work on the stream generator if there
is no time constraint, but first I'd like to hear more details. What kind of
data are we generating? How many fields are there, and of what types?
Ideally, the user calling the generator should be able to make these
decisions. Can we create a JIRA for this? That would make it easier to start
working on the task.
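[Editor's note: for concreteness, here is one way such a generator could simulate the out-of-order arrival Robert describes below. This is a plain-Java sketch; the event schema (user id, action, event time) and the bounded-delay arrival model are assumptions, not a settled design.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class OutOfOrderGenerator {

    // A web-shop event as sketched in the thread; fields are assumptions.
    static final class ShopEvent {
        final long eventTimeMs;
        final String userId;
        final String action; // "view", "add-to-cart", or "checkout"
        ShopEvent(long eventTimeMs, String userId, String action) {
            this.eventTimeMs = eventTimeMs;
            this.userId = userId;
            this.action = action;
        }
    }

    // Emit n events with increasing event times, then reorder them by a
    // simulated arrival time (event time plus a random delay of at most
    // maxDelayMs), mimicking events reported by multiple web servers.
    static List<ShopEvent> generate(int n, long maxDelayMs, long seed) {
        Random rnd = new Random(seed);
        String[] actions = {"view", "add-to-cart", "checkout"};
        List<ShopEvent> events = new ArrayList<>();
        long[] arrivalMs = new long[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            long eventTime = i * 100L; // one user action every 100 ms
            events.add(new ShopEvent(eventTime,
                    "user-" + rnd.nextInt(10),
                    actions[rnd.nextInt(actions.length)]));
            arrivalMs[i] = eventTime + (long) (rnd.nextDouble() * maxDelayMs);
            order[i] = i;
        }
        // Sort indices by simulated arrival time; each event's delay is
        // computed once, so the comparator is consistent.
        Arrays.sort(order, (a, b) -> Long.compare(arrivalMs[a], arrivalMs[b]));
        List<ShopEvent> arrivalOrder = new ArrayList<>(n);
        for (int i : order) {
            arrivalOrder.add(events.get(i));
        }
        return arrivalOrder;
    }

    public static void main(String[] args) {
        for (ShopEvent e : generate(10, 500, 42L)) {
            System.out.println(e.eventTimeMs + "ms " + e.userId + " " + e.action);
        }
    }
}
```

With `maxDelayMs` set to zero the stream is perfectly ordered, so the same generator covers both the ordered and out-of-order cases a benchmark might need.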


On Wed, Nov 18, 2015 at 12:14 PM, Robert Metzger <rmetzger@apache.org> wrote:

> Hey Vasia,
> I think a very common workload would be an event stream from web servers
> of an online shop. Usually, these shops have multiple servers, so events
> arrive out of order.
> I think there are plenty of different use cases that you can build around
> that data:
> - Users perform different actions that a streaming system could track
> (analysis of click-paths),
> - some simple statistics using windows (items sold in the last 10 minutes,
> ..).
> - Maybe fraud detection would be another use case.
> - Often, there also needs to be a sink to HDFS or another file system for
> a long-term archive.
> I would love to see such an event generator in Flink's contrib module. I
> think that's something the entire streaming space could use.
> On Mon, Nov 16, 2015 at 8:22 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>> All those should apply to streaming too...
>> On Mon, Nov 16, 2015 at 11:06 AM, Vasiliki Kalavri <
>> vasilikikalavri@gmail.com> wrote:
>>> Hi,
>>> thanks Nick and Ovidiu for the links!
>>> Just to clarify, we're not looking into creating a generic streaming
>>> benchmark. We have quite limited time and resources for this project. What
>>> we want is to decide on a set of 3-4 _common_ streaming applications. To
>>> give you an idea, for the batch workload, we will pick something like a
>>> grep, one relational application, a graph algorithm, and an ML algorithm.
>>> Cheers,
>>> -Vasia.
>>> On 16 November 2015 at 19:25, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>> Regarding Flink vs Spark / Storm you can check here:
>>>> http://www.sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark
>>>> Best regards,
>>>> Ovidiu
>>>> On 16 Nov 2015, at 15:21, Vasiliki Kalavri <vasilikikalavri@gmail.com>
>>>> wrote:
>>>> Hello squirrels,
>>>> with some colleagues and students here at KTH, we have started two
>>>> projects to evaluate (1) performance and (2) behavior in the presence of
>>>> memory interference in cloud environments, for Flink and other systems. We
>>>> want to provide our students with a workload of representative applications
>>>> for testing.
>>>> While for batch applications, it is quite clear to us what classes of
>>>> applications are widely used and how to create a workload of different
>>>> types of applications, we are not quite sure about the streaming workload.
>>>> types of applications, we are not quite sure about the streaming workload.
>>>> That's why we'd like your opinions! If you're using Flink streaming in
>>>> your company or your project, we'd love your input even more :-)
>>>> What kind of applications would you consider as "representative" of a
>>>> streaming workload? Have you run any experiments to evaluate Flink versus
>>>> Spark, Storm etc.? If yes, would you mind sharing your code with us?
>>>> We will of course be happy to share our results with everyone after we
>>>> have completed our study.
>>>> Thanks a lot!
>>>> -Vasia.
