spark-issues mailing list archives

From Paulo Cândido (JIRA) <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (SPARK-19125) Streaming Duration by Count
Date Thu, 12 Jan 2017 11:25:52 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paulo Cândido updated SPARK-19125:
----------------------------------
    Comment: was deleted

(was: Hi Mr. Owen,

Thank you for your attention. Your alternative solution will work fine for me. Since I use a
data generator, I can generate all the data as micro-batches before starting. So, at every interval,
I will have the expected data. That is enough to make my experiments reproducible, at least
at a high level.

Thank you.)

> Streaming Duration by Count
> ---------------------------
>
>                 Key: SPARK-19125
>                 URL: https://issues.apache.org/jira/browse/SPARK-19125
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>         Environment: Java
>            Reporter: Paulo Cândido
>
> I use Spark Streaming for scientific work. In this setting, we have to run the same experiment
many times with the same seed to obtain the same result. Every random component takes the
seed as input, so I can control it. However, there is one component that does not depend
on seeds and that we cannot control: the batch size. Regardless of how the stream is ingested,
the metric used to cut micro-batches is wall-clock time. This is a problem in a scientific setting
because if we run the same experiment with the same parameters many times, we can get a
different result each time, depending on the number of elements read in each batch. The same stream
source may generate different batch sizes across executions because of wall-clock time.
> My suggestion is to provide a new Duration metric: Count of Elements.
> Regardless of the time spent filling a micro-batch, every batch will always be the same size, and
when the source uses a seed to generate the same values, we will be able to replicate the
experiments with the same result, independent of throughput.
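
The count-based batching proposed above can be illustrated with a minimal sketch in plain Python. It does not use Spark and is not an existing Spark API. The names `generate_elements`, `batches_by_count`, and `run_experiment` are hypothetical, and the generator simply stands in for a seeded data source. It only shows that cutting a seeded stream by element count, rather than by wall-clock time, yields identical micro-batches on every run.

```python
import random
from itertools import islice


def generate_elements(seed, total):
    """Hypothetical seeded data generator: yields `total` pseudo-random ints."""
    rng = random.Random(seed)  # private RNG so the seed fully determines the output
    for _ in range(total):
        yield rng.randint(0, 999)


def batches_by_count(elements, batch_size):
    """Cut a stream into micro-batches of `batch_size` elements each.

    The boundary depends only on the element count, never on elapsed time.
    The final batch may be shorter if the stream runs out.
    """
    it = iter(elements)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_experiment(seed, total=10, batch_size=4):
    """Materialize all micro-batches up front for one experiment run."""
    return list(batches_by_count(generate_elements(seed, total), batch_size))


if __name__ == "__main__":
    first = run_experiment(seed=42)
    second = run_experiment(seed=42)
    print([len(b) for b in first])  # batch sizes are fixed by count
    print(first == second)          # same seed, same batches
```

Batches pre-generated this way could then, as one possible approach, be fed to Spark Streaming one per interval through a queue-backed stream such as `queueStream` with `oneAtATime` enabled, which is in the spirit of the workaround mentioned in the deleted comment.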



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


