spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
Date Wed, 24 Jun 2015 17:51:05 GMT

     [ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-7441:
------------------------------
    Target Version/s: 1.5.0  (was: 1.4.1)

> Implement microbatch functionality so that Spark Streaming can process a large backlog
> of existing files discovered in batch in smaller batches
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7441
>                 URL: https://issues.apache.org/jira/browse/SPARK-7441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Emre Sevinç
>              Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge backlog
> of existing files discovered in batch in smaller batches.
> Spark Streaming can process files that already exist in a directory, and depending on
> the value of "{{spark.streaming.minRememberDuration}}" (60 seconds by default; see SPARK-3276
> for details), a Spark Streaming application may receive thousands, or even hundreds of
> thousands, of files within the first batch interval. This, in turn, produces a 'flooding'
> effect, as the streaming application tries to deal with a huge number of existing files
> in a single batch interval.
>  We propose a very simple change to {{org.apache.spark.streaming.dstream.FileInputDStream}},
> controlled by a configuration property such as "{{spark.streaming.microbatch.size}}":
> with the default value of {{0}}, it keeps its current behavior (processing all files
> discovered as new in the current batch interval at once), while any positive value makes
> it process new files in groups of that size (e.g. in groups of 100).
> We have tested this patch at one of our customers' sites, and it has been running
> successfully for weeks (including cases where our Spark Streaming application was stopped,
> tens of thousands of files were created in the directory in the meantime, and the
> application had to process those existing files once it was restarted).
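The grouping behavior described above can be sketched in a few lines. This is a hypothetical illustration only, not the actual patch to {{FileInputDStream}}: the `microbatches` helper and its arguments are made up here to show how a backlog of discovered files would be split into groups governed by a setting like {{spark.streaming.microbatch.size}}.

```python
def microbatches(new_files, microbatch_size=0):
    """Yield groups of files to process, one group per batch interval.

    A microbatch_size of 0 (the assumed default) keeps the original
    behavior: every newly discovered file is processed in one batch.
    Any positive value splits the backlog into chunks of that size.
    """
    files = list(new_files)
    if microbatch_size <= 0:
        # Default behavior: the whole backlog in a single batch.
        yield files
        return
    # Chunk the backlog into fixed-size groups; the last group may be smaller.
    for i in range(0, len(files), microbatch_size):
        yield files[i:i + microbatch_size]


# Example: a backlog of 250 pre-existing files with a microbatch size of 100
# is spread over three batch intervals (100, 100, 50) instead of one.
backlog = [f"file-{i}" for i in range(250)]
groups = list(microbatches(backlog, microbatch_size=100))
```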



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

