spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jungtaek Lim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
Date Wed, 25 Apr 2018 22:54:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453209#comment-16453209
] 

Jungtaek Lim commented on SPARK-24036:
--------------------------------------

Maybe better to share what I've observed from continuous mode so far.
 * It leverages iterator hack to make logical batch (epoch) in stream.
 ** While iterator works different from normal, it doesn't touch existing operators by putting assumption
that all operators are chained and fit to single stage.
 ** With this assumption, only WriteToContinuousDataSourceExec needs to know how to deal
with iterator hack.
 ** Above assumption requires no repartition, which most of stateful operators need to deal
with.
 * Based on the hack, actually it doesn't put epoch marker flow through downstreams.
 ** To apply distributed snapshot it is mandatory, but it might require non-trivial change
of existing model, since checkpoint should be handled from each stateful operator and stored
in distributed manner, and coordinator should be able to check snapshots from all tasks are
taken correctly.
 ** This would be unnecessary change for batch, and making existing model being much complicated.
 ** This would bring latency concerns, since each operator should stop processing while
taking a snapshot. (I guess sending or storing snapshot still could be done asynchronously.)
 ** If there're more than one upstreams, it should arrange sequences between upstreams to
take a snapshot with only proper data within epoch.

So there is a huge challenge with existing model to extend continuous mode to support stateful
exactly-once (not about end-to-end exactly once, since it also depends on sink), and I'd
like to see the follow-up idea/design doc around continuous mode to see the direction of
continuous mode: whether relying on such assumption and try to explore (may need to have
more hacks/workarounds), or willing to discard assumption and redesign.

Most of features are supported with micro-batch manner, so also would like to see the goal
of continuous mode. Is it to cover all or most of features being supported with micro-batch?
Or is the goal of continuous mode only to cover low latency use cases?

 

> Stateful operators in continuous processing
> -------------------------------------------
>
>                 Key: SPARK-24036
>                 URL: https://issues.apache.org/jira/browse/SPARK-24036
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jose Torres
>            Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with stateful
operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message