spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shixiong Zhu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24699) Watermark / Append mode should work with Trigger.Once
Date Fri, 10 Aug 2018 20:50:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shixiong Zhu updated SPARK-24699:
---------------------------------
    Fix Version/s: 2.4.0

> Watermark / Append mode should work with Trigger.Once
> -----------------------------------------------------
>
>                 Key: SPARK-24699
>                 URL: https://issues.apache.org/jira/browse/SPARK-24699
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.1
>            Reporter: Chris Horn
>            Assignee: Tathagata Das
>            Priority: Major
>             Fix For: 2.4.0, 3.0.0
>
>         Attachments: watermark-once.scala, watermark-stream.scala
>
>
> I have a use case where I would like to trigger a structured streaming job from an external
scheduler (once every 15 minutes or so) and have it write window aggregates to Kafka.
> I am able to get my code to work when running with `Trigger.ProcessingTime` but when
I switch to `Trigger.Once` the watermarking feature of structured streams does not persist
to (or is not recollected from) the checkpoint state.
> This causes the stream to never generate output because the watermark is perpetually
stuck at `1970-01-01T00:00:00Z`.
> I have created a failing test case in the `EventTimeWatermarkSuite`, I will create a
[WIP] pull request on github and link it here.
>  
> It seems that even if it generated the watermark, and given the current streaming behavior,
I would have to trigger the job twice to generate any output.
>  
> The microbatcher only calculates the watermark off of the previous batch's input and
emits new aggs based off of that timestamp.
> This state is not available to a newly started `MicroBatchExecution` stream.
> Would it be an appropriate strategy to create a new checkpoint file with the most up
to watermark or watermark and query stats?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message