spark-dev mailing list archives

From Sam Elamin <hussam.ela...@gmail.com>
Subject Re: Structured Streaming. Dropping Duplicates
Date Tue, 07 Feb 2017 17:05:18 GMT
On another note, about checkpointing in Structured Streaming:

I noticed that if I have a stream running off S3 and I kill the process, the
next time the process starts it duplicates the last record
inserted. Is that normal?

So say I have streaming enabled on one folder, "test", which only has two
files, "update1" and "update 2", and then I kill the Spark job using Ctrl+C.
When I rerun the stream, it picks up "update 2" again.

Is this normal? Isn't Ctrl+C a failure?

I would expect checkpointing to know that "update 2" was already processed.
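
To make that expectation concrete, here is a minimal plain-Python sketch of
the behaviour I have in mind. It is not Spark's implementation, only an
illustration: `process_new_files` and `checkpoint.json` are hypothetical names,
and the "checkpoint" is assumed to be a simple JSON list of file names that
have already been handled.

```python
import json
import os
import tempfile


def process_new_files(source_dir, checkpoint_path):
    """Process files not yet recorded in the checkpoint; return their names."""
    # Load the set of files that a previous run already finished with.
    seen = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            seen = set(json.load(f))

    processed = []
    for name in sorted(os.listdir(source_dir)):
        if name in seen:
            continue  # already handled before the restart
        processed.append(name)  # stand-in for the real per-file work
        seen.add(name)
        # Persist after every file. A kill between the work above and this
        # write would cause that one file to be processed again on restart.
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(seen), f)
    return processed


# Hypothetical layout mirroring the "test" folder described above.
workdir = tempfile.mkdtemp()
source_dir = os.path.join(workdir, "test")
os.makedirs(source_dir)
for name in ("update1", "update 2"):
    with open(os.path.join(source_dir, name), "w") as f:
        f.write("{}\n")
checkpoint_path = os.path.join(workdir, "checkpoint.json")

first_run = process_new_files(source_dir, checkpoint_path)
second_run = process_new_files(source_dir, checkpoint_path)  # simulated restart
```

In this sketch the second call finds both files in the checkpoint and
processes nothing, which is what I expected from the restarted stream. The
comment inside the loop marks the one window where a kill would still lead to
a file being picked up twice.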

Regards
Sam

On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin <hussam.elamin@gmail.com> wrote:

> Thanks Michael!
>
>
>
> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>
>> We should add this soon.
>>
>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin <hussam.elamin@gmail.com>
>> wrote:
>>
>>> Hi All
>>>
>>> When I read a stream off S3 and try to drop duplicates, I get
>>> the following error:
>>>
>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>> Append output mode not supported when there are streaming aggregations on
>>> streaming DataFrames/DataSets;;
>>>
>>>
>>> What's strange is that if I use the batch "spark.read.json", it works.
>>>
>>> Can I assume you can't drop duplicates in Structured Streaming?
>>>
>>> Regards
>>> Sam
>>>
>>
>>
>
