spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bin Wang <wbi...@gmail.com>
Subject Re: How to recovery DStream from checkpoint directory?
Date Thu, 17 Sep 2015 09:07:45 GMT
Thanks Adrian, the hint of use updateStateByKey with initialRdd helps a lot!

Adrian Tanase <atanase@adobe.com>于2015年9月17日周四 下午4:50写道:

> This section in the streaming guide makes your options pretty clear
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code
>
>
>    1. Use 2 versions in parallel, drain the queue up to a point and strat
>    fresh in the new version, only processing events from that point forward
>       1. Note that “up to a point” is specific to you state management
>       logic, it might mean “user sessions stated after 4 am” NOT “events received
>       after 4 am”
>    2. Graceful shutdown and saving data to DB, followed by checkpoint
>    cleanup / new checkpoint dir
>       1. On restat, you need to use the updateStateByKey that takes an
>       initialRdd with the values preloaded from DB
>       2. By cleaning the checkpoint in between upgrades, data is loaded
>       only once
>
> Hope this helps,
> -adrian
>
> From: Bin Wang
> Date: Thursday, September 17, 2015 at 11:27 AM
> To: Akhil Das
> Cc: user
> Subject: Re: How to recovery DStream from checkpoint directory?
>
> In my understand, here I have only three options to keep the DStream state
> between redeploys (yes, I'm using updateStateByKey):
>
> 1. Use checkpoint.
> 2. Use my own database.
> 3. Use both.
>
> But none of  these options are great:
>
> 1. Use checkpoint: I cannot load it after code change. Or I need to keep
> the structure of the classes, which seems to be impossible in a developing
> project.
> 2. Use my own database: there may be failure between the program read data
> from Kafka and save the DStream to database. So there may have data lose.
> 3. Use both: Will the data load two times? How can I know in which
> situation I should use the which one?
>
> The power of checkpoint seems to be very limited. Is there any plan to
> support checkpoint while class is changed, like the discussion you gave me
> pointed out?
>
>
>
> Akhil Das <akhil@sigmoidanalytics.com>于2015年9月17日周四 下午3:26写道:
>
>> Any kind of changes to the jvm classes will make it fail. By
>> checkpointing the data you mean using checkpoint with updateStateByKey?
>> Here's a similar discussion happened earlier which will clear your doubts i
>> guess
>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+AHuK=Xoy8dSdAOBMgm935GOqYtaaqkPqSvDAQPMOjOtTJkzA@mail.gmail.com%3E
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbin00@gmail.com> wrote:
>>
>>> And here is another question. If I load the DStream from database every
>>> time I start the job, will the data be loaded when the job is failed and
>>> auto restart? If so, both the checkpoint data and database data are loaded,
>>> won't this a problem?
>>>
>>>
>>>
>>> Bin Wang <wbin00@gmail.com>于2015年9月16日周三 下午8:40写道:
>>>
>>>> Will StreamingContex.getOrCreate do this work?What kind of code change
>>>> will make it cannot load?
>>>>
>>>> Akhil Das <akhil@sigmoidanalytics.com>于2015年9月16日周三 20:20写道:
>>>>
>>>>> You can't really recover from checkpoint if you alter the code. A
>>>>> better approach would be to use some sort of external storage (like a
db or
>>>>> zookeeper etc) to keep the state (the indexes etc) and then when you
deploy
>>>>> new code they can be easily recovered.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbin00@gmail.com> wrote:
>>>>>
>>>>>> I'd like to know if there is a way to recovery dstream from
>>>>>> checkpoint.
>>>>>>
>>>>>> Because I stores state in DStream, I'd like the state to be recovered
>>>>>> when I restart the application and deploy new code.
>>>>>>
>>>>>
>>>>>
>>

Mime
View raw message