apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gaurav Gupta <gau...@datatorrent.com>
Subject Re: Why is Async checkpointing made default?
Date Mon, 23 Nov 2015 09:03:41 GMT
If the requirement is that the order is always begingWindow()->endWindow()->checkpointed(),
why to pass windowId in the checkpointed() call back?

Thanks
- Gaurav

> On Nov 22, 2015, at 11:22 PM, Chandni Singh <chandni@datatorrent.com> wrote:
> 
> FYI,
> 
> HDHTWriter implementation is dependent on the older semantics and seems to
> be broken now.
> startWindow(x) -> endWindow(x) -> checkpointed(x)
> In the checkpointed implementation, it copies certain state (transient) and
> transfers it to a checkpointedWriteCache with respect to window 'x'.
> 
> With Async checkpointing it, the state that is transferred is much more
> recent than window 'x'.
> 
> Chandni
> 
> 
> On Sun, Nov 22, 2015 at 11:04 PM, Chandni Singh <chandni@datatorrent.com>
> wrote:
> 
>> Agreed. Thomas's solution fixes the backward incompatibility. I think we
>> really need to fix this.
>> 
>> On Sun, Nov 22, 2015 at 10:23 PM, Timothy Farkas <tim@datatorrent.com>
>> wrote:
>> 
>>> Gaurav,
>>> 
>>> I think if the state copy fails then STRAM should roll back the operator
>>> to
>>> a checkpoint that is further back than the last checkpoint. If you are
>>> saying that you want to preserve the semantic that checkpointed is only
>>> called after a checkpoint is completed, I would argue that that guarantee
>>> is already pointless in the current implementation since it is always
>>> possible for an operator to be rolled back to a checkpoint before it's
>>> last
>>> completed checkpoint. So, it is already currently possible for some
>>> database or file operation performed after a completed checkpoint to be
>>> redone after a failure. Because of this I think Thomas's solution makes
>>> the
>>> most sense. Thomas's solution would also address Chandni's original point
>>> that the semantics for the checkpointed call back have been violated.
>>> There
>>> are operators in our libraries that have depended on the beginWindow(x),
>>> endWindow(x), and checkpointed(x) call sequence, which is now broken. We
>>> should probably fix that.
>>> 
>>> Tim
>>> 
>>> On Sun, Nov 22, 2015 at 10:02 PM, Gaurav Gupta <gaurav@datatorrent.com>
>>> wrote:
>>> 
>>>> Thomas,
>>>> 
>>>> This was done to preserve checkpointing semantics that is to tell the
>>>> operator that its state is preserved. Say if database is updated or
>>> files
>>>> are moved in checkpointed call but the state copy fails, how to address
>>>> such scenarios?
>>>> 
>>>> Thanks
>>>> - Gaurav
>>>> 
>>>>> On Nov 22, 2015, at 9:44 PM, Thomas Weise <thomas@datatorrent.com>
>>>> wrote:
>>>>> 
>>>>> Alternatively I would ask why the checkpointed callback needs to wait
>>>> until
>>>>> the data was copied to HDFS instead upon completion of the state
>>>>> serialization.
>>>>> 
>>>>> Thomas
>>>>> 
>>>>> 
>>>>> On Sun, Nov 22, 2015 at 9:41 PM, Chandni Singh <
>>> chandni@datatorrent.com>
>>>>> wrote:
>>>>> 
>>>>>> Gaurav,
>>>>>> 
>>>>>> My question is about why Async was made the default when it changed
>>> the
>>>>>> semantics of operator callbacks. Your response doesn't answer that.
>>>>>> 
>>>>>> In a way we broke backward compatibility.
>>>>>> 
>>>>>> Chandni
>>>>>> 
>>>>>> On Sun, Nov 22, 2015 at 9:22 PM, Gaurav Gupta <
>>> gaurav@datatorrent.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> The idea behind Async checkpointing is to unblock operator while
the
>>>>>> state
>>>>>>> is getting transferred to HDFS.
>>>>>>> Just to clarify that this beginWindow (x) -> endWindow(x)
->
>>>> checkpointed
>>>>>>> (x-1 ) should be an ideal sequence, but if the HDFS is slow or
for
>>> some
>>>>>>> other reason transferring the state to HDFS is slow this sequence
>>> may
>>>> not
>>>>>>> hold true.
>>>>>>> 
>>>>>>> Can your use case be addressed by
>>>>>>> https://malhar.atlassian.net/browse/APEX-78 <
>>>>>>> https://malhar.atlassian.net/browse/APEX-78>?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> - Gaurav
>>>>>>> 
>>>>>>>> On Nov 22, 2015, at 3:56 PM, Chandni Singh <
>>> chandni@datatorrent.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> With Async checkpointing the checkpoint callback in CheckpointPoint
>>>>>>>> listener is called for a previous window, that is,
>>>>>>>> beginWindow (x) -> endWindow(x) -> checkpointed (x-1
)
>>>>>>>> 
>>>>>>>> This feature was newly introduced. With synchronous checkpointing,
>>> the
>>>>>>>> behavior was always
>>>>>>>> beginWindow(x) -> endWindow(x) -> checkpointed (x)
>>>>>>>> 
>>>>>>>> A lot of operators were written before asynchronous checkpointing
>>> was
>>>>>>>> introduced and few of them can rely on the sequencing guaranteed
by
>>>>>>>> synchronous checkpointing.
>>>>>>>> 
>>>>>>>> So why was Async Checkpointed made default?
>>>>>>>> 
>>>>>>>> With how Async checkpoint is today, the complexity to handle
>>> transient
>>>>>>>> state in checkpointed callback falls on every operator. For
eg,
>>> lets
>>>>>> say
>>>>>>>> earlier I had a transient map which I cleared every time
the
>>>>>> checkpointed
>>>>>>>> was called, with async checkpointing this simple task will
be a lot
>>>>>> more
>>>>>>>> complicated.
>>>>>>>> 
>>>>>>>> I think Async checkpointing broke the semantics of operator
>>> callbacks
>>>>>> and
>>>>>>>> should NOT be the default.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message