apex-dev mailing list archives

From "David Yan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (APEXCORE-313) BufferServer not purged correctly in StramLocalCluster
Date Sat, 23 Jan 2016 22:39:39 GMT

     [ https://issues.apache.org/jira/browse/APEXCORE-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Yan updated APEXCORE-313:
-------------------------------
    Description: 
When an operator dies, the output data for that operator in the buffer server should be invalidated.
Currently it is not, and unless we do this:
{code}
localCluster.setPerContainerBufferServer(true);
{code}
it's possible for a newly recovered operator to receive stale ("ghost") data from an upstream
operator in the same checkpoint group that is still recovering.  When the upstream operator
finally recovers, it resends data from the recovery checkpoint that duplicates the ghost data,
which puts the application in a bad state.
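
The failure mode above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyBufferServer, downstreamView and the operator name "upstream" are hypothetical names made up for illustration, not Apex classes or APIs, and the real buffer server is considerably more involved. The sketch shows a downstream reader that sees the same tuples twice when the dead operator's retained output is not purged, and sees them once when it is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GhostDataSketch {

    /** Hypothetical stand-in for a buffer server: retained output tuples per publishing operator. */
    static final class ToyBufferServer {
        private final Map<String, List<Integer>> retained = new HashMap<>();

        void publish(String operator, int tuple) {
            retained.computeIfAbsent(operator, k -> new ArrayList<>()).add(tuple);
        }

        /** Returns the tuples retained for the operator, starting at index 'from'. */
        List<Integer> read(String operator, int from) {
            List<Integer> data = retained.getOrDefault(operator, Collections.<Integer>emptyList());
            return new ArrayList<>(data.subList(from, data.size()));
        }

        /** Invalidates everything the operator has published so far. */
        void purge(String operator) {
            retained.remove(operator);
        }
    }

    /** Returns what a recovered downstream operator ends up reading. */
    static List<Integer> downstreamView(boolean purgeOnDeath) {
        ToyBufferServer server = new ToyBufferServer();

        // The upstream operator emits tuples 1..3 after its last checkpoint, then dies.
        for (int t = 1; t <= 3; t++) {
            server.publish("upstream", t);
        }
        if (purgeOnDeath) {
            server.purge("upstream");
        }

        // The downstream operator recovers first and reads whatever is still retained
        // (the ghost data, if the dead operator's output was not purged).
        List<Integer> seen = new ArrayList<>(server.read("upstream", 0));

        // The upstream operator recovers and replays the same tuples from its checkpoint.
        for (int t = 1; t <= 3; t++) {
            server.publish("upstream", t);
        }
        seen.addAll(server.read("upstream", seen.size()));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(downstreamView(false)); // prints [1, 2, 3, 1, 2, 3]
        System.out.println(downstreamView(true));  // prints [1, 2, 3]
    }
}
```

In this model, purging on operator death plays the role that invalidating the dead operator's buffer server output would play; it says nothing about how setPerContainerBufferServer(true) avoids the problem internally.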

How to reproduce:

In DelayOperatorTest.java, comment out the lines with localCluster.setPerContainerBufferServer(true)
and run testFibonacciRecovery1.  At recovery, the FIB operator becomes blocked because of this
problem.  STRAM detects the blocked operator after 30 seconds and redeploys the operators,
after which things go back to normal.  The unit test eventually passes, but the recovery takes
more than 30 seconds.

  was:
When an operator dies, the output data for that operator in the buffer server should be invalidated.
Currently it is not, and unless we do this:
{code}
localCluster.setPerContainerBufferServer(true);
{code}
it's possible for a newly recovered operator to receive stale ("ghost") data from an upstream
operator in the same checkpoint group that is still recovering.  When the upstream operator
finally recovers, it resends data from the recovery checkpoint that duplicates the ghost data,
which puts the application in a bad state.



> BufferServer not purged correctly in StramLocalCluster 
> -------------------------------------------------------
>
>                 Key: APEXCORE-313
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-313
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: David Yan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
