apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vlad Rozov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXCORE-703) Window processing timeout for finished/undeployed container
Date Sun, 16 Apr 2017 04:26:42 GMT

    [ https://issues.apache.org/jira/browse/APEXCORE-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970233#comment-15970233
] 

Vlad Rozov commented on APEXCORE-703:
-------------------------------------

[~thw] I don't see undeployed operators being removed from the plan immediately and the state
of such operators is not updated, so they are still considered to be ACTIVE and are subject
to the blocked operators check. Why StreamingContainerManager does not set the state of an
operator to INACTIVE when it is added to the list of operators to undeploy?
{noformat}
            case SHUTDOWN:
              // schedule operator deactivation against the windowId
              // will be processed once window is committed and all dependent operators completed
processing
              long windowId = oper.stats.currentWindowId.get();
              if (ohb.windowStats != null && !ohb.windowStats.isEmpty()) {
                windowId = ohb.windowStats.get(ohb.windowStats.size() - 1).windowId;
              }
              LOG.debug("Operator {} deactivated at window {}", oper, windowId);
              synchronized (this.shutdownOperators) {
                Set<PTOperator> deactivatedOpers = this.shutdownOperators.get(windowId);
                if (deactivatedOpers == null) {
                  this.shutdownOperators.put(windowId, deactivatedOpers = new HashSet<>());
                }
                deactivatedOpers.add(oper);
              }
              sca.undeployOpers.add(oper.getId());
  ---> can we mark operator as inactive here?             
              slowestUpstreamOp.remove(oper);
              // record operator stop event
              recordEventAsync(new StramEvent.StopOperatorEvent(oper.getName(), oper.getId(),
oper.getContainer().getExternalId()));
              break;
{noformat} 

> Window processing timeout for finished/undeployed container
> -----------------------------------------------------------
>
>                 Key: APEXCORE-703
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-703
>             Project: Apache Apex Core
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Daniel Halperin
>
> Using Apex 3.5.0 with Apache Beam, I have a 10-container pipeline. The first container,
id #1, finishes and gets undeployed at 12:41:10 PM.
> Then, 60s later (at 12:42:10 PM), Apex decides that container is blocked because no data
has been received for 60s, declares failure, and restarts it.
> This would seem to be a bug -- shouldn't finished and undeployed operators be deregistered
from the timeout logic that is detecting stuck operators?
> Log below
> {code}
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Undeploy request: [1]
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer undeploy
> INFO: Undeploy complete.
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window
ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198930012, last window
id change time 1492198869957, window processing timeout millis 60000
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateCheckpoints
> INFO: Blocked operator PTOperator[id=1,name=TextIO.Read/Read] container PTContainer[id=1(container-6),state=ACTIVE]
time 60055ms
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Received shutdown request
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StramLocalCluster run
> INFO: Container container-6 restart.
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager scheduleContainerRestart
> INFO: Initiating recovery for container-6@localhost
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window
ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198931015, last window
id change time 1492198869957, window processing timeout millis 60000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message