incubator-s4-dev mailing list archives

From "Flavio Junqueira (Commented) (JIRA)" <>
Subject [jira] [Commented] (S4-3) Sometimes one process node owns 2 tasks
Date Thu, 13 Oct 2011 20:05:12 GMT


Flavio Junqueira commented on S4-3:

Hi Gavin, It sounds right that the connection loss is not being handled properly. It seems
that the code was written assuming that the connection loss would cause pNode to be deleted,
since it is ephemeral. This is not the case, since connection loss does not mean that the
session has expired.

I have two suggestions:

# Instead of exiting, you may consider just closing the session, which will cause the node
to be deleted because it is ephemeral;
# We may consider checking if pNode exists upon catching the exception and before trying to
acquire a new task.
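
The two suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: FakeStore, ConnectionLoss, closeSession and acquireTask are hypothetical names, and the in-memory map stands in for a ZooKeeper session rather than using the real client API. It shows a create that is applied even though its reply is lost, a check of the node before moving on to another task, and a session close that drops the ephemeral nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: FakeStore stands in for a ZooKeeper session; it is NOT the real client API.
class TaskClaimSketch {

    /** Hypothetical stand-in for ZooKeeper's ConnectionLossException. */
    static class ConnectionLoss extends Exception {}

    /** In-memory fake of the ephemeral task nodes: path -> owning session. */
    static class FakeStore {
        final Map<String, String> owners = new HashMap<>();
        boolean dropNextResponse = false; // simulate a reply that never arrives

        /** Returns false if the node already exists; may apply the create and still throw. */
        boolean create(String path, String session) throws ConnectionLoss {
            if (owners.containsKey(path)) {
                return false;
            }
            owners.put(path, session);
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new ConnectionLoss(); // request applied, response lost
            }
            return true;
        }

        /** Suggestion 1: closing the session removes every ephemeral node it owns. */
        void closeSession(String session) {
            owners.values().removeIf(session::equals);
        }
    }

    /** Suggestion 2: on connection loss, check the node before trying another task. */
    static String acquireTask(FakeStore store, String session, String[] tasks) {
        for (String task : tasks) {
            try {
                if (store.create(task, session)) {
                    return task;
                }
            } catch (ConnectionLoss e) {
                if (session.equals(store.owners.get(task))) {
                    return task; // the create went through after all
                }
            }
        }
        return null; // no free task
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.dropNextResponse = true;
        String task = acquireTask(store, "p1", new String[] {"/task-0", "/task-1"});
        System.out.println(task + ", nodes owned: " + store.owners.size());
        store.closeSession("p1");
        System.out.println("nodes after close: " + store.owners.size());
    }
}
```

With the real client, the ownership check might compare Stat.getEphemeralOwner() against ZooKeeper.getSessionId(), and the check itself would need to cope with a further connection loss; both points are left out of this sketch.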

Does either of these options make sense? Also, as a reference, here is a wiki page about error
handling with ZooKeeper.
> Sometimes one process node owns 2 tasks
> ---------------------------------------
>                 Key: S4-3
>                 URL:
>             Project: Apache S4
>          Issue Type: Bug
>            Reporter: Gavin Li
>         Attachments: s4_loscon_fix
> When using S4, we found that a process node sometimes ends up owning 2 tasks. After some
> investigation, it seems that the handling of ConnectionLossException when creating the
> ephemeral node is problematic. Sometimes, when the response from the ZooKeeper server times out,
> zookeeper.create() fails with ConnectionLossException even though the creation request might
> already have been sent to the server (see
> line 830). From our logs, this is the case we ran into.
> Maybe we should handle it the way HBase handles it (,
> and simply exit the process on that exception so that the whole process restarts.
> To be clearer, what happened was this: a process node called zookeeper.create() to acquire
> a task, and the request was successfully sent to the ZooKeeper server, but the ZooKeeper IO loop
> timed out before the response came, so zookeeper.create() failed with ConnectionLossException.
> The process node then ignored this exception and tried to acquire another task, so it ended up
> with 2 tasks.
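
The sequence described in the report can be reproduced with a small simulation. This is a sketch under assumptions, not the S4 code: FakeStore, ConnectionLoss and naiveAcquire are hypothetical names, and the in-memory map only imitates a server that applies a create whose reply never reaches the client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: an in-memory imitation of the failure, not the S4 or ZooKeeper code.
class DoubleClaimSketch {

    /** Hypothetical stand-in for ZooKeeper's ConnectionLossException. */
    static class ConnectionLoss extends Exception {}

    /** Fake server state: task path -> owning process. */
    static class FakeStore {
        final Map<String, String> owners = new LinkedHashMap<>();
        boolean dropNextResponse = false; // the next reply times out on the client side

        boolean create(String path, String owner) throws ConnectionLoss {
            if (owners.containsKey(path)) {
                return false;
            }
            owners.put(path, owner); // the server applies the request
            if (dropNextResponse) {
                dropNextResponse = false;
                throw new ConnectionLoss(); // but the client only sees a failure
            }
            return true;
        }

        int countOwnedBy(String owner) {
            int count = 0;
            for (String o : owners.values()) {
                if (o.equals(owner)) {
                    count++;
                }
            }
            return count;
        }
    }

    /** Acquisition loop that ignores connection loss and simply tries the next task. */
    static String naiveAcquire(FakeStore store, String owner, String[] tasks) {
        for (String task : tasks) {
            try {
                if (store.create(task, owner)) {
                    return task;
                }
            } catch (ConnectionLoss e) {
                // Ignored: the loop moves on although the node may already exist.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.dropNextResponse = true;
        String task = naiveAcquire(store, "p1", new String[] {"/task-0", "/task-1"});
        System.out.println("returned " + task + ", owns " + store.countOwnedBy("p1"));
    }
}
```

In this simulation the process believes it holds only the second task, while the fake server records it as the owner of both.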
