incubator-s4-dev mailing list archives

From "Gavin Li (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (S4-3) Sometimes one process node owns 2 tasks
Date Fri, 14 Oct 2011 10:48:11 GMT

    [ https://issues.apache.org/jira/browse/S4-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127425#comment-13127425 ]

Gavin Li commented on S4-3:
---------------------------

Actually I have some concerns about the way we check the existence and owner of the znode.
The ConnectionLossException is raised when the SendThread in the ZooKeeper client finds that
the response has timed out. Suppose the znode creation (including the proposal, ack, and
commit on the ZooKeeper servers) takes a long time: the SendThread considers it timed out,
zookeeper.create() fails with ConnectionLossException, and we then read that znode to see
whether it exists and who owns it. There is a chance that when the read request is served by
the ZooKeeper server, the creation is still in progress, so the read does not find the znode,
yet the creation may still succeed after the read request is served. Since the read directly
checks only the state on the server the client is connected to and involves no consensus
round, it should complete faster than the create. Do you think this can happen?
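To make the timing concrete, here is a stdlib-only simulation of the race described above. It does not use the real ZooKeeper client; the timeouts, the znode flag, and the printed messages are all illustrative stand-ins for the SendThread timeout, the server-side commit, and the follow-up existence check.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Simulation of the read-after-failed-create race (no real ZooKeeper calls).
// The server commits the create AFTER the client's timeout fires, so the
// follow-up existence check sees "no znode" even though creation succeeds.
class CreateRaceSimulation {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService server = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean znodeExists = new AtomicBoolean(false);

        // The server has received the create request, but the consensus round
        // (proposal, ack, commit) takes 200 ms to land.
        server.schedule(() -> znodeExists.set(true), 200, TimeUnit.MILLISECONDS);

        // The client-side SendThread gives up after 50 ms and surfaces the
        // failure (stand-in for ConnectionLossException).
        Thread.sleep(50);
        System.out.println("create() failed: connection loss");

        // The follow-up read is answered directly by the connected server with
        // no consensus round, so it returns before the commit lands.
        System.out.println("exists() -> " + znodeExists.get()); // false: the race

        // Later, the original create commits anyway.
        Thread.sleep(300);
        System.out.println("exists() -> " + znodeExists.get()); // true

        server.shutdown();
    }
}
```

Under these (illustrative) timings the existence check reports false while the create ultimately succeeds, which is exactly the window that could let a process node acquire a second task.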

So I think closing the session may be safer. It is also a little more complicated to
implement: besides calling zookeeper.close(), we need to construct a new instance of the
ZooKeeper class to create a new session, which involves more code change. I guess that's why
both HBase and Hedwig choose to simply let the process exit and restart.

What do you think?
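The recovery options under discussion could be sketched as a small policy function. This is a hedged sketch only: the FailureKind and RecoveryAction enums and the decide() helper are illustrative stand-ins, not S4, HBase, or ZooKeeper APIs, and the real KeeperException codes are not used here.

```java
// Sketch of the recovery policy discussed above (illustrative names only).
class ConnectionLossPolicy {
    enum FailureKind { CONNECTION_LOSS, SESSION_EXPIRED, OTHER }
    enum RecoveryAction { RESTART_PROCESS, REOPEN_SESSION, PROPAGATE }

    // HBase/Hedwig-style policy: on connection loss the create may or may not
    // have reached the server, so restarting the whole process is the simplest
    // safe reaction. Session expiry would require constructing a brand-new
    // ZooKeeper instance, since a ZooKeeper handle cannot be reused after
    // close() -- the extra code change mentioned above.
    static RecoveryAction decide(FailureKind kind) {
        switch (kind) {
            case CONNECTION_LOSS: return RecoveryAction.RESTART_PROCESS;
            case SESSION_EXPIRED: return RecoveryAction.REOPEN_SESSION;
            default:              return RecoveryAction.PROPAGATE;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(FailureKind.CONNECTION_LOSS)); // RESTART_PROCESS
        System.out.println(decide(FailureKind.SESSION_EXPIRED)); // REOPEN_SESSION
    }
}
```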
                
> Sometimes one process node owns 2 tasks
> ---------------------------------------
>
>                 Key: S4-3
>                 URL: https://issues.apache.org/jira/browse/S4-3
>             Project: Apache S4
>          Issue Type: Bug
>            Reporter: Gavin Li
>            Assignee: Gavin Li
>         Attachments: s4_loscon_fix
>
>
> When using S4, we found that sometimes one process node ends up owning 2 tasks. I did
some investigation, and it seems that the handling of ConnectionLossException when creating
the ephemeral node is problematic. Sometimes when the response from the ZooKeeper server
times out, zookeeper.create() fails with ConnectionLossException even though the creation
request may already have been sent to the server (see http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java
line 830). From our logs this is the case we ran into.
> Maybe we should handle it the way HBase does (http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java?view=markup):
simply exit the process on that exception and let the whole process restart.
> To be clearer, what happened was: a process node called zookeeper.create() to acquire
a task; the request was successfully sent to the ZooKeeper server, but the ZooKeeper IO loop
timed out before the response arrived, so zookeeper.create() failed with ConnectionLossException.
The process node then ignored this exception and tried to acquire another task, ending up
with 2 tasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
