incubator-s4-dev mailing list archives

From "kishore gopalakrishna (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (S4-3) Sometimes one process node owns 2 tasks
Date Fri, 14 Oct 2011 16:48:12 GMT

    [ https://issues.apache.org/jira/browse/S4-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127685#comment-13127685
] 

kishore gopalakrishna commented on S4-3:
----------------------------------------

Flavio, closing the session would work, but it is more complicated than checking whether the
pNode exists. Closing the session may also become a bad idea if we add more ephemeral nodes in
other parts of the code. So the right thing to do is to check that the pNode exists and compare
its ephemeralOwner with our session id.

Gavin, do you know how the process restarts in HBase/Hedwig? I think they have a daemon process
running that restarts a process when it goes down, but we don't have such scripts. Also, this is
not guaranteed to happen only during startup: it can happen any time a pNode is created, which
is either at startup or when another node dies and a standby picks up the task.

As Flavio mentioned, read-after-write is guaranteed, so we should be fine checking for pNode
existence.
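
The check described above can be sketched roughly as follows. This is a hypothetical illustration, not S4's actual code; it uses a tiny in-memory stand-in for the ZooKeeper client, since the real client (`zk.create()`, `zk.exists()`, comparing `Stat.getEphemeralOwner()` against `zk.getSessionId()`) needs a running ensemble:

```java
import java.util.HashMap;
import java.util.Map;

public class TaskAcquirer {
    /** Minimal in-memory stand-in for the ZooKeeper state this sketch needs. */
    static class FakeZk {
        final Map<String, Long> ephemeralOwner = new HashMap<>(); // path -> creator session id
        final long sessionId;
        boolean dropNextResponse; // simulate ConnectionLoss after the server applied the create

        FakeZk(long sessionId) { this.sessionId = sessionId; }

        /** Returns false when the client saw no success: connection loss or node already exists. */
        boolean create(String path) {
            boolean applied = !ephemeralOwner.containsKey(path);
            if (applied) {
                ephemeralOwner.put(path, sessionId); // the server applies the create...
            }
            boolean lost = dropNextResponse;
            dropNextResponse = false; // ...but the client may never see the response
            return applied && !lost;
        }

        /** exists() plus ephemeralOwner comparison: did *this* session create the node? */
        boolean ownedByMe(String path) {
            Long owner = ephemeralOwner.get(path);
            return owner != null && owner == sessionId;
        }
    }

    /** On an uncertain create, re-check ownership instead of blindly trying another task. */
    static boolean acquireTask(FakeZk zk, String path) {
        if (zk.create(path)) {
            return true;
        }
        // Read-after-write is guaranteed within a session, so this check is reliable.
        return zk.ownedByMe(path);
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk(42L);
        zk.dropNextResponse = true; // the failure mode from the bug report
        System.out.println(acquireTask(zk, "/tasks/task-0")); // prints true: the create did land
    }
}
```

The key point is that the recovery path consults the server's state rather than the client's last exception, so a create that silently succeeded is not acquired twice.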
                
> Sometimes one process node owns 2 tasks
> ---------------------------------------
>
>                 Key: S4-3
>                 URL: https://issues.apache.org/jira/browse/S4-3
>             Project: Apache S4
>          Issue Type: Bug
>            Reporter: Gavin Li
>            Assignee: Gavin Li
>         Attachments: s4_loscon_fix
>
>
> When using S4, we found that sometimes one process node ends up owning 2 tasks. I did
some investigation, and it seems the handling of ConnectionLossException when creating the
ephemeral node is problematic. Sometimes, when the response from the ZooKeeper server times out,
zookeeper.create() fails with ConnectionLossException even though the creation request may
already have reached the server (see http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java
line 830). From our logs this is the case we ran into.
> Maybe we should handle it the way HBase does (http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java?view=markup):
simply exit the process on that exception and let the whole process restart.
> To be clearer, what happened was: a process node called zookeeper.create() to acquire
a task; the request was successfully sent to the ZooKeeper server, but the ZooKeeper IO loop
timed out before the response arrived, so zookeeper.create() failed with ConnectionLossException.
The process node then ignored this exception and tried to acquire another task, ending up with
2 tasks.
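
The failure sequence in the report can be reproduced in miniature. This is a hypothetical simulation (again with an in-memory stand-in for the ZooKeeper server, not S4's real code), showing how ignoring ConnectionLossException and moving on to the next task leaves one session owning both:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DoubleAcquireDemo {
    // In-memory stand-in: server-side ephemeral nodes, path -> owner session id.
    static Map<String, Long> server = new HashMap<>();
    static long session = 1L;
    static boolean dropNext = true; // lose the next create() response in flight

    /** Buggy client view: returns false on "ConnectionLoss" even if the create was applied. */
    static boolean buggyCreate(String path) {
        if (!server.containsKey(path)) {
            server.put(path, session); // the server applies the create regardless
        }
        if (dropNext) {
            dropNext = false;
            return false; // client times out and never learns the create succeeded
        }
        return true;
    }

    public static void main(String[] args) {
        // The flow from the report: the failure on task-0 is ignored and
        // the process node just moves on to the next task.
        List<String> tasks = Arrays.asList("/tasks/task-0", "/tasks/task-1");
        for (String t : tasks) {
            if (buggyCreate(t)) break; // "acquired", as far as the client knows
        }
        long owned = server.values().stream().filter(s -> s == session).count();
        System.out.println(owned); // prints 2: one process owns both tasks
    }
}
```

Both proposed remedies attack the same gap: either re-check ownership after the uncertain create, or exit so the session's ephemeral nodes are cleaned up on restart.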

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
