incubator-hama-dev mailing list archives

From "ChiaHung Lin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-387) Advanced Barrier Synchronization
Date Thu, 15 Sep 2011 01:06:08 GMT

    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105044#comment-13105044
] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

If I am correct, it looks like we originally did not handle KeeperException.NodeExistsException,
which is thrown when the proposed znode has already been created. Several GroomServers start
creating znodes (e.g. JobId/superstep/TaskId) on ZooKeeper at about the same time, so 2 (or
more) BSPPeers can end up writing the same znode in a check-then-act race.
For example, 2 BSPPeers simultaneously check whether the znode path exists (zk.exists(path)),
and both decide to create it (zk.create(path...)) because the Stat returned is null,
indicating that no znode exists. One BSPPeer writes faster than the other, so the second
BSPPeer fails to create the znode because it already exists. As a result, all computation
hangs because `list.size() < jobConf.getNumBspTask()' stays true in the while loop.
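
To illustrate the race, here is a minimal sketch of the "create and tolerate an existing node"
pattern. It does not use the real ZooKeeper client: FakeZk, the local NodeExistsException,
ensureNode(), race() and the path "/bsp/job_0001/0" are all hypothetical stand-ins I made up
so the example runs on its own. With the real client, the exception to catch would be
KeeperException.NodeExistsException, and this is not necessarily how the patch should do it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class BarrierNodeSketch {

    /** Local stand-in for KeeperException.NodeExistsException (assumption, not the real class). */
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) { super("node exists: " + path); }
    }

    /** Hypothetical in-memory stand-in for the zk.exists / zk.create calls. */
    static class FakeZk {
        private final ConcurrentHashMap<String, byte[]> nodes = new ConcurrentHashMap<>();

        boolean exists(String path) { return nodes.containsKey(path); }

        void create(String path, byte[] data) throws NodeExistsException {
            // putIfAbsent is atomic: exactly one caller creates, the others get the exception.
            if (nodes.putIfAbsent(path, data) != null) {
                throw new NodeExistsException(path);
            }
        }
    }

    /** Creates the node without a prior exists() check; returns true if this caller created it. */
    static boolean ensureNode(FakeZk zk, String path) {
        try {
            zk.create(path, new byte[0]);
            return true;
        } catch (NodeExistsException e) {
            // Another peer won the race. The node is there, so this is not a failure.
            return false;
        }
    }

    /** Lets several peers race to create the same path; returns how many actually created it. */
    static int race(FakeZk zk, String path, int peers) {
        AtomicInteger creators = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[peers];
        for (int i = 0; i < peers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // release all peers at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (ensureNode(zk, path)) {
                    creators.incrementAndGet();
                }
            });
            threads[i].start();
        }
        start.countDown();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return creators.get();
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        System.out.println("creators = " + race(zk, "/bsp/job_0001/0", 4));
    }
}
```

The point of the sketch is that the losing peers treat "node already exists" as success and
continue to the barrier, instead of aborting and leaving the while loop waiting forever.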

As for the ArrayIndexOutOfBoundsException, it seems that the peerName parameter passed to
BSPPeer.send() is malformed. It should be encoded as host:port, because getAddress() splits
peerName on `:' into an array.
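
A small sketch of how such a name could be validated before use. In Java, "node1".split(":")
yields a one-element array, so reading index 1 throws ArrayIndexOutOfBoundsException.
PeerNameSketch and parsePeerName() are hypothetical names of mine, not the existing
getAddress() code, and the sample value "node1:61000" is made up.

```java
import java.net.InetSocketAddress;

final class PeerNameSketch {

    /**
     * Parses a peer name of the form host:port and fails with a descriptive
     * IllegalArgumentException instead of an ArrayIndexOutOfBoundsException.
     */
    static InetSocketAddress parsePeerName(String peerName) {
        if (peerName == null) {
            throw new IllegalArgumentException("peerName is null");
        }
        int sep = peerName.lastIndexOf(':');
        // Reject a missing separator, an empty host, or an empty port.
        if (sep <= 0 || sep == peerName.length() - 1) {
            throw new IllegalArgumentException("expected host:port but got '" + peerName + "'");
        }
        String host = peerName.substring(0, sep);
        int port;
        try {
            port = Integer.parseInt(peerName.substring(sep + 1));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("port is not a number in '" + peerName + "'", e);
        }
        if (port < 0 || port > 65535) {
            throw new IllegalArgumentException("port out of range in '" + peerName + "'");
        }
        // Unresolved on purpose: the sketch only checks the format and does no DNS lookup.
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = parsePeerName("node1:61000");
        System.out.println(addr.getHostString() + " " + addr.getPort());
    }
}
```

Failing early with the offending string in the message would also show which caller hands
send() a bad peerName.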


> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch,
>                      sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver
> in the future.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
