hama-dev mailing list archives

From "Suraj Menon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-633) Fix CI Failure
Date Mon, 20 Aug 2012 20:17:38 GMT

    [ https://issues.apache.org/jira/browse/HAMA-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438158#comment-13438158 ]

Suraj Menon commented on HAMA-633:
----------------------------------

I believe an ACK will not help here. I think we should change the MessageManager's transfer API
call to include the intended superstep number as an argument. I don't want to commit to this
solution until I am done brainstorming on the new Superstep API changes that I want to propose.
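
To illustrate the idea of passing the intended superstep number with each transfer, here is a minimal sketch. It is not Hama's actual MessageManager: the class name SuperstepTaggedInbox, the methods transfer and drain, and the use of String as the message type are all hypothetical, chosen only to show how a receiver could keep early-arriving messages apart from those of the current superstep.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch only; this is not Hama's MessageManager API. */
public class SuperstepTaggedInbox {
  // Messages grouped by the superstep in which they are meant to be consumed.
  private final Map<Long, Queue<String>> bySuperstep = new HashMap<>();

  /** Receiver side of a transfer call that carries the intended superstep. */
  public synchronized void transfer(long intendedSuperstep, String message) {
    bySuperstep.computeIfAbsent(intendedSuperstep, k -> new ArrayDeque<>()).add(message);
  }

  /** Removes and returns only the messages addressed to the given superstep. */
  public synchronized List<String> drain(long superstep) {
    Queue<String> queue = bySuperstep.remove(superstep);
    return queue == null ? new ArrayList<>() : new ArrayList<>(queue);
  }

  public static void main(String[] args) {
    SuperstepTaggedInbox inbox = new SuperstepTaggedInbox();
    inbox.transfer(4, "vertex message");
    // Arrives before superstep 4 has been drained, but is tagged for superstep 5.
    inbox.transfer(5, "aggregator update");
    System.out.println(inbox.drain(4)); // [vertex message]
    System.out.println(inbox.drain(5)); // [aggregator update]
  }
}
```

With a tag like this, a message that arrives early stays in its own bucket instead of being read together with the messages of the superstep that is currently being processed.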

Side note: coming back to ACK, we are already using RPC to send messages, so an ACK would be
overkill. We should introduce or use an ACK only if the sender of a message wants to ensure that the
receiver has completely processed the message it received.
                
> Fix CI Failure
> --------------
>
>                 Key: HAMA-633
>                 URL: https://issues.apache.org/jira/browse/HAMA-633
>             Project: Hama
>          Issue Type: Bug
>          Components: bsp core
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>             Fix For: 0.6.0
>
>
> The current nightly build fails because it seems to read messages that actually belong to the previous superstep.
> This is also reproducible in the local runner, so it is not a problem of the specific RPC implementations. The problem could also be in the GraphJobRunner.
> It shows up as a NullPointerException when a non-master task gets an aggregation message (which actually belongs only to the master).
> {noformat}
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.4572019638123739
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.44247448197562855
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x13936d5d8cc0002 type:create cxid:0x3de zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/bsp/job_201208172305_0001/sync/51 Error:KeeperErrorCode = NodeExists for /bsp/job_201208172305_0001/sync/51
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x13936d5d8cc0002 type:create cxid:0x3e8 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/bsp/job_201208172305_0001/sync/51/ready Error:KeeperErrorCode = NodeExists for /bsp/job_201208172305_0001/sync/51/ready
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.457610574551534
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.2675231554874198
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 11
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ NULL! false janus.apache.org:61001
> 12/08/17 23:05:52 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
> java.lang.NullPointerException
>         at org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
>         at org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
>         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Processed session termination for sessionid: 0x13936d5d8cc0002
> {noformat}
> It is very difficult to track this down. My ideas were:
> - It changes the host because of fault tolerance (counterarguments: it's turned off, and the port is smaller than the other one)
> - Messaging is broken (would also explain why PageRank does not converge anymore)
> Some more info:
> I know that this happens when the master task sends a message with the updated aggregator values to every slave (line 239). This single message should then be consumed around line 246.
> But it still remains in the buffer and will be consumed after all computation, in line 209 in parseMessages.
> Even clearing the buffer does not fix it.
> The worst problem is that this is not reliably reproducible: the failure seems to happen in only two to three tenths of all builds. Seems like some really nasty edge case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
