incubator-giraph-dev mailing list archives

From "Zhiwei Gu (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (GIRAPH-154) Worker ports are not synched properly with its peers
Date Fri, 16 Mar 2012 01:10:39 GMT

     [ https://issues.apache.org/jira/browse/GIRAPH-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiwei Gu updated GIRAPH-154:
-----------------------------

    Attachment: GIRAPH-154.patch

Passed unit tests and grid tests.
                
> Worker ports are not synched properly with its peers
> ----------------------------------------------------
>
>                 Key: GIRAPH-154
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-154
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Zhiwei Gu
>            Assignee: Zhiwei Gu
>         Attachments: GIRAPH-154.patch
>
>
> When a worker tries multiple ports to set up the RPC server, the final port is not synced
> properly with its peer workers, so the peers send messages to the default port.
> Here are some logs:
> ############################################################################
> Base port: 34900
> ############################################################################
> ############################################################################
> log for worker 161:
> ############################################################################
> IPC Server handler 98 on 36061: starting
> BasicRPCCommunications: Started RPC communication server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:36061
with 100 handlers and 199 flush threads on bind attempt 1
> IPC Server handler 99 on 36061: starting
> setup: Registering health of this worker...
> getJobState: Job state already exists (/_hadoopBsp/job_201203130609_14838/_masterJobState)
> getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir
already exists!
> getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir
already exists!
> registerHealth: Created my health node for attempt=0, superstep=-1 with /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/gsta32085.tan.ygrid.yahoo.com_161
and workerInfo= Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061)
> process: partitionAssignmentsReadyChanged (partitions are assigned)
> startSuperstep: Ready for computation on superstep -1 since worker selection and vertex
range assignments are done in /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_partitionAssignments
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 0 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 1 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 2 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 3 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 4 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 5 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 6 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 7 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 8 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 9 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 10 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 11 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 12 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 13 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 14 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 15 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 16 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 17 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 18 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 19 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 20 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 21 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 22 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 23 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 24 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 25 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 26 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 27 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 28 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 29 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 30 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 31 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 32 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 33 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 34 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 35 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 36 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 37 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 38 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 39 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 40 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 41 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 42 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 43 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 44 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 45 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 46 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 47 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 48 time(s).
> Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already
tried 49 time(s).
> PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException:
Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception:
java.net.ConnectException: Connection refused
> connectAllRPCProxys: Failed on attempt 0 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com,
MRpartition=161, port=35061),prev=null,ckpt_file=null)
> java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061
failed on connection exception: java.net.ConnectException: Connection refused
> 	at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1071)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
> 	at $Proxy8.getProtocolVersion(Unknown Source)
> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
> 	at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
> 	at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
> 	at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
> 	at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
> 	at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
> 	at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
> 	at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
> 	at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
> 	at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
> 	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.net.ConnectException: Connection refused
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
> 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
> 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
> 	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
> 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1046)
> 	... 25 more
> ############################################################################
> log for worker 154
> ############################################################################
> PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException:
Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception:
java.net.ConnectException: Connection refused
> connectAllRPCProxys: Failed on attempt 4 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com,
MRpartition=161, port=35061),prev=null,ckpt_file=null)
> java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061
failed on connection exception: java.net.ConnectException: Connection refused
> 	at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1071)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
> 	at $Proxy8.getProtocolVersion(Unknown Source)
> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
> 	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
> 	at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
> 	at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
> 	at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
> 	at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
> 	at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
> 	at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
> 	at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
> 	at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
> 	at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
> 	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.net.ConnectException: Connection refused
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
> 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
> 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
> 	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
> 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1046)
> 	... 25 more
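
For illustration only, here is a minimal sketch of the behaviour the issue asks for. It is not the content of GIRAPH-154.patch. The logs suggest that the default port is the base port plus the task partition (34900 + 161 = 35061) and that the server actually came up on 36061 on bind attempt 1, while the health node still advertised 35061. The sketch uses a plain java.net.ServerSocket instead of the Hadoop RPC server, and the class name, method names, and the per-attempt port increment are all hypothetical. The point it shows is that the port published to peers should be the one the bind loop ended on, not the first one tried.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical sketch: advertise the port that was actually bound. */
final class PortSyncSketch {

    /** Holds the open socket together with the port peers must use. */
    static final class BoundServer {
        final ServerSocket socket;
        final int port;     // port the server really listens on
        final int attempt;  // zero-based bind attempt that succeeded

        BoundServer(ServerSocket socket, int port, int attempt) {
            this.socket = socket;
            this.port = port;
            this.attempt = attempt;
        }
    }

    /** First port a worker tries (assumed: base port + task partition). */
    static int defaultPort(int basePort, int taskId) {
        return basePort + taskId;
    }

    /**
     * Tries the default port first and moves on by {@code increment}
     * whenever the port is already in use. The increment is an assumption.
     */
    static BoundServer bindWithRetries(int basePort, int taskId,
                                       int increment, int maxAttempts)
            throws IOException {
        int port = defaultPort(basePort, taskId);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                ServerSocket socket = new ServerSocket(
                        port, 50, InetAddress.getLoopbackAddress());
                // Return the final port so the caller cannot fall back
                // to the default one by accident.
                return new BoundServer(socket, port, attempt);
            } catch (BindException e) {
                port += increment; // port taken, try the next candidate
            }
        }
        throw new BindException(
                "No free port after " + maxAttempts + " attempts");
    }

    /** Builds the string peers would read, using the bound port. */
    static String workerInfo(String hostname, int taskId, BoundServer server) {
        return "Worker(hostname=" + hostname
                + ", MRpartition=" + taskId
                + ", port=" + server.port + ")";
    }
}
```

With this shape, the worker info is derived from the BoundServer returned by the bind loop, so a retry that lands on a different port changes what peers see as well.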

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
