giraph-user mailing list archives

From Matthew Cornell <m...@matthewcornell.org>
Subject "Missing chosen worker" ERROR drills down to "end of stream exception" ("likely client has closed socket"). help!
Date Tue, 28 Oct 2014 15:01:38 GMT
Hi All,

I have a Giraph 1.0.0 job that failed, but I can't find any detail
about what actually went wrong. The master's log says:

> 2014-10-28 10:28:32,006 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=compute-0-0.wright, MRtaskID=1, port=30001) on superstep 4

OK, this seems to say that the worker on compute-0-0 failed in some
way, correct? The Ganglia pages show no noticeable OS-level differences
between the failed node and another identical compute node. In the
failed node's log I see two WARNs:

> 2014-10-28 10:28:19,560 WARN org.apache.giraph.bsp.BspService: process: Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent state:Disconnected type:None path:null
> 2014-10-28 10:28:19,560 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem with zookeeper, got event with path null, state Disconnected, event type None

OK, I guess there was a ZooKeeper issue. In the ZooKeeper log I find:

> 2014-10-28 10:28:14,917 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 0x149529c74de0a4d, likely client has closed socket
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>         at java.lang.Thread.run(Thread.java:745)

OK, so I guess the socket closure was the problem. But why did *that* happen?
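
To make the ordering of the three excerpts above easier to see, here is a minimal Python sketch that sorts them into one timeline. The timestamps and source names are copied from the excerpts; the messages are abbreviated by hand, the helper names `parse_ts` and `timeline` are made up for this illustration, and the result is only meaningful if the clocks on the three hosts are reasonably in sync (an assumption, not something the logs confirm).

```python
from datetime import datetime

# Timestamps copied from the log excerpts above; messages abbreviated by hand.
EVENTS = [
    ("master", "2014-10-28 10:28:32,006",
     "ERROR BspServiceMaster: Missing chosen worker"),
    ("worker", "2014-10-28 10:28:19,560",
     "WARN BspService: Disconnected from ZooKeeper"),
    ("zookeeper", "2014-10-28 10:28:14,917",
     "WARN NIOServerCnxn: caught end of stream exception"),
]


def parse_ts(text):
    """Parse a log4j-style timestamp such as '2014-10-28 10:28:32,006'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def timeline(events):
    """Return (offset_seconds, source, message) tuples, earliest event first."""
    ordered = sorted(events, key=lambda e: parse_ts(e[1]))
    start = parse_ts(ordered[0][1])
    return [
        (round((parse_ts(ts) - start).total_seconds(), 3), src, msg)
        for src, ts, msg in ordered
    ]


for offset, src, msg in timeline(EVENTS):
    print(f"+{offset:7.3f}s  {src:<9}  {msg}")
```

Under that clock assumption, the ZooKeeper warning comes first, the worker's disconnect WARNs follow about 4.6 seconds later, and the master's ERROR about 17.1 seconds after the ZooKeeper warning. That only gives the order of events; it does not explain why the socket closed.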

I could really use your help here!

Thank you,

matt


-- 
Matthew Cornell | matt@matthewcornell.org
