giraph-user mailing list archives

From Alessandro Presta <>
Subject Re: Giraph/Netty issues on a cluster
Date Wed, 13 Feb 2013 19:35:30 GMT
Hi Zachary,

Are you running one of the examples or your own code?
It seems to me that a call to edge.getValue() is returning null, which should never happen.
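
To illustrate what I mean, here is a minimal sketch in plain Java. It is not Giraph's actual code: SimpleWritable, LongValue and writeEdges are hypothetical stand-ins I made up for Hadoop's Writable and for a vertex write() method, and I am assuming the real serialization loop writes each edge value in a similar way. It only shows why one null edge value is enough to make serialization fail when a vertex is sent to another worker.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class NullEdgeValueSketch {
    /** Hypothetical stand-in for the write half of Hadoop's Writable. */
    public interface SimpleWritable {
        void write(DataOutput out) throws IOException;
    }

    /** Hypothetical edge value type holding a single long. */
    public static final class LongValue implements SimpleWritable {
        private final long value;

        public LongValue(long value) {
            this.value = value;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(value);
        }
    }

    /**
     * Writes the edge count, then (target id, edge value) for every edge.
     * This is an assumed shape for a vertex write(), not the real one.
     */
    public static byte[] writeEdges(Map<Long, SimpleWritable> edges) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(edges.size());
            for (Map.Entry<Long, SimpleWritable> edge : edges.entrySet()) {
                out.writeLong(edge.getKey());
                // Throws NullPointerException if the edge value is null.
                edge.getValue().write(out);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns true if serializing the edges fails on a null value. */
    public static boolean failsOnNullValue(Map<Long, SimpleWritable> edges) {
        try {
            writeEdges(edges);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<Long, SimpleWritable> edges = new LinkedHashMap<>();
        edges.put(1L, new LongValue(10L));
        edges.put(2L, null); // an edge that was added without a value
        System.out.println("fails on null value: " + failsOnNullValue(edges));
    }
}
```

If your edges are not meant to carry a value, the usual Hadoop idiom is to use NullWritable.get() as the value rather than a Java null, so it may be worth checking how your input format or vertex code creates its edges.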


From: Zachary Hanif <<>>
Reply-To: "<>" <<>>
Date: Wednesday, February 13, 2013 11:29 AM
To: "<>" <<>>
Subject: Giraph/Netty issues on a cluster

(How embarrassing! I forgot a subject header in a previous attempt to post this. Please reply
to this thread, not the other.)

Hi everyone,

I am having some odd issues when trying to run a Giraph 0.2 job across my CDH 3u3 cluster.
After building the jar and deploying it across the cluster, I notice a handful of
my nodes reporting the following error:

2013-02-13 17:47:43,341 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught:
Channel failed with remote address <EDITED_INTERNAL_DNS>/<>
    at org.apache.giraph.vertex.EdgeListVertexBase.write(
    at org.apache.giraph.partition.SimplePartition.write(
    at org.apache.giraph.comm.requests.SendVertexRequest.writeRequest(
    at org.apache.giraph.comm.requests.WritableRequest.write(
    at org.apache.giraph.comm.netty.handler.RequestEncoder.encode(
    at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(
    at org.jboss.netty.handler.execution.ExecutionHandler.handleDownstream(
    at org.apache.giraph.comm.netty.NettyClient.sendWritableRequest(
    at org.apache.giraph.comm.netty.NettyWorkerClient.sendWritableRequest(
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendPartitionRequest(
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.flush(
    at java.util.concurrent.FutureTask$Sync.innerRun(
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$

What would be causing this? All other Hadoop jobs run well on the cluster, and when the Giraph
job is run with only one worker, it completes without any issues. When run with any number
of workers >1, the above error occurs. I have referenced this post
where superficially similar issues were discussed, but the root cause appears to be different,
and suggested methods of resolution are not panning out.

As extra background, the 'remote address' changes, as the error cycles through my available
cluster nodes, and the failing workers do not seem to favor one physical machine over another.
Not all nodes present this issue, only a handful per job. Is there something simple that I
am missing?
