giraph-user mailing list archives

From Zachary Hanif <zh4...@gmail.com>
Subject Re: Giraph/Netty issues on a cluster
Date Wed, 13 Feb 2013 21:08:06 GMT
Well, this is a bit odd:

> 2013-02-13 20:58:45,740 INFO org.apache.giraph.worker.BspServiceWorker: loadInputSplits:
Using 1 thread(s), originally 1 threads(s) for 14 total splits.
> 2013-02-13 20:58:45,742 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty
without authentication.
> 2013-02-13 20:58:45,744 INFO org.apache.giraph.comm.SendPartitionCache: SendPartitionCache:
maxVerticesPerTransfer = 10000
> 2013-02-13 20:58:45,744 INFO org.apache.giraph.comm.SendPartitionCache: SendPartitionCache:
maxEdgesPerTransfer = 80000
> 2013-02-13 20:58:45,745 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty
without authentication.
> 2013-02-13 20:58:45,755 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty
without authentication.
> 2013-02-13 20:58:45,758 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty
without authentication.
> 2013-02-13 20:58:45,814 INFO org.apache.giraph.worker.InputSplitsCallable: call: Loaded
0 input splits in 0.07298644 secs, (v=0, e=0) 0.0 vertices/sec, 0.0 edges/sec
> 2013-02-13 20:58:45,817 INFO org.apache.giraph.comm.netty.NettyClient: waitAllRequests:
Finished all requests. MBytes/sec sent = 0, MBytes/sec received = 0, MBytesSent = 0, MBytesReceived
= 0, ave sent req MBytes = 0, ave received req MBytes = 0, secs waited = 8.303
> 2013-02-13 20:58:45,817 INFO org.apache.giraph.worker.BspServiceWorker: setup: Finally
loaded a total of (v=0, e=0)
>
> What would cause this? I imagine that it's related to my overall problem.

On Wed, Feb 13, 2013 at 3:31 PM, Zachary Hanif <zh4990@gmail.com> wrote:

> It is my own code. I'm staring at my VertexInputFormat class right now. It
> extends TextVertexInputFormat<Text, DoubleWritable, NullWritable,
> DoubleWritable>. I cannot imagine why a value would not be set for these
> vertices, but I'll drop in some code to more stringently ensure value
> creation (a rough sketch of what I mean is below).
>
> Why would this begin to fail on a distributed deployment (multiple
> workers) but not with a single worker? The dataset is identical between the
> two executions.
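>
> Something like this is the kind of stricter value creation I have in mind
> (a hypothetical helper rather than my actual reader code; the point is just
> that, with NullWritable as the edge value type, every edge should carry the
> NullWritable singleton and never a bare null):
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     import org.apache.hadoop.io.NullWritable;
>     import org.apache.hadoop.io.Text;
>
>     // Hypothetical edge-parsing helper for a custom TextVertexReader.
>     // Handing Giraph a null edge value does not fail here; it fails later,
>     // when the vertex is serialized to be sent to another worker.
>     public class EdgeValueParsing {
>       static Map<Text, NullWritable> parseEdges(String[] targetIds) {
>         Map<Text, NullWritable> edges = new HashMap<Text, NullWritable>();
>         for (String target : targetIds) {
>           // Always the singleton, never null.
>           edges.put(new Text(target), NullWritable.get());
>         }
>         return edges;
>       }
>     }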
>
>
> On Wed, Feb 13, 2013 at 2:35 PM, Alessandro Presta <alessandro@fb.com> wrote:
>
>>  Hi Zachary,
>>
>>  Are you running one of the examples or your own code?
>> It seems to me that a call to edge.getValue() is returning null, which
>> should never happen.
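>>
>>  For instance (a toy, self-contained illustration, not the actual Giraph
>> code), this is the pattern that produces a NullPointerException like the
>> one in your trace: a NullWritable.get() edge value serializes fine, while
>> a value that was left as null blows up the moment the vertex is serialized
>> to be sent to another worker, which is what your stack trace shows:
>>
>>     import java.io.ByteArrayOutputStream;
>>     import java.io.DataOutputStream;
>>     import java.io.IOException;
>>
>>     import org.apache.hadoop.io.NullWritable;
>>
>>     public class NullEdgeValueDemo {
>>       public static void main(String[] args) throws IOException {
>>         DataOutputStream out =
>>             new DataOutputStream(new ByteArrayOutputStream());
>>
>>         NullWritable ok = NullWritable.get();
>>         ok.write(out);              // fine: NullWritable writes nothing
>>
>>         NullWritable broken = null; // an edge value that was never set
>>         broken.write(out);          // NullPointerException, as in the trace
>>       }
>>     }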
>>
>>  Alessandro
>>
>>   From: Zachary Hanif <zh4990@gmail.com>
>> Reply-To: "user@giraph.apache.org" <user@giraph.apache.org>
>> Date: Wednesday, February 13, 2013 11:29 AM
>> To: "user@giraph.apache.org" <user@giraph.apache.org>
>> Subject: Giraph/Netty issues on a cluster
>>
>>  (How embarrassing! I forgot a subject header in a previous attempt to
>> post this. Please reply to this thread, not the other.)
>>
>> Hi everyone,
>>
>> I am having some odd issues when trying to run a Giraph 0.2 job across my
>> CDH 3u3 cluster. After building the jar and deploying it across the
>> cluster, I start to notice a handful of my nodes reporting the following
>> error:
>>
>>  2013-02-13 17:47:43,341 WARN
>>> org.apache.giraph.comm.netty.handler.ResponseClientHandler:
>>> exceptionCaught: Channel failed with remote address <EDITED_INTERNAL_DNS>/
>>> 10.2.0.16:30001
>>> java.lang.NullPointerException
>>>     at
>>> org.apache.giraph.vertex.EdgeListVertexBase.write(EdgeListVertexBase.java:106)
>>>     at
>>> org.apache.giraph.partition.SimplePartition.write(SimplePartition.java:169)
>>>     at
>>> org.apache.giraph.comm.requests.SendVertexRequest.writeRequest(SendVertexRequest.java:71)
>>>     at
>>> org.apache.giraph.comm.requests.WritableRequest.write(WritableRequest.java:127)
>>>     at
>>> org.apache.giraph.comm.netty.handler.RequestEncoder.encode(RequestEncoder.java:96)
>>>     at
>>> org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:61)
>>>     at
>>> org.jboss.netty.handler.execution.ExecutionHandler.handleDownstream(ExecutionHandler.java:185)
>>>     at org.jboss.netty.channel.Channels.write(Channels.java:712)
>>>     at org.jboss.netty.channel.Channels.write(Channels.java:679)
>>>     at
>>> org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:246)
>>>     at
>>> org.apache.giraph.comm.netty.NettyClient.sendWritableRequest(NettyClient.java:655)
>>>     at
>>> org.apache.giraph.comm.netty.NettyWorkerClient.sendWritableRequest(NettyWorkerClient.java:144)
>>>     at
>>> org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(NettyWorkerClientRequestProcessor.java:425)
>>>     at
>>> org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendPartitionRequest(NettyWorkerClientRequestProcessor.java:195)
>>>     at
>>> org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.flush(NettyWorkerClientRequestProcessor.java:365)
>>>     at
>>> org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:190)
>>>     at
>>> org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:58)
>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>     at java.lang.Thread.run(Thread.java:722)
>>>
>>
>> What would be causing this? All other Hadoop jobs run well on the
>> cluster, and when the Giraph job is run with only one worker, it completes
>> without any issues. When run with any number of workers >1, the above error
>> occurs. I have referenced this post
>> <http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCAEQ6y7ShC4in-L73nR7aBizsPMRRfw9sfa8TMi3MyqML8VK0LQ@mail.gmail.com%3E>,
>> where superficially similar issues were discussed, but the root cause
>> appears to be different, and the suggested methods of resolution are not
>> panning out.
>>
>> As extra background, the 'remote address' changes as the error cycles
>> through my available cluster nodes, and the failing workers do not seem to
>> favor one physical machine over another. Not all nodes present this issue,
>> only a handful per job. Is there something simple that I am missing?
>>
>
>
