From: Alessandro Presta <alessandro@fb.com>
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: Re: Giraph/Netty issues on a cluster
Date: Wed, 13 Feb 2013 19:35:30 +0000
Hi Zachary,

Are you running one of the examples or your own code?
It seems to me that a call to edge.getValue() is returning null, which should never happen.
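For illustration, the failing write path looks roughly like this (a simplified sketch from memory, not the exact EdgeListVertexBase source):

    // Simplified sketch (not the exact Giraph source) of the vertex write
    // path; out is a java.io.DataOutput. Every edge value is serialized via
    // Writable.write(), so a null return from edge.getValue() throws the
    // NullPointerException seen at EdgeListVertexBase.write().
    public void write(DataOutput out) throws IOException {
        getId().write(out);
        getValue().write(out);
        out.writeInt(getNumEdges());
        for (Edge<I, E> edge : getEdges()) {
            edge.getTargetVertexId().write(out);
            edge.getValue().write(out);  // NPE here if the edge value is null
        }
    }

So the first thing I'd check is whether your input format (or your compute code, if it mutates edges) can ever produce an edge with a null value.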

Alessandro

From: Zachary Hanif <zh4990@gmail.com>
Reply-To: "user@giraph.apache.org" <user@giraph.apache.org>
Date: Wednesday, February 13, 2013 11:29 AM
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: Giraph/Netty issues on a cluster

(How embarrassing! I forgot a subject header in a previous attempt to post this. Please reply to this thread, not the other.)

Hi everyone,

I am having some odd issues when trying to run a Giraph 0.2 job across my CDH 3u3 cluster. After building the jar and deploying it across the cluster, I start to notice a handful of my nodes reporting the following error:
2013-02-13 17:47:43,341 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address <EDITED_INTERNAL_DNS>/10.2.0.16:30001
java.lang.NullPointerException
    at org.apache.giraph.vertex.EdgeListVertexBase.write(EdgeListVertexBase.java:106)
    at org.apache.giraph.partition.SimplePartition.write(SimplePartition.java:169)
    at org.apache.giraph.comm.requests.SendVertexRequest.writeRequest(SendVertexRequest.java:71)
    at org.apache.giraph.comm.requests.WritableRequest.write(WritableRequest.java:127)
    at org.apache.giraph.comm.netty.handler.RequestEncoder.encode(RequestEncoder.java:96)
    at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:61)
    at org.jboss.netty.handler.execution.ExecutionHandler.handleDownstream(ExecutionHandler.java:185)
    at org.jboss.netty.channel.Channels.write(Channels.java:712)
    at org.jboss.netty.channel.Channels.write(Channels.java:679)
    at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:246)
    at org.apache.giraph.comm.netty.NettyClient.sendWritableRequest(NettyClient.java:655)
    at org.apache.giraph.comm.netty.NettyWorkerClient.sendWritableRequest(NettyWorkerClient.java:144)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(NettyWorkerClientRequestProcessor.java:425)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendPartitionRequest(NettyWorkerClientRequestProcessor.java:195)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.flush(NettyWorkerClientRequestProcessor.java:365)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:190)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:58)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

What would be causing this? All other Hadoop jobs run well on the cluster, and when the Giraph job is run with only one worker, it completes without any issues. When run with any number of workers >1, the above error occurs. I have referenced this post where superficially similar issues were discussed, but the root cause appears to be different, and suggested methods of resolution are not panning out.

As extra background, the 'remote address' changes as the error cycles through my available cluster nodes, and the failing workers do not seem to favor one physical machine over another. Not all nodes present this issue, only a handful per job. Is there something simple that I am missing?
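In case it helps, this is the kind of defensive check I can add where my input format builds edges (a sketch; targetId and edgeValue are placeholder names from my own reader code):

    // Hypothetical sanity check in my own vertex reader, before each edge
    // is added, to surface a bad value at load time rather than during
    // network serialization:
    if (edgeValue == null) {
        throw new IllegalStateException(
            "null edge value for target vertex " + targetId);
    }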