From: Alessandro Presta <alessandro@fb.com>
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: Re: Giraph/Netty issues on a cluster
Date: Wed, 13 Feb 2013 19:35:30 +0000
Hi Zachary,

Are you running one of the examples or your own code?
It seems to me that a call to edge.getValue() is returning null, which should never happen.
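For illustration, the failing write path looks roughly like this (a simplified sketch from memory, not the exact EdgeListVertexBase source):

    // Simplified sketch (not the exact Giraph source) of the vertex write
    // path; out is a java.io.DataOutput. Every edge value is serialized via
    // Writable.write(), so a null return from edge.getValue() throws the
    // NullPointerException seen at EdgeListVertexBase.write().
    public void write(DataOutput out) throws IOException {
        getId().write(out);
        getValue().write(out);
        out.writeInt(getNumEdges());
        for (Edge<I, E> edge : getEdges()) {
            edge.getTargetVertexId().write(out);
            edge.getValue().write(out);  // NPE here if the edge value is null
        }
    }

So the first thing I'd check is whether your input format (or your compute code, if it mutates edges) can ever produce an edge with a null value.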

Alessandro

From: Zachary Hanif <zh4990@gmail.com>
Reply-To: "user@giraph.apache.org" <user@giraph.apache.org>
Date: Wednesday, February 13, 2013 11:29 AM
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: Giraph/Netty issues on a cluster

(How embarrassing! I forgot a subject header in a previous attempt to post this. Please reply to this thread, not the other.)

Hi everyone,

I am having some odd issues when trying to run a Giraph 0.2 job across my CDH 3u3 cluster. After building the jar and deploying it across the cluster, I start to notice a handful of my nodes reporting the following error:
2013-02-13 17:47:43,341 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address <EDITED_INTERNAL_DNS>/10.2.0.16:30001
java.lang.NullPointerException
    at org.apache.giraph.vertex.EdgeListVertexBase.write(EdgeListVertexBase.java:106)
    at org.apache.giraph.partition.SimplePartition.write(SimplePartition.java:169)
    at org.apache.giraph.comm.requests.SendVertexRequest.writeRequest(SendVertexRequest.java:71)
    at org.apache.giraph.comm.requests.WritableRequest.write(WritableRequest.java:127)
    at org.apache.giraph.comm.netty.handler.RequestEncoder.encode(RequestEncoder.java:96)
    at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:61)
    at org.jboss.netty.handler.execution.ExecutionHandler.handleDownstream(ExecutionHandler.java:185)
    at org.jboss.netty.channel.Channels.write(Channels.java:712)
    at org.jboss.netty.channel.Channels.write(Channels.java:679)
    at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:246)
    at org.apache.giraph.comm.netty.NettyClient.sendWritableRequest(NettyClient.java:655)
    at org.apache.giraph.comm.netty.NettyWorkerClient.sendWritableRequest(NettyWorkerClient.java:144)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(NettyWorkerClientRequestProcessor.java:425)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendPartitionRequest(NettyWorkerClientRequestProcessor.java:195)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.flush(NettyWorkerClientRequestProcessor.java:365)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:190)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:58)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

What would be causing this? All other Hadoop jobs run well on the cluster, and when the Giraph job is run with only one worker, it completes without any issues. When run with any number of workers >1, the above error occurs. I have referenced this post where superficially similar issues were discussed, but the root cause appears to be different, and suggested methods of resolution are not panning out.

As extra background, the 'remote address' changes as the error cycles through my available cluster nodes, and the failing workers do not seem to favor one physical machine over another. Not all nodes present this issue, only a handful per job. Is there something simple that I am missing?
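In case it helps, this is the kind of defensive check I can add where my input format builds edges (a sketch; targetId and edgeValue are placeholder names from my own reader code):

    // Hypothetical sanity check in my own vertex reader, before each edge
    // is added, to surface a bad value at load time rather than during
    // network serialization:
    if (edgeValue == null) {
        throw new IllegalStateException(
            "null edge value for target vertex " + targetId);
    }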