From: Yuanyuan Tian <ytian@us.ibm.com>
To: user@giraph.apache.org
Date: Fri, 29 Jun 2012 11:35:26 -0700
Subject: Re: wierd communication errors

Hi Avery,

I have got a better understanding of the problem now. I think I hit the scalability limit of Giraph. I was running 90 workers on a 16-node cluster (each node can run 7 concurrent mappers). So, on average each node ran 5 or 6 workers, with each worker maintaining 89 RPC connections. Plus, each worker in my job was sending a lot of messages in each iteration. As a result, the RPC connections got very unstable.
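The connection arithmetic is easy to sanity-check. The following is a standalone sketch, not Giraph code; the inputs (90 workers, 16 nodes) are just the numbers from this job:

```java
// Back-of-envelope check of the RPC connection counts described above.
// Standalone sketch; nothing here touches Giraph itself.
public class ConnectionMath {
    public static void main(String[] args) {
        int workers = 90;
        int nodes = 16;

        int perWorker = workers - 1;          // every worker talks to every other worker
        int total = workers * (workers - 1);  // directed connections across the whole job
        int maxWorkersPerNode = (int) Math.ceil((double) workers / nodes);

        System.out.println("connections per worker: " + perWorker);         // 89
        System.out.println("total connections:      " + total);             // 8010
        System.out.println("max workers per node:   " + maxWorkersPerNode); // 6
    }
}
```

With 90 workers that is over 8,000 long-lived connections across the job, on top of all the message traffic flowing over them.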
When I tried to reduce the number of workers to 40 and the number of concurrent mappers each node can run to 4, the job ran without any problem.

I think my experience revealed two limitations of the current Giraph:
- Scalability: suppose we have n workers in a Giraph job; then each worker will maintain (n-1) RPC connections, for n*(n-1) RPC connections in the job in total. As n increases, the number of RPC connections quickly grows beyond the capacity of the current Giraph system.
- Fault tolerance: when "connection reset by peer" or "broken pipe" happens, the job just hangs, then eventually dies after a timeout. There is no re-establishment of the connection and no automatic restart of a failed worker.

I am very curious whether there is any plan to address these two limitations.

BTW: I checked out the latest code from trunk. I did see some code using netty, but BasicRPCCommunications is still using Hadoop RPC. Is there a knob or something I need to turn on to use netty?

Yuanyuan


From: Yuanyuan Tian/Almaden/IBM
To: user@giraph.apache.org
Cc: user@giraph.apache.org
Date: 06/28/2012 10:16 AM
Subject: Re: wierd communication errors

I can try the netty version then. But what is the cause of reset by peer? A timeout? And if it happens, how can I reestablish the connection? I can add some code to check the connection first and to reestablish it if it was reset by peer, before calling putVertexIdMessagesList.

Yuanyuan


From: Avery Ching <aching@apache.org>
To: user@giraph.apache.org
Date: 06/28/2012 01:20 AM
Subject: Re: wierd communication errors

In my testing, I found the netty implementation of Giraph (trunk) to be more stable than Hadoop RPC. But you can't do too much (other than reestablish the connection) when the connection is reset by peer.

Avery

On 6/28/12 12:29 AM, Yuanyuan Tian wrote:

I want to make a correction about the errors. The error should be as follows; the errors in my previous email were from my added debug message.
But the problem is the same: somehow some connection was reset by peer. I did more tries. Occasionally, my job can actually run without a problem, but more often the job fails because of this connection reset problem. I really don't have a clue what the problem is.

Yuanyuan

java.lang.IllegalStateException: run: Caught an unrecoverable exception flush: Got ExecutionException
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:859)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: flush: Got ExecutionException
        at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1085)
        at org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:1080)
        at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:806)
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:850)
        ... 7 more
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: Call to idp33.almaden.ibm.com/172.16.0.33:30054 failed on local exception: java.io.IOException: Connection reset by peer
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1080)
        ... 10 more
Caused by: java.lang.RuntimeException: java.io.IOException: Call to idp33.almaden.ibm.com/172.16.0.33:30054 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:379)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Call to idp33.almaden.ibm.com/172.16.0.33:30054 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
        at org.apache.hadoop.ipc.Client.call(Client.java:1033)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
        at $Proxy3.putVertexIdMessagesList(Unknown Source)
        at org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:339)
        ... 6 more
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:343)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:767)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)


From: Yuanyuan Tian/Almaden/IBM@IBMUS
To: user@giraph.apache.org
Cc: user@giraph.apache.org
Date: 06/27/2012 10:02 PM
Subject: Re: wierd communication errors

What do you mean by using netty? I am not aware that Giraph is using netty. I am just using whatever the default giraph release 1.0 is using.

Yuanyuan


From: Avery Ching <aching@apache.org>
To: user@giraph.apache.org
Date: 06/27/2012 07:57 PM
Subject: Re: wierd communication errors

Same issue using netty as well?

On 6/27/12 6:14 PM, Yuanyuan Tian wrote:

Hi,

I was running a giraph job where I constantly got the following communication-related errors. The symptom is that in superstep 0, most of the workers succeeded but a few of the workers produced the errors below; the machines that caused the connection reset are different in each failed worker.
To rule out the possibility of a cluster setup error, I also ran a different job and it worked fine. So, the error must be caused by this particular giraph job. My giraph job is just a normal message-propagation type of job, except that the message is not of a unique type. Therefore, I defined a special message type (also copied in this email) that incorporates two different types of messages: an integer message and a double-array message. I have tried all day but still couldn't pinpoint the source of the bug. Can anyone give me some hints on what may have caused this error?

Thanks a lot,

java.lang.IllegalStateException: run: Caught an unrecoverable exception flush: Got ExecutionException
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:859)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: flush: Got ExecutionException
        at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1082)
        at org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:1080)
        at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:806)
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:850)
        ... 7 more
Caused by: java.util.concurrent.ExecutionException: java.lang.reflect.UndeclaredThrowableException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1077)
        ... 10 more
Caused by: java.lang.reflect.UndeclaredThrowableException
        at $Proxy3.getName(Unknown Source)
        at org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:335)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Call to idp35.almaden.ibm.com/172.16.0.35:30083 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
        at org.apache.hadoop.ipc.Client.call(Client.java:1033)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
        ... 8 more
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:343)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:767)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)

My special message type:

public class MyMessageWritable implements Writable {
    public byte msgType = 0;
    public long vertexID = -1;
    public double[] arrayMsg = null;
    public int intMsg = -1;

    public MyMessageWritable() {
    }

    public MyMessageWritable(long id, byte tp, int msg) {
        vertexID = id;
        msgType = tp;
        intMsg = msg;
    }

    public MyMessageWritable(long id, byte tp, double[] arr) {
        vertexID = id;
        msgType = tp;
        arrayMsg = arr;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        vertexID = in.readLong();
        msgType = in.readByte();
        switch (msgType) {
        case 1:
        case 4:
            intMsg = in.readInt();
            break;
        case 2:
        case 3:
            if (arrayMsg == null)
                arrayMsg = new double[MyVertex.K];
            for (int i = 0; i < MyVertex.K; i++)
                arrayMsg[i] = in.readDouble();
            break;
        default:
            throw new IOException("message type invalid: " + msgType);
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(vertexID);
        out.writeByte(msgType);
        switch (msgType) {
        case 1:
        case 4:
            out.writeInt(intMsg);
            break;
        case 2:
        case 3:
            if (arrayMsg == null)
                throw new IOException("array message is null");
            for (int i = 0; i < MyVertex.K; i++)
                out.writeDouble(arrayMsg[i]);
            break;
        default:
            throw new IOException("message type invalid: " + msgType);
        }
    }
}
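For anyone who wants to poke at the wire format without a Hadoop classpath, here is a minimal round-trip sketch. It is not Giraph code: plain java.io streams stand in for Hadoop's Writable machinery, and a local constant K (set to 3 here for illustration) stands in for MyVertex.K.

```java
import java.io.*;

// Round-trip sketch of the byte layout MyMessageWritable produces:
// long vertexID, byte msgType, then either an int or K doubles.
// Standalone; K = 3 is an assumed stand-in for MyVertex.K.
public class MessageRoundTrip {
    static final int K = 3;

    public static void main(String[] args) throws IOException {
        // Encode a type-2 (double-array) message the way write() does.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(42L);
        out.writeByte(2);
        double[] payload = {0.1, 0.2, 0.3};
        for (double d : payload) out.writeDouble(d);
        // 8 (long) + 1 (byte) + 3 * 8 (doubles) = 33 bytes on the wire.

        // Decode it back in the same order readFields() uses.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        long vertexID = in.readLong();
        byte msgType = in.readByte();
        double[] arrayMsg = new double[K];
        for (int i = 0; i < K; i++) arrayMsg[i] = in.readDouble();

        System.out.println(vertexID + " type=" + msgType + " first=" + arrayMsg[0]);
        // prints: 42 type=2 first=0.1
    }
}
```

If both peers agree on MyVertex.K, the same number of bytes is written and read for every array message, so a decode failure on this format points at transport trouble rather than the serialization itself.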