Return-Path: X-Original-To: apmail-giraph-user-archive@www.apache.org Delivered-To: apmail-giraph-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4F805D561 for ; Thu, 28 Jun 2012 01:15:18 +0000 (UTC) Received: (qmail 79868 invoked by uid 500); 28 Jun 2012 01:15:18 -0000 Delivered-To: apmail-giraph-user-archive@giraph.apache.org Received: (qmail 79826 invoked by uid 500); 28 Jun 2012 01:15:17 -0000 Mailing-List: contact user-help@giraph.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@giraph.apache.org Delivered-To: mailing list user@giraph.apache.org Received: (qmail 79818 invoked by uid 99); 28 Jun 2012 01:15:17 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 28 Jun 2012 01:15:17 +0000 X-ASF-Spam-Status: No, hits=-2.8 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_HI,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of ytian@us.ibm.com designates 32.97.182.138 as permitted sender) Received: from [32.97.182.138] (HELO e8.ny.us.ibm.com) (32.97.182.138) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 28 Jun 2012 01:15:05 +0000 Received: from /spool/local by e8.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Wed, 27 Jun 2012 21:14:43 -0400 Received: from d01dlp02.pok.ibm.com (9.56.224.85) by e8.ny.us.ibm.com (192.168.1.108) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Wed, 27 Jun 2012 21:14:43 -0400 Received: from d01relay05.pok.ibm.com (d01relay05.pok.ibm.com [9.56.227.237]) by d01dlp02.pok.ibm.com (Postfix) with ESMTP id 6F9BB6E804C for ; Wed, 27 Jun 2012 21:14:42 -0400 (EDT) Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay05.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q5S1EgOw185734 for ; Wed, 27 Jun 2012 21:14:42 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q5S1EfsQ017642 for ; Wed, 27 Jun 2012 22:14:41 -0300 Received: from d01ml604.pok.ibm.com (d01ml604.pok.ibm.com [9.56.227.90]) by d01av02.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVin) with ESMTP id q5S1Ef2s017614 for ; Wed, 27 Jun 2012 22:14:41 -0300 To: user@giraph.apache.org MIME-Version: 1.0 Subject: wierd communication errors X-KeepSent: 3636024A:E890B71F-85257A2B:00048532; type=4; name=$KeepSent X-Mailer: Lotus Notes Release 8.5.1FP5 SHF29 November 12, 2010 Message-ID: From: Yuanyuan Tian Date: Wed, 27 Jun 2012 18:14:36 -0700 X-MIMETrack: Serialize by Router on D01ML604/01/M/IBM(Release 8.5.3HF266 | January 13, 2012) at 06/27/2012 21:14:41, Serialize complete at 06/27/2012 21:14:41 Content-Type: multipart/alternative; boundary="=_alternative 0006D3EA88257A2B_=" X-Content-Scanned: Fidelis XPS MAILER x-cbid: 12062801-9360-0000-0000-000007D9E963 This is a multipart message in MIME format. --=_alternative 0006D3EA88257A2B_= Content-Type: text/plain; charset="US-ASCII" Hi, I was running a giraph job where I constantly got the following communication related errors. The symptom is that in super step 0, most of the workers succeeded but a few of the workers produced the errors below, the machines that caused the connection reset are different in each failed worker. To rule out the probability of the cluster setup error, I also ran a different job and it worked fine. So, the error must be caused by this particular giraph job. My giraph job is just normal message propagation type of job, except that the message is not a of a unique type. Therefore, I defined a special message type (also copied in this email) that incorporates two different types of messages: integer message and double array message. I have tried all day but still couldn't ping point the source of the bug. Can anyone give me some hints on what may have caused this error? Thanks a lot, java.lang.IllegalStateException: run: Caught an unrecoverable exception flush: Got ExecutionException at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:859) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) at org.apache.hadoop.mapred.Child$4.run(Child.java:259) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:253) Caused by: java.lang.IllegalStateException: flush: Got ExecutionException at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1082) at org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:1080) at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:806) at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:850) ... 7 more Caused by: java.util.concurrent.ExecutionException: java.lang.reflect.UndeclaredThrowableException at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1077) ... 10 more Caused by: java.lang.reflect.UndeclaredThrowableException at $Proxy3.getName(Unknown Source) at org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:335) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Call to idp35.almaden.ibm.com/172.16.0.35:30083 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065) at org.apache.hadoop.ipc.Client.call(Client.java:1033) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224) ... 8 more Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202) at sun.nio.ch.IOUtil.read(IOUtil.java:175) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:343) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:767) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712) My special messge type: public class MyMessageWritable implements Writable{ public byte msgType=0; public long vertexID=-1; public double[] arrayMsg=null; public int intMsg=-1; public MyMessageWritable () { } public MyMessageWritable (long id, byte tp, int msg) { vertexID=id; msgType=tp; intMsg=msg; } public MyMessageWritable (long id, byte tp, double[] arr) { vertexID=id; msgType=tp; arrayMsg=arr; } @Override public void readFields(DataInput in) throws IOException { vertexID=in.readLong(); msgType=in.readByte(); switch(msgType) { case 1: case 4: intMsg=in.readInt(); break; case 2: case 3: if(arrayMsg==null) arrayMsg=new double[MyVertex.K]; for(int i=0; iHi,

I was running a giraph job where I constantly got the following communication related errors. The symptom is that in super step 0, most of the workers succeeded but a few of the workers produced the errors below, the machines that caused the connection reset are different in each failed worker. To rule out the probability of the cluster setup error, I also ran a different job and it worked fine. So, the error must be caused by this particular giraph job. My giraph job is just normal message propagation type of job, except that the message is not a of a unique type. Therefore, I defined a special message type (also copied in this email) that incorporates two different types of messages: integer message and double array message.  I have tried all day but still couldn't ping point the source of the bug. Can anyone give me some hints on what may have caused this error?

Thanks a lot,

java.lang.IllegalStateException: run: Caught an unrecoverable exception flush: Got ExecutionException
                at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:859)
                at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
                at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:396)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
                at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: flush: Got ExecutionException
                at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1082)
                at org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:1080)
                at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:806)
                at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:850)
                ... 7 more
Caused by: java.util.concurrent.ExecutionException: java.lang.reflect.UndeclaredThrowableException
                at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
                at java.util.concurrent.FutureTask.get(FutureTask.java:83)
                at org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1077)
                ... 10 more
Caused by: java.lang.reflect.UndeclaredThrowableException
                at $Proxy3.getName(Unknown Source)
                at org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:335)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
                at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
                at java.util.concurrent.FutureTask.run(FutureTask.java:138)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Call to idp35.almaden.ibm.com/172.16.0.35:30083 failed on local exception: java.io.IOException: Connection reset by peer
                at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
                at org.apache.hadoop.ipc.Client.call(Client.java:1033)
                at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
                ... 8 more
Caused by: java.io.IOException: Connection reset by peer
                at sun.nio.ch.FileDispatcher.read0(Native Method)
                at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
                at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
                at sun.nio.ch.IOUtil.read(IOUtil.java:175)
                at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
                at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
                at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
                at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
                at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
                at java.io.FilterInputStream.read(FilterInputStream.java:116)
                at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:343)
                at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
                at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
                at java.io.DataInputStream.readInt(DataInputStream.java:370)
                at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:767)
                at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)


My special messge type:

public class MyMessageWritable implements Writable{

        public byte msgType=0;
        public long vertexID=-1;
        public double[] arrayMsg=null;
        public int intMsg=-1;
       
        public MyMessageWritable ()
        {
        }
       
        public MyMessageWritable (long id, byte tp, int msg)
        {
                vertexID=id;
                msgType=tp;
                intMsg=msg;
        }
       
        public MyMessageWritable (long id, byte tp, double[] arr)
        {
                vertexID=id;
                msgType=tp;
                arrayMsg=arr;
        }
       
        @Override
        public void readFields(DataInput in) throws IOException {
                vertexID=in.readLong();
                msgType=in.readByte();
                switch(msgType)
                {
                case 1:
                case 4:
                        intMsg=in.readInt();
                        break;
                case 2:
                case 3:
                        if(arrayMsg==null)
                                arrayMsg=new double[MyVertex.K];
                        for(int i=0; i<MyVertex.K; i++)
                                arrayMsg[i]=in.readDouble();
                        break;
                default:
                                throw new IOException("message type invalid: "+msgType);
                }
        }

        @Override
        public void write(DataOutput out) throws IOException {
                out.writeLong(vertexID);
                out.writeByte(msgType);
                switch(msgType)
                {
                case 1:
                case 4:
                        out.writeInt(intMsg);
                        break;
                case 2:
                case 3:
                        if(arrayMsg==null)
                                throw new IOException("array message is null");
                        for(int i=0; i<MyVertex.K; i++)
                                out.writeDouble(arrayMsg[i]);
                        break;
                default:
                                throw new IOException("message type invalid: "+msgType);
                }
               
        } --=_alternative 0006D3EA88257A2B_=--