giraph-user mailing list archives

From Suijian Zhou <suijian.z...@gmail.com>
Subject Re: zookeeper problem in giraph..
Date Tue, 08 Apr 2014 14:18:09 GMT
Hi, Lukas,
  Do you know how to modify the ZooKeeper timeout settings in Giraph? I
see the session is established on the server with negotiated timeout = 600000
(milliseconds, i.e. 600 s). I think this is enough for the job, as the job gets
aborted after only a few minutes. I am really confused here: why did the server
close the connection well before the 600 s had elapsed?

14/04/07 16:50:05 INFO mapred.JobClient: Running job: job_201404071009_0042
14/04/07 16:50:05 INFO zookeeper.ClientCnxn: Opening socket connection to
server compute-0-19.local/10.1.255.235:22181. Will not attempt to
authenticate using SASL (unknown error)
14/04/07 16:50:05 INFO zookeeper.ClientCnxn: Socket connection established
to compute-0-19.local/10.1.255.235:22181, initiating session
14/04/07 16:50:05 INFO zookeeper.ClientCnxn: Session establishment complete
on server compute-0-19.local/10.1.255.235:22181, sessionid =
0x1453e2b3cca0009, negotiated timeout = 600000
......
......
14/04/07 16:51:27 INFO zookeeper.ClientCnxn: Unable to read additional data
from server sessionid 0x1453e2b3cca0009, likely server has closed socket,
closing socket connection and attempting reconnect
14/04/07 16:51:29 INFO zookeeper.ClientCnxn: Opening socket connection to
server compute-0-19.local/10.1.255.235:22181. Will not attempt to
authenticate using SASL (unknown error)
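
To put a number on that gap, here is a minimal sketch that parses the two timestamps from the log above (session established, then "Unable to read additional data") and compares the elapsed time with the negotiated timeout. The class and method names are only illustrative, and the timestamp layout yy/MM/dd HH:mm:ss is an assumption read off the log lines.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class SessionGap {
    // Assumed layout of the timestamps in the log lines above: yy/MM/dd HH:mm:ss
    private static final DateTimeFormatter LOG_FORMAT =
            DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss");

    // Whole seconds between two log timestamps.
    static long secondsBetween(String start, String end) {
        LocalDateTime s = LocalDateTime.parse(start, LOG_FORMAT);
        LocalDateTime e = LocalDateTime.parse(end, LOG_FORMAT);
        return Duration.between(s, e).getSeconds();
    }

    public static void main(String[] args) {
        // Session established at 16:50:05, socket lost at 16:51:27 (from the log).
        long elapsed = secondsBetween("14/04/07 16:50:05", "14/04/07 16:51:27");
        long negotiatedTimeoutSeconds = 600000 / 1000; // negotiated timeout is in ms
        System.out.println("elapsed=" + elapsed + "s, negotiated timeout="
                + negotiatedTimeoutSeconds + "s");
    }
}
```

The socket was lost 82 s after the session was established, far below the 600 s timeout, so the session timeout itself does not look like the trigger. The "Connection refused" on the reconnect attempt (see the quoted message below) may instead mean that nothing was listening on port 22181 any more at that moment, but that is only a guess from the log.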

  Best Regards,
  Suijian



2014-04-07 16:59 GMT-05:00 Suijian Zhou <suijian.zhou@gmail.com>:

> Hi, Lukas,
>   I applied the patch to
> giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyClient.java and
> recompiled Giraph with "mvn compile", but I still get the same error:
>
> 14/04/07 16:51:26 INFO job.JobProgressTracker: Data from 8 workers -
> Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64
> partitions computed; min free memory on worker 5 - 270.76MB, average
> 394.74MB
> 14/04/07 16:51:27 INFO zookeeper.ClientCnxn: Unable to read additional
> data from server sessionid 0x1453e2b3cca0009, likely server has closed
> socket, closing socket connection and attempting reconnect
> 14/04/07 16:51:29 INFO zookeeper.ClientCnxn: Opening socket connection to
> server compute-0-19.local/10.1.255.235:22181. Will not attempt to
> authenticate using SASL (unknown error)
> 14/04/07 16:51:29 WARN zookeeper.ClientCnxn: Session 0x1453e2b3cca0009 for
> server null, unexpected error, closing socket connection and attempting
> reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>     at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
> 14/04/07 16:51:31 INFO zookeeper.ClientCnxn: Opening socket connection to
> server compute-0-19.local/10.1.255.235:22181. Will not attempt to
> authenticate using SASL (unknown error)
>
>   I tried to modify some parameters in
> ./giraph-core/src/main/java/org/apache/giraph/conf/GiraphConstants.java,
> such as DEFAULT_ZOOKEEPER_MAX_CLIENT_CNXNS,
> but it seems to have no effect. Any hints?
>
>   Best Regards,
>   Suijian
>
>
>
> 2014-04-07 9:34 GMT-05:00 Suijian Zhou <suijian.zhou@gmail.com>:
>
>> Hi, Lukas,
>>   Thank you, but when I tried to apply the patch, I got:
>> 2014.04.07|09:25:47~/giraph/giraph-core/src> git apply --check
>> NettyClient_Timeout.patch
>> error: patch failed:
>> giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyClient.java:153
>> error:
>> giraph-core/src/main/java/org/apache/giraph/comm/netty/NettyClient.java:
>> patch does not apply
>>
>>   Could you send me the patched NettyClient.java file directly?
>> Thanks!
>>
>>   Best Regards,
>>   Suijian
>>
>>
>>
>> 2014-04-04 17:12 GMT-05:00 Lukas Nalezenec <
>> lukas.nalezenec@firma.seznam.cz>:
>>
>>> Hi,
>>>
>>> I had a similar issue; it was caused by long GC pauses. I patched
>>> NettyClient so that when a reconnect fails, it sleeps for some time before
>>> the next try. The patch is enclosed. Let me know if it works for you.
>>> I would also try tuning GC. You can also try using
>>> giraph.waitForRequestsConfirmation and giraph.maxNumberOfOpenRequests.
>>> I hope I am right.
>>>
>>> Regards
>>> Lukas
>>>
>>>
>>> On 4.4.2014 22:49, Suijian Zhou wrote:
>>>
>>>   Hi,
>>>   I have a ZooKeeper problem when running a Giraph program: the program
>>> is aborted in superstep 2 with:
>>> 14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server compute-0-18.local/10.1.255.236:22181. Will not attempt to
>>> authenticate using SASL (unknown error)
>>> 14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Socket connection
>>> established to compute-0-18.local/10.1.255.236:22181, initiating session
>>> 14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server compute-0-18.local/10.1.255.236:22181, sessionid =
>>> 0x1452e7c79910009, negotiated timeout = 600000
>>> ......
>>> 14/04/04 15:46:08 INFO job.JobProgressTracker: Data from 8 workers -
>>> Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64
>>> partitions computed; min free memory on worker 3 - 270.37MB, average
>>> 451.21MB
>>> 14/04/04 15:46:13 INFO job.JobProgressTracker: Data from 8 workers -
>>> Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64
>>> partitions computed; min free memory on worker 6 - 249.25MB, average
>>> 404.02MB
>>> 14/04/04 15:46:16 INFO zookeeper.ClientCnxn: Unable to read additional
>>> data from server sessionid 0x1452e7c79910009, likely server has closed
>>> socket, closing socket connection and attempting reconnect
>>> 14/04/04 15:46:17 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server compute-0-18.local/10.1.255.236:22181. Will not attempt to
>>> authenticate using SASL (unknown error)
>>> 14/04/04 15:46:17 WARN zookeeper.ClientCnxn: Session 0x1452e7c79910009
>>> for server null, unexpected error, closing socket connection and attempting
>>> reconnect
>>> java.net.ConnectException: Connection refused
>>>
>>>
>>>  Each rerun of the program leads to a different compute node
>>> reporting the same error ("Unable to read additional data from server
>>> sessionid...").
>>>
>>>  What the program does in superstep 2 is:
>>>   if (getSuperstep() == 2) {
>>>     for (IntWritable message: messages) {
>>>         for (Edge<IntWritable, IntWritable> edge: vertex.getEdges()) {
>>>            sendMessage(edge.getTargetVertexId(), message);
>>>            //int abc=0;
>>>         }
>>>     }
>>>   }
>>>
>>>  I checked that if I replace the line
>>> "sendMessage(edge.getTargetVertexId(), message);" with a meaningless
>>> line like "int abc=0;", the program finishes successfully. It looks like a
>>> ZooKeeper problem, but ZooKeeper seems to come with Giraph, as I did not
>>> install it separately.  I tried to modify parameters in GiraphConstants.java
>>> and recompile Giraph, but this does not seem to take effect: the screen
>>> output shows that the parameters were not changed at all.  Any hints?
>>>
>>>    Best Regards,
>>>    Suijian
>>>
>>>
>>>
>>
>
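
For reference, the idea Lukas describes in the quoted thread (sleep for some time after a failed reconnect before trying again) can be sketched in plain Java as below. This is not the actual NettyClient patch, which is not shown in the thread; the class name, method name, and parameters are all hypothetical, and the "connection" is just a callback that returns true on success.

```java
import java.util.concurrent.Callable;

public class RetryWithPause {
    /**
     * Calls connect until it succeeds or maxAttempts is reached, sleeping
     * pauseMillis after each failed attempt.
     * Returns the 1-based attempt that succeeded, or -1 if none did.
     */
    static int connectWithPause(Callable<Boolean> connect, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean ok;
            try {
                ok = Boolean.TRUE.equals(connect.call());
            } catch (Exception e) {
                ok = false; // e.g. "Connection refused": count it as a failed attempt
            }
            if (ok) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis); // give the peer time to recover
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical peer that refuses the first two attempts.
        int attempts = connectWithPause(() -> ++calls[0] >= 3, 5, 50L);
        System.out.println("connected after " + attempts + " attempt(s)");
    }
}
```

A fixed pause is the simplest choice here. If the peer is stalled in a long GC pause, a pause that grows with each failed attempt might be a better fit, but that is a design option and not something the thread confirms.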
