hadoop-zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: zookeeper on ec2
Date Thu, 03 Sep 2009 16:12:36 GMT
Ted, that's great feedback. I identified a couple of additional things to 
verify after reading your comments:

1) Ensure that you don't have debug-level logging turned on; see:
https://issues.apache.org/jira/browse/ZOOKEEPER-518
(fixed in 3.2.1, but in general you probably don't want to run anything 
lower than INFO in production except when attempting to track down some 
problem).

2) It would be a good idea to review the server and client zk logs to see 
if there's any insight there as to what might be causing the high 
latencies. For example, the other day we had an issue where client code 
was misbehaving and degrading the performance of the server. Reviewing 
the logs allowed the developer to identify the client problem and 
address it.
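
For point 1, a rough way to check a config for overly verbose logging is 
sketched below. It assumes a log4j 1.x style properties file (keys such as 
log4j.rootLogger and log4j.logger.<name>); the function name 
verbose_loggers and the sample config are made up for illustration, and 
values that use ${...} substitution are simply skipped.

```python
# Log4j 1.x levels ordered from most to least verbose.
LEVELS = ["ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"]


def verbose_loggers(properties_text, floor="INFO"):
    """Return (key, level) pairs whose level is more verbose than `floor`.

    Hypothetical helper: only looks at log4j.rootLogger and
    log4j.logger.* entries in log4j 1.x properties syntax.
    """
    floor_rank = LEVELS.index(floor)
    found = []
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not key=value pairs.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "log4j.rootLogger" or key.startswith("log4j.logger."):
            # The level is the first comma-separated token; appenders follow.
            level = value.split(",")[0].strip().upper()
            if level in LEVELS and LEVELS.index(level) < floor_rank:
                found.append((key, level))
    return found


# Made-up sample config, not taken from any real deployment.
sample = """
# hypothetical config
log4j.rootLogger=DEBUG, CONSOLE
log4j.logger.org.apache.zookeeper=INFO
"""
print(verbose_loggers(sample))  # [('log4j.rootLogger', 'DEBUG')]
```

An empty list means nothing in the file is set below INFO, at least among 
the entries this sketch understands.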

Patrick

Ted Dunning wrote:
> I always used a large node for ZK to avoid sharing the machine, but the
> reason for doing that turned out to be incorrect.  In fact, my problem was
> to do with GC on the client side.
> 
> I can't believe that they are seeing 50 second delays in EC2 due to I/O
> contention.  GC can do that, but only on a large heap.  Massive swapping of
> code pages can also cause this.
> 
> My debug path here would be:
> 
> a) verify the facts.  The key fact is that the ZK cluster is occasionally
> giving massive latency.  This must be verified to be the real problem and
> not an accidental incident.  It is possible that the problem is not where we
> think it is.
> 
> b) check for the usual configuration suspects.  ZK should be alone on a
> machine.  DNS should be checked.  Connectivity should be checked between all
> hosts.
> 
> c) look for swapping, look at GC logs.  Something has to give a clue as to
> how the latency is 1000x longer than usual.
> 
> d) fix what came from (b) or (c) step.
> 
> I am at a loss here other than this general advice.  I strongly suspect that
> something is being observed incorrectly or the machines are being massively
> abused.
> 
> On Wed, Sep 2, 2009 at 12:37 PM, Patrick Hunt <phunt@apache.org> wrote:
> 
>> I suspect that given a single disk is being used (not a dedicated disk for
>> the transaction log), and also given that this host is highly virtualized
>> (ec2), it seems to me that the most likely cause is IO. Specifically when
>> the zk cluster writes data to disk (due to client write) it must sync the
>> transaction log to disk. This sync behavior can impact the latency seen by
>> the clients. What type of ec2 node are you using? Ted, do you have any
>> insight on this? Any guidelines for the type of ec2 node to use for running
>> a zk cluster?
>>
> 
> 
> 
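
On the transaction log sync mentioned in the quoted note above: one way to 
get a feel for whether the disk could be the problem is to time a series 
of small write+fsync calls on the volume that holds the log. This is only 
a sketch of that idea, not how ZooKeeper itself measures anything; the 
function name fsync_latencies, the 1 KB payload, and the use of the 
system temp directory are assumptions for illustration.

```python
import os
import tempfile
import time


def fsync_latencies(directory, writes=50, payload=b"x" * 1024):
    """Append `payload` and fsync `writes` times; return per-sync seconds.

    Hypothetical probe: point `directory` at the disk you want to test.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage, then time it
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


if __name__ == "__main__":
    samples = fsync_latencies(tempfile.gettempdir())
    print("max fsync: %.1f ms" % (max(samples) * 1000))
```

If the worst-case sync time is nowhere near the delays the clients report, 
that points away from disk I/O and back toward GC or swapping as Ted 
suggests.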
