hadoop-zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: zookeeper on ec2
Date Wed, 02 Sep 2009 00:11:08 GMT
Depends on what your tests are. Are they pretty simple/light? Then it's 
probably a network issue. Heavy load testing? Then it might be the 
server/client, or it might be the network.

The easiest thing is to run a ping test while running your ZK test and see 
whether pings are getting through (and at what latency). You should also 
review your client/server logs for any information around the time of the 
ConnectionLoss.
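
As one way to run such a check from the client host, the sketch below times
TCP connects to the server's client port instead of sending ICMP pings. It is
only an illustration: the host name is a placeholder, port 2181 is assumed
(ZooKeeper's default client port), and the function names are made up for
this example.

```python
import socket
import time


def probe_once(host, port, timeout=2.0):
    """Return the TCP connect latency in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return None


def probe(host, port, count=10, interval=1.0, timeout=2.0):
    """Probe `count` times; each entry is a latency in ms or None (failed)."""
    results = []
    for i in range(count):
        results.append(probe_once(host, port, timeout))
        if i < count - 1:
            time.sleep(interval)
    return results


def summarize(results):
    """Count failures and report the worst successful latency."""
    ok = [r for r in results if r is not None]
    return {
        "sent": len(results),
        "failed": len(results) - len(ok),
        "max_ms": max(ok) if ok else None,
    }


if __name__ == "__main__":
    # Placeholder host name; replace with the ZooKeeper server's address.
    print(summarize(probe("zk1.example.com", 2181)))
```

Running this alongside the ZK test and lining up failed or slow probes with
the timestamps of ConnectionLoss entries in the client log should hint at
whether the network is involved.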

Ted Dunning would be a good resource - he runs ZK inside ec2 and has 
a lot of experience with it.

Patrick

Satish Bhatti wrote:
> For my initial testing I am running with a single ZooKeeper server, i.e. the
> ensemble only has one server.  Not sure if this is exacerbating the problem?
> I will check out the troubleshooting link you sent me.
> 
> On Tue, Sep 1, 2009 at 5:01 PM, Patrick Hunt <phunt@apache.org> wrote:
> 
>> I'm not very familiar with the ec2 environment. Are you doing any monitoring,
>> in particular of network connectivity between nodes? Sounds like networking
>> issues between nodes (I'm assuming you've also looked at stuff like this
>> http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting and verified that
>> you are not swapping (see gc pressure), etc...)
>>
>> Patrick
>>
>>
>> Satish Bhatti wrote:
>>
>>> Session timeout is 30 seconds.
>>>
>>> On Tue, Sep 1, 2009 at 4:26 PM, Patrick Hunt <phunt@apache.org> wrote:
>>>
>>>> What is your client timeout? It may be too low.
>>>> Also see this section on handling recoverable errors:
>>>> http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling
>>>>
>>>> connection loss in particular needs special care since:
>>>> "When a ZooKeeper client loses a connection to the ZooKeeper server there
>>>> may be some requests in flight; we don't know where they were in their
>>>> flight at the time of the connection loss. "
>>>>
>>>> Patrick
>>>>
>>>>
>>>> Satish Bhatti wrote:
>>>>
>>>>> I have recently started running on EC2 and am seeing quite a few
>>>>> ConnectionLoss exceptions.  Should I just catch these and retry?  Since I
>>>>> assume that eventually, if the shit truly hits the fan, I will get a
>>>>> SessionExpired?
>>>>> Satish
>>>>>
>>>>> On Mon, Jul 6, 2009 at 11:35 AM, Ted Dunning <ted.dunning@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We have used EC2 quite a bit for ZK.
>>>>>>
>>>>>> The basic lessons that I have learned include:
>>>>>>
>>>>>> a) EC2's biggest advantage after scaling and elasticity was conformity
>>>>>> of configuration.  Since you are bringing machines up and down all the
>>>>>> time, they begin to act more like programs and you wind up with boot
>>>>>> scripts that give you a very predictable environment.  Nice.
>>>>>>
>>>>>> b) EC2 interconnect has a lot more going on than in a dedicated VLAN.
>>>>>> That can make the ZK servers appear a bit less connected.  You have to
>>>>>> plan for ConnectionLoss events.
>>>>>>
>>>>>> c) for highest reliability, I switched to large instances.  On
>>>>>> reflection, I think that was helpful, but less important than I thought
>>>>>> at the time.
>>>>>>
>>>>>> d) increasing and decreasing cluster size is nearly painless and is
>>>>>> easily scriptable.  To decrease, do a rolling update on the survivors to
>>>>>> update their configuration.  Then take down the instance you want to
>>>>>> lose.  To increase, do a rolling update starting with the new instances
>>>>>> to update the configuration to include all of the machines.  The rolling
>>>>>> update should bounce each ZK with several seconds between each bounce.
>>>>>> Rescaling the cluster takes less than a minute which makes it comparable
>>>>>> to EC2 instance boot time (about 30 seconds for the Alestic ubuntu
>>>>>> instance that we used plus about 20 seconds for additional
>>>>>> configuration).
>>>>>>
>>>>>> On Mon, Jul 6, 2009 at 4:45 AM, David Graf <david.graf@28msec.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello
>>>>>>>
>>>>>>> I wanna set up a zookeeper ensemble on amazon's ec2 service. In my
>>>>>>> system, zookeeper is used to run a locking service and to generate
>>>>>>> unique ids. Currently, for testing purposes, I am only running one
>>>>>>> instance. Now, I need to set up an ensemble to protect my system
>>>>>>> against crashes.
>>>>>>> The ec2 service has some differences from a normal server farm. E.g.
>>>>>>> the data saved on the file system of an ec2 instance is lost if the
>>>>>>> instance crashes. In the documentation of zookeeper, I have read that
>>>>>>> zookeeper saves snapshots of the in-memory data in the file system. Is
>>>>>>> that needed for recovery? Logically, it would be much easier for me if
>>>>>>> this is not the case.
>>>>>>> Additionally, ec2 brings the advantage that servers can be switched on
>>>>>>> and off dynamically depending on the load, traffic, etc. Can this
>>>>>>> advantage be utilized for a zookeeper ensemble? Is it possible to add a
>>>>>>> zookeeper server dynamically to an ensemble? E.g. depending on the
>>>>>>> in-memory load?
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>>
> 
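
On the retry question quoted above (catching ConnectionLoss and retrying,
while treating SessionExpired as final), a minimal sketch of that policy
could look like the following. It does not use a real ZooKeeper client: the
two exception classes are stand-ins for whatever the client library raises,
and the helper name and backoff values are assumptions made for this
illustration. Because a request may have been in flight when the connection
dropped, the retried operation is assumed to be idempotent.

```python
import time


class ConnectionLoss(Exception):
    """Stand-in for the client library's connection-loss error (hypothetical)."""


class SessionExpired(Exception):
    """Stand-in for the client library's session-expired error (hypothetical)."""


def retry_on_connection_loss(op, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call op() and retry it with exponential backoff on ConnectionLoss.

    op must be idempotent: after a connection loss the earlier attempt may
    or may not have been applied on the server. SessionExpired is not
    caught here, so it propagates to the caller on the first occurrence.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionLoss:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))
```

A caller would wrap each read or idempotent write in
`retry_on_connection_loss(lambda: ...)` and handle SessionExpired separately,
typically by building a new session and re-creating any ephemeral state.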
