Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: zookeeper-user@hadoop.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-ID: <4B95364F.6020006@apache.org>
Date: Mon, 08 Mar 2010 09:39:27 -0800
From: Patrick Hunt <phunt@apache.org>
User-Agent: Thunderbird 2.0.0.23 (X11/20090817)
MIME-Version: 1.0
To: zookeeper-user@hadoop.apache.org
Subject: Re: Managing multi-site clusters with Zookeeper
References: <C7B97006.30C65%mahadev@yahoo-inc.com>
	 <C5C581E9-FA46-4F86-8C05-19C85C25F1B2@gmail.com>
 <8bc75ecf1003080542t410703h28401fa3c05f0cf9@mail.gmail.com>
In-Reply-To: <8bc75ecf1003080542t410703h28401fa3c05f0cf9@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

IMO latency is the primary issue you will face, but also keep in mind
reliability within a colo.

Say you have 3 colos (obviously it can't be 2, since losing one of two
colos can leave you without a majority). If you only have 3 servers, one
in each colo, you will be reliable, but clients within each colo will
have to connect to a remote colo if the local server fails. You will
want to prioritize the local colo, given that reads can be serviced
entirely locally that way. If you have 7 servers (2-2-3) that would be
better: if a local server fails you have a redundant one, and if both
fail then you go remote.
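
To make the majority argument concrete, here is a rough sketch (my own
illustration, not anything shipped with ZooKeeper) that checks whether a
layout of servers across colos still has a majority after losing any
single colo. The function name and the example layouts are made up for
this message.

```python
def survives_any_colo_loss(servers_per_colo):
    """Return True if a majority of servers remains after any one colo fails.

    servers_per_colo is a list of server counts, one entry per colo.
    This only models simple majority quorums.
    """
    total = sum(servers_per_colo)
    majority = total // 2 + 1  # strict majority of the full ensemble
    # Losing colo i removes servers_per_colo[i] servers from the ensemble.
    return all(total - lost >= majority for lost in servers_per_colo)


if __name__ == "__main__":
    # Hypothetical layouts; the 1-1-1 and 2-2-3 cases are the ones discussed above.
    for layout in ([2, 1], [3, 3], [1, 1, 1], [2, 2, 3]):
        print(layout, survives_any_colo_loss(layout))
```

With two colos, whichever colo holds half or more of the servers takes
the quorum down with it, which is why the check fails for both two-colo
layouts.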

You want to keep your writes as few as possible and as small as
possible. Why? Say you have 100ms latency between colos. Let's go
through a scenario for a client in a colo where the local servers are
not the leader (the ZK cluster leader).

read:
1) client reads a znode from local server
2) local server responds in about 1ms (usually < 1ms for "in colo" comm)

write:
1) client writes a znode to local server A
2) A proposes change to the ZK Leader (L) in remote colo
3) L gets the proposal in 100ms
4) L proposes the change to all followers
5) all followers (not exactly, but hopefully) get the proposal in 100ms
6) followers ack the change
7) L gets the acks in 100ms
8) L commits the change (message to all followers)
9) A gets the commit in 100ms
10) A responds to client (< 1ms)

write latency: 100 + 100 + 100 + 100 = 400ms
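
The same arithmetic as a small sketch, in case you want to plug in your
own link latencies. This is only a back-of-the-envelope model of the
scenario above (client attached to a follower, leader in a remote colo).
It ignores transfer time for the payload, disk syncs, and the sub-ms
local hops, and the names are invented for illustration.

```python
# Inter-colo hops on the write path described above, in order.
WRITE_HOPS = (
    "A forwards the proposal to L",
    "L sends the proposal to followers",
    "followers ack back to L",
    "L's commit reaches A",
)


def write_latency_ms(inter_colo_ms):
    """Estimated write latency: one inter-colo delay per hop."""
    return len(WRITE_HOPS) * inter_colo_ms


def read_latency_ms(local_ms=1.0):
    """Reads are served by the local server, so only the local hop counts."""
    return local_ms


if __name__ == "__main__":
    print("write:", write_latency_ms(100), "ms")
    print("read: ", read_latency_ms(), "ms")
```

So under these assumptions every write costs four inter-colo delays,
while a read costs none.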

Obviously keeping these writes small is also critical.

Patrick

Martin Waite wrote:
> Hi Ted,
> 
> If the links do not work for us for zk, then they are unlikely to work with
> any other solution - such as trying to stretch Pacemaker or Red Hat Cluster
> with their multicast protocols across the links.
> 
> If the links are not good enough, we might have to spend some more money to
> fix this.
> 
> regards,
> Martin
> 
> On 8 March 2010 02:14, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
>> If you can stand the latency for updates then zk should work well for you.
>> It is unlikely that you will be able to do better than zk does and still
>> maintain correctness.
>>
>> Do note that you can probably bias clients to use a local server. That
>> should make things more efficient.
>>
>> Sent from my iPhone
>>
>>
>> On Mar 7, 2010, at 3:00 PM, Mahadev Konar <mahadev@yahoo-inc.com> wrote:
>>
>>>> The inter-site links are a nuisance.  We have two data-centres with 100Mb
>>>> links which I hope would be good enough for most uses, but we need a 3rd
>>>> site - and currently that only has 2Mb links to the other sites.  This
>>>> might be a problem.
>>>>
>