Subject: Re: Efficient backup and a reasonable restore of an ensemble
From: Flavio Junqueira
Date: Mon, 8 Jul 2013 23:30:18 +0200
To: user@zookeeper.apache.org

One part that is still a bit confusing to me in your use case is whether
you need to take a snapshot right after some event in your application.
Even if you're able to tell ZooKeeper to take a snapshot, there is no
guarantee that it will happen at the exact point you want if update
operations keep coming.

If you use your four-letter word approach, would you search for the
leader or simply take a snapshot at any server? If it has to go through
the leader so that you are sure to capture the most recent committed
state, then it might not be a bad idea to have an API call that tells
the leader to take a snapshot in a directory of your choice. Reporting
the name of the snapshot file back to you, so that you can copy it,
sounds like an option, but perhaps it is not as convenient.
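Just to illustrate the leader-search part: with the existing four-letter
interface, a client could probe each server with "srvr" and look at the
Mode line it reports. A minimal sketch (the host names below are
placeholders for your ensemble's addresses):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;

public class FindLeader {
    public static void main(String[] args) throws Exception {
        // Placeholder addresses; substitute the members of your ensemble.
        String[] servers = {"zk1:2181", "zk2:2181", "zk3:2181",
                            "zk4:2181", "zk5:2181"};
        for (String server : servers) {
            String[] hostPort = server.split(":");
            try (Socket s = new Socket(hostPort[0],
                                       Integer.parseInt(hostPort[1]))) {
                // "srvr" makes the server dump its stats, including a
                // "Mode: leader|follower|standalone" line.
                s.getOutputStream().write("srvr".getBytes());
                s.getOutputStream().flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                for (String line; (line = in.readLine()) != null; ) {
                    if (line.trim().equals("Mode: leader")) {
                        System.out.println("leader is " + server);
                    }
                }
            }
        }
    }
}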
The approach of adding another server is not very clear to me. How do
you force it to be the leader? Keep in mind that if it crashes, it will
lose leadership.

-Flavio

On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov wrote:

> It looks like the "dev" mailing list is rather inactive. Over the past few
> days I only saw several automated emails from JIRA, and that is pretty much
> it. Contrary to this, the "user" mailing list seems to be more alive and
> more populated.
>
> With this in mind, please allow me to cross-post here the message I sent
> to the "dev" list a few days ago.
>
>
> Regards,
> /Sergey
>
> === forwarded message begins here ===
>
> Hi!
>
> I'm facing a problem that has been raised by multiple people, but none of
> the discussion threads seem to provide a good answer. I dug into the
> ZooKeeper source code trying to come up with some possible approaches, and
> I would like to get your input on them.
>
> Initial conditions:
>
> * I have an ensemble of five ZooKeeper servers running v3.4.5 code.
> * The size of a committed snapshot file is in the vicinity of 1GB.
> * There are about 80 clients connected to the ensemble.
> * Clients are heavily read-biased, i.e., they mostly read and rarely
> write. I would say less than 0.1% of queries modify the data.
>
> Problem statement:
>
> * Under certain conditions, I may need to revert the data stored in the
> ensemble to an earlier state. For example, one of the clients may ruin the
> application-level data integrity, and I need to perform a disaster
> recovery.
>
> Things look nice and easy if I'm dealing with a single ZooKeeper server. A
> file-level copy of the data and dataLog directories should allow me to
> recover later by stopping ZooKeeper, swapping the corrupted data and
> dataLog directories with a backup, and firing ZooKeeper back up.
>
> Now, the ensemble deployment and the leader election algorithm in the
> quorum make things much more difficult. In order to restore from a single
> file-level backup, I need to take the whole ensemble down, wipe out the
> data and dataLog directories on all servers, replace these directories
> with the backed-up content on one of the servers, bring this server up
> first, and then bring up the rest of the ensemble. This [somewhat]
> guarantees that the populated ZooKeeper server becomes a member of a
> majority and populates the ensemble. This approach works, but it is very
> involved and, thus, error-prone due to human error.
>
> Based on a study of the ZooKeeper source code, I am considering the
> following alternatives, and I seek advice from the ZooKeeper development
> community as to which approach looks more promising, or if there is a
> better way.
>
> Approach #1:
>
> Develop a complementary pair of utilities for export and import of the
> data. Both utilities will act as ZooKeeper clients and use the existing
> API. The "export" utility will recursively retrieve data and store it in a
> file. The "import" utility will first purge all data from the ensemble and
> then reload it from the file.
>
> This approach seems to be the simplest, and similar tools have been
> developed already. For example, the Guano Project:
> https://github.com/d2fn/guano
>
> I don't like two things about it:
> * Poor performance, even for a backup of a data store of my size.
> * Possible data consistency issues due to concurrent access by the export
> utility as well as other "normal" clients.
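> To make the export side concrete, here is roughly what I have in mind,
> using only the standard client API (the connect string and file name are
> placeholders, and this is a sketch rather than what Guano does). Note that
> it suffers from exactly the consistency caveat above, since the tree is
> read node by node while other clients may be writing:
>
> import java.io.DataOutputStream;
> import java.io.FileOutputStream;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public class Export {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, null);
>         try (DataOutputStream out = new DataOutputStream(
>                 new FileOutputStream("zk-backup.dump"))) {
>             dump(zk, "/", out);
>         } finally {
>             zk.close();
>         }
>     }
>
>     // Depth-first walk: write each path and its data. There is no
>     // point-in-time consistency; concurrent updates can interleave
>     // with the reads.
>     static void dump(ZooKeeper zk, String path, DataOutputStream out)
>             throws Exception {
>         byte[] data = zk.getData(path, false, new Stat());
>         out.writeUTF(path);
>         out.writeInt(data == null ? -1 : data.length);
>         if (data != null) out.write(data);
>         for (String child : zk.getChildren(path, false)) {
>             dump(zk, "/".equals(path) ? "/" + child : path + "/" + child,
>                  out);
>         }
>     }
> }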
> Approach #2:
>
> Add another four-letter command that would force the server to roll its
> transaction log and create a snapshot. The result of this command would be
> a new snapshot.XXXX file on disk, and the name of the file could be
> reported back to the client as a response to the four-letter command. This
> way, I would know which snapshot file to grab for a possible future
> restore. But restoring from a snapshot file is almost as involved as the
> error-prone sequence described above.
>
> Approach #3:
>
> Come up with a way to temporarily add a new ZooKeeper server to a live
> ensemble that would take over (how?) the leader role and push the snapshot
> that it has out to all ensemble members upon restore. This approach could
> be difficult and error-prone to implement because it will require hacking
> the existing election algorithm to designate a leader.
>
> So, which of the approaches do you think works best for an ensemble and
> for a database size of about 1GB?
>
>
> Any advice will be highly appreciated!
> /Sergey
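>
> P.S. For Approach #2, the server-side hook looks small from my reading of
> the 3.4 code. Below is a sketch modeled on the existing four-letter
> command handlers in NIOServerCnxn; the "snap" command itself is
> hypothetical, the class and method names are from my reading of the
> source, and I have not tested this:
>
> // Inside NIOServerCnxn, next to StatCommand, ConsCommand, etc.
> private class SnapCommand extends CommandThread {
>     public SnapCommand(PrintWriter pw) {
>         super(pw);
>     }
>
>     @Override
>     public void commandRun() {
>         if (zkServer == null) {
>             pw.println(ZK_NOT_SERVING);
>         } else {
>             // Serializes the in-memory tree to a new snapshot.<zxid>
>             // file. The snapshot is "fuzzy" with respect to in-flight
>             // updates, the same caveat as snapshots the server takes
>             // on its own.
>             zkServer.takeSnapshot();
>             pw.println("snapshot directory: "
>                     + zkServer.getTxnLogFactory().getSnapDir());
>         }
>     }
> }
> // ...plus a matching case in checkFourLetterWord() to recognize "snap"
> // and start the thread, as is done for the other commands.
>
> This takes the snapshot on whichever server receives the command, so it
> would still need the leader-location step discussed earlier in the thread.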