Subject: Re: Efficient backup and a reasonable restore of an ensemble
From: Sergey Maslyakov <evolvah@gmail.com>
To: user@zookeeper.apache.org
Date: Mon, 8 Jul 2013 23:42:16 -0500

Kishore,

This sounds like a very elaborate tool. I was trying to find a simplistic
approach, but what Thawan said about "fuzzy snapshots" makes me a little
afraid that there is no simple solution.


On Mon, Jul 8, 2013 at 11:05 PM, kishore g wrote:

> Agree, we already have such a tool. In fact, we use it to reconstruct the
> sequence of events that led to a failure, and actually restore the system
> to a previous stable point and replay the events. Unfortunately, this is
> tied closely to Helix, but it should be easy to make it a generic tool.
>
> Sergey, is this something that would be useful in your case?
>
> Thanks,
> Kishore G
>
>
> On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat wrote:
>
> > On the restore part, I think having a separate utility to manipulate the
> > data/snap dir (by truncating the log / removing snapshots down to a
> > given zxid) would be easier than modifying the server.
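> >
> > Roughly, the file-selection half of such a utility could just parse the
> > zxid suffixes of the files in the data dir. A minimal sketch (the class
> > and argument handling are my own illustration; truncating the last
> > surviving log segment at the target zxid is the harder part and is
> > omitted here):
> >
> >     import java.io.File;
> >
> >     // Drops snapshot.<zxid> and log.<zxid> files that begin past the
> >     // restore point. Run against <dataDir>/version-2 on a stopped server.
> >     public class TrimDataDir {
> >         public static void main(String[] args) {
> >             File dir = new File(args[0]);            // e.g. /data/zk/version-2
> >             long targetZxid = Long.parseLong(args[1], 16);
> >             File[] files = dir.listFiles();
> >             if (files == null) return;               // not a directory
> >             for (File f : files) {
> >                 String name = f.getName();
> >                 if (!name.startsWith("snapshot.") && !name.startsWith("log.")) {
> >                     continue;                        // skip unrelated files
> >                 }
> >                 long zxid = Long.parseLong(
> >                         name.substring(name.lastIndexOf('.') + 1), 16);
> >                 if (zxid > targetZxid) {
> >                     System.out.println("removing " + name);
> >                     f.delete();
> >                 }
> >             }
> >         }
> >     }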
> >
> > --
> > Thawan Kooburat
> >
> >
> > On 7/8/13 6:34 PM, "kishore g" wrote:
> >
> > >I think what we are looking at is point-in-time restore functionality.
> > >How about adding a feature that says "go back to a specific
> > >zxid/timestamp"? This way, before making any change to ZooKeeper, simply
> > >note down the timestamp/zxid on the leader. If things go wrong after
> > >making changes, bring down the ZooKeeper servers and provide an
> > >additional zxid/timestamp parameter while restarting. The server can go
> > >to that exact point and make it current. The followers can be started
> > >blank.
> > >
> > >
> > >On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat wrote:
> > >
> > >> Just saw that this is the corresponding use case to the question
> > >> posted on the dev list.
> > >>
> > >> In order to restore the data to a given point in time correctly, you
> > >> need both the snapshot and the txnlog. This is because a ZooKeeper
> > >> snapshot is fuzzy, and a snapshot alone may not represent a valid
> > >> state of the server if there are in-flight requests.
> > >>
> > >> The 4lw command should cause the server to roll the log and take a
> > >> snapshot, similar to the periodic snapshotting operation. Your backup
> > >> script needs to grab the snapshot and the corresponding txnlog file
> > >> from the data dir.
> > >>
> > >> To restore, just shut down all hosts, clear the data dir, copy over
> > >> the snapshot and txnlog, and restart them.
> > >>
> > >>
> > >> --
> > >> Thawan Kooburat
> > >>
> > >>
> > >> On 7/8/13 3:28 PM, "Sergey Maslyakov" wrote:
> > >>
> > >> >Thank you for your response, Flavio. I apologize; I did not provide
> > >> >a clear explanation of the use case.
> > >> >
> > >> >This backup/restore is not intended to be tied to any write event.
> > >> >Instead, it is expected to run as a periodic (daily?) cron job on one
> > >> >of the servers, which is not guaranteed to be the leader of the
> > >> >ensemble. There is no expectation that all recent changes are
> > >> >committed and persisted to disk. The system can sustain the loss of
> > >> >several hours' worth of recent changes in the event of a restore.
> > >> >
> > >> >As for finding the leader dynamically and performing the backup on
> > >> >it, this approach could be more difficult, as the leader can change
> > >> >from time to time and I still need to fetch the file to store it in
> > >> >my designated backup location. Taking the backup on one server and
> > >> >picking it up from a local file system looks less error-prone. Even
> > >> >if I went the fancy route and had ZooKeeper send me the serialized
> > >> >DataTree in response to the 4lw, this approach would involve a lot of
> > >> >moving parts.
> > >> >
> > >> >I have already made a PoC for a new 4lw that invokes takeSnapshot()
> > >> >and returns an absolute path to the snapshot it drops on disk. I have
> > >> >already protected takeSnapshot() from concurrent invocation, which is
> > >> >likely to corrupt the snapshot file on disk. This approach works, but
> > >> >I'm thinking of taking it one step further by providing the desired
> > >> >path name as an argument to my new 4lw and having the ZooKeeper
> > >> >server drop the snapshot into the specified file and report
> > >> >success/failure back. This way I can avoid cluttering the data
> > >> >directory and interfering with what ZooKeeper finds when it scans the
> > >> >data directory.
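> > >> >
> > >> >For what it's worth, the client side of this stays trivial:
> > >> >four-letter words are just plain text over the client port, so the
> > >> >backup script can drive the new command the same way it would drive
> > >> >"stat" or "ruok". A minimal sketch, where "snap" is only my
> > >> >placeholder name for the new 4lw and the host/port are examples:
> > >> >
> > >> >    import java.io.InputStream;
> > >> >    import java.net.Socket;
> > >> >    import java.nio.charset.StandardCharsets;
> > >> >
> > >> >    public class SnapCommandClient {
> > >> >        public static void main(String[] args) throws Exception {
> > >> >            try (Socket sock = new Socket("localhost", 2181)) {
> > >> >                // Send the (hypothetical) four-letter word.
> > >> >                sock.getOutputStream()
> > >> >                    .write("snap".getBytes(StandardCharsets.US_ASCII));
> > >> >                sock.shutdownOutput();
> > >> >                // Read the reply until the server closes the connection;
> > >> >                // the PoC answers with the absolute path of the snapshot.
> > >> >                InputStream in = sock.getInputStream();
> > >> >                StringBuilder reply = new StringBuilder();
> > >> >                byte[] buf = new byte[4096];
> > >> >                int n;
> > >> >                while ((n = in.read(buf)) > 0) {
> > >> >                    reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
> > >> >                }
> > >> >                System.out.println("snapshot file: " + reply.toString().trim());
> > >> >            }
> > >> >        }
> > >> >    }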
> > >> >
> > >> >The approach of having an additional server that would take over the
> > >> >leadership and populate the ensemble is just a theory. I don't see a
> > >> >clean way of making a particular quorum member the leader of the
> > >> >quorum. Am I overlooking something simple?
> > >> >
> > >> >In backing up and restoring an ensemble, the biggest unknown for me
> > >> >remains populating the ensemble with the desired data. I can think of
> > >> >two ways:
> > >> >
> > >> >1. Clear out all servers by stopping them, purge the version-2
> > >> >directories, restore a snapshot file on the one server that will be
> > >> >brought up first, and then bring up the rest of the ensemble. This
> > >> >way I somewhat force the first server to be the leader, because it
> > >> >has data and it will be the only member of the quorum with data,
> > >> >given the way I start the ensemble. This looks like a hack, though.
> > >> >
> > >> >2. Clear out the ensemble and reload it with a dedicated client using
> > >> >the provided ZooKeeper API.
> > >> >
> > >> >With the approach of backing up an actual snapshot file, option #1
> > >> >appears to be more practical.
> > >> >
> > >> >I wish I could start the ensemble with a designated leader that would
> > >> >bootstrap the ensemble with data, and then the ensemble would go
> > >> >about its normal business...
> > >> >
> > >> >
> > >> >On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira wrote:
> > >> >
> > >> >> One bit that is still a bit confusing to me in your use case is
> > >> >> whether you need to take a snapshot right after some event in your
> > >> >> application. Even if you're able to tell ZooKeeper to take a
> > >> >> snapshot, there is no guarantee that it will happen at the exact
> > >> >> point you want if update operations keep coming.
> > >> >>
> > >> >> If you use your four-letter-word approach, would you search for the
> > >> >> leader or would you simply take a snapshot at any server? If it has
> > >> >> to go through the leader so that you make sure to have the most
> > >> >> recent committed state, then it might not be a bad idea to have an
> > >> >> API call that tells the leader to take a snapshot in some directory
> > >> >> of your choice. Informing you of the name of the snapshot file so
> > >> >> that you can copy it sounds like an option, but perhaps it is not
> > >> >> as convenient.
> > >> >>
> > >> >> The approach of adding another server is not very clear. How do you
> > >> >> force it to be the leader? Keep in mind that if it crashes, it will
> > >> >> lose leadership.
> > >> >>
> > >> >> -Flavio
> > >> >>
> > >> >> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov wrote:
> > >> >>
> > >> >> > It looks like the "dev" mailing list is rather inactive. Over the
> > >> >> > past few days I only saw several automated emails from JIRA, and
> > >> >> > that is pretty much it. Contrary to this, the "user" mailing list
> > >> >> > seems to be more alive and more populated.
> > >> >> >
> > >> >> > With this in mind, please allow me to cross-post here the message
> > >> >> > I sent to the "dev" list a few days ago.
> > >> >> >
> > >> >> > Regards,
> > >> >> > /Sergey
> > >> >> >
> > >> >> > === forwarded message begins here ===
> > >> >> >
> > >> >> > Hi!
> > >> >> >
> > >> >> > I'm facing a problem that has been raised by multiple people, but
> > >> >> > none of the discussion threads seem to provide a good answer. I
> > >> >> > dug into the ZooKeeper source code trying to come up with some
> > >> >> > possible approaches, and I would like to get your input on them.
> > >> >> >
> > >> >> > Initial conditions:
> > >> >> >
> > >> >> > * I have an ensemble of five ZooKeeper servers running v3.4.5
> > >> >> > code.
> > >> >> > * The size of a committed snapshot file is in the vicinity of
> > >> >> > 1GB.
> > >> >> > * There are about 80 clients connected to the ensemble.
> > >> >> > * Clients are heavily read-biased, i.e., they mostly read and
> > >> >> > rarely write. I would say less than 0.1% of queries modify the
> > >> >> > data.
> > >> >> >
> > >> >> > Problem statement:
> > >> >> >
> > >> >> > * Under certain conditions, I may need to revert the data stored
> > >> >> > in the ensemble to an earlier state. For example, one of the
> > >> >> > clients may ruin the application-level data integrity, and I need
> > >> >> > to perform a disaster recovery.
> > >> >> >
> > >> >> > Things look nice and easy if I'm dealing with a single ZooKeeper
> > >> >> > server. A file-level copy of the data and dataLog directories
> > >> >> > should allow me to recover later by stopping ZooKeeper, swapping
> > >> >> > the corrupted data and dataLog directories with the backup, and
> > >> >> > firing ZooKeeper back up.
> > >> >> >
> > >> >> > Now, the ensemble deployment and the leader election algorithm in
> > >> >> > the quorum make things much more difficult. In order to restore
> > >> >> > from a single file-level backup, I need to take the whole
> > >> >> > ensemble down, wipe out the data and dataLog directories on all
> > >> >> > servers, replace these directories with the backed-up content on
> > >> >> > one of the servers, bring this server up first, and then bring up
> > >> >> > the rest of the ensemble. This [somewhat] guarantees that the
> > >> >> > populated ZooKeeper server becomes a member of a majority and
> > >> >> > populates the ensemble. This approach works, but it is very
> > >> >> > involved and, thus, error-prone due to human error.
> > >> >> >
> > >> >> > Based on a study of the ZooKeeper source code, I am considering
> > >> >> > the following alternatives, and I seek advice from the ZooKeeper
> > >> >> > development community as to which approach looks more promising,
> > >> >> > or whether there is a better way.
> > >> >> >
> > >> >> > Approach #1:
> > >> >> >
> > >> >> > Develop a complementary pair of utilities for export and import
> > >> >> > of the data. Both utilities will act as ZooKeeper clients and use
> > >> >> > the existing API. The "export" utility will recursively retrieve
> > >> >> > data and store it in a file. The "import" utility will first
> > >> >> > purge all data from the ensemble and then reload it from the
> > >> >> > file.
> > >> >> >
> > >> >> > This approach seems to be the simplest, and similar tools have
> > >> >> > been developed already; for example, the Guano Project:
> > >> >> > https://github.com/d2fn/guano
> > >> >> >
> > >> >> > I don't like two things about it:
> > >> >> > * Poor performance, even on a backup, for a data store of my
> > >> >> > size.
> > >> >> > * Possible data consistency issues due to concurrent access by
> > >> >> > the export utility as well as other "normal" clients.
> > >> >> >
> > >> >> > Approach #2:
> > >> >> >
> > >> >> > Add another four-letter command that would force rolling the
> > >> >> > transaction log and creating a snapshot. The result of this
> > >> >> > command would be a new snapshot.XXXX file on disk, and the name
> > >> >> > of the file could be reported back to the client as the response
> > >> >> > to the four-letter command. This way, I would know which snapshot
> > >> >> > file to grab for a possible future restore. But restoring from a
> > >> >> > snapshot file is almost as involved as the error-prone sequence
> > >> >> > described in the "Initial conditions" above.
> > >> >> >
> > >> >> > Approach #3:
> > >> >> >
> > >> >> > Come up with a way to temporarily add a new ZooKeeper server into
> > >> >> > a live ensemble that would take over (how?) the leader role and
> > >> >> > push the snapshot that it has out to all ensemble members upon
> > >> >> > restore. This approach could be difficult and error-prone to
> > >> >> > implement, because it would require hacking the existing election
> > >> >> > algorithm to designate a leader.
> > >> >> >
> > >> >> > So, which of these approaches do you think works best for an
> > >> >> > ensemble and for a database size of about 1GB?
> > >> >> >
> > >> >> > Any advice will be highly appreciated!
> > >> >> > /Sergey
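
A minimal sketch of the export half of Approach #1, using only the standard
ZooKeeper Java client (the connection string, session timeout, and output
format below are arbitrary illustrative choices, not part of any existing
tool):

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Walks the whole tree and prints each path with its data; an import
    // counterpart would read this back and recreate the znodes.
    public class ZkExport {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
            dump(zk, "/");
            zk.close();
        }

        static void dump(ZooKeeper zk, String path) throws Exception {
            Stat stat = new Stat();
            byte[] data = zk.getData(path, false, stat);
            System.out.println(path + " => "
                    + (data == null ? "" : new String(data, StandardCharsets.UTF_8)));
            for (String child : zk.getChildren(path, false)) {
                dump(zk, "/".equals(path) ? "/" + child : path + "/" + child);
            }
        }
    }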