zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Efficient backup and a reasonable restore of an ensemble
Date Tue, 09 Jul 2013 20:00:51 GMT
The snapshot may include none, some, or all of those 5 updates.  But the
logs from that time *will* include all 5.

Thus, if you apply that part of the log to the snapshot, each of those 5
log entries will either overwrite the latest value (with no effect) or
update the previous value to the latest value.  With each update either
present in the snapshot or not, there are 2^5 = 32 possible alternatives
here, but they all lead to the same final state, and that is exactly a
moment-in-time snapshot as of the final transaction.
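
A minimal sketch of the argument above, in Python. It is not ZooKeeper
code: the data tree is modelled as a plain dict, every transaction is
assumed to be a simple "set path to value" overwrite, and the names
(base, txns, apply_all, fuzzy_snapshots) are made up for illustration.
It builds every snapshot in which some subset of the 5 logged updates
has already landed, replays the full log over each one, and checks that
they all end in the same state.

```python
from itertools import combinations

# Hypothetical model: path -> value, and five logged "set" transactions.
base = {"/a": 0, "/b": 0, "/c": 0}
txns = [("/a", 1), ("/b", 1), ("/a", 2), ("/c", 1), ("/b", 2)]


def apply_all(tree, log):
    """Replay every log entry, in order, onto a copy of the tree."""
    result = dict(tree)
    for path, value in log:
        result[path] = value  # plain overwrite, so replaying twice is harmless
    return result


def fuzzy_snapshots(tree, log):
    """Yield one snapshot per subset of the logged updates already applied."""
    for k in range(len(log) + 1):
        for subset in combinations(range(len(log)), k):
            yield apply_all(tree, [log[i] for i in subset])


# State as of the final transaction, starting from a clean pre-update tree.
expected = apply_all(base, txns)

# Replay the complete log over every possible fuzzy snapshot.
snapshots = list(fuzzy_snapshots(base, txns))
restored = [apply_all(snap, txns) for snap in snapshots]
all_converge = all(r == expected for r in restored)

print(len(snapshots), all_converge)
```

Under these assumptions the script reports 32 starting snapshots, all of
which converge to the same final tree.  Real transactions also include
creates and deletes, which this sketch does not model.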

The term "fuzzy" refers to the fact that you can't say exactly what time
the original snapshot corresponds to.  In your example, the data in the
snapshot represents a combination of states from anywhere in the 30-second
window that it took to write the snapshot.



On Tue, Jul 9, 2013 at 9:02 AM, Sergey Maslyakov <evolvah@gmail.com> wrote:

> I think I am having difficulties understanding the "fuzzy" concept. Let's
> say I started to serialize DataTree into a snapshot file and it took 30
> seconds. During these 30 seconds, the server saw 5 transactions that
> updated the data. Does this mean that the snapshot that I get on disk at
> the end of the 30-second interval will have some of these 5 transactions?
> Or will it have none? Or will it have all of them? Or will it be
> inconsistent and unreadable by Zookeeper?
>
> Please help me better understand the behavior behind the "fuzzy" term.
>
> For my use case, I am perfectly fine if I get a snapshot with none of these
> 5 transactions, considering that I will pick them up next time I take a
> snapshot.
>
>
> /Sergey
>
>
> On Tue, Jul 9, 2013 at 12:08 AM, kishore g <g.kishore@gmail.com> wrote:
>
> > It's not really elaborate; it is very similar to what zookeeper does
> > when it starts up. It first reads the latest snapshot file and then the
> > transaction logs, and applies each and every transaction. What I am
> > suggesting is that, instead of applying all transactions, it stops at a
> > transaction I provide.
> >
> > Having this tool will actually simplify your task: you can go back to
> > any point in time. Think of something like this:
> >
> > checkpoint A  // this can store the last zxid or timestamp from the leader
> > make changes to zk
> > // if things fail
> > stop zks
> > rollback A    // run this on each zk; brings the cluster back to its previous state
> > start zks     // any order should be fine
> >
> >
> > Also keep in mind that a snapshot is fuzzy only if there are writes
> > happening while the snapshot is being taken. If you are sure no writes
> > will happen while you are taking the snapshot, then you are good.
> > Experts, please correct me if this is incorrect.
> >
> > thanks,
> > Kishore G
> >
> >
> > On Mon, Jul 8, 2013 at 9:42 PM, Sergey Maslyakov <evolvah@gmail.com> wrote:
> > >
> > > Kishore,
> > >
> > > This sounds like a very elaborate tool. I was trying to find a simple
> > > approach, but what Thawan said about "fuzzy snapshots" makes me a
> > > little afraid that there is no simple solution.
> > >
> > >
> > > On Mon, Jul 8, 2013 at 11:05 PM, kishore g <g.kishore@gmail.com> wrote:
> > > >
> > > > Agree, we already have such a tool. In fact, we use it to reconstruct
> > > > the sequence of events that led to a failure, restore the system to
> > > > a previous stable point, and replay the events. Unfortunately, it is
> > > > closely tied to Helix, but it should be easy to make it a generic
> > > > tool.
> > > >
> > > > Sergey, is this something that would be useful in your case?
> > > >
> > > > Thanks,
> > > > Kishore G
> > > >
> > > >
> > > > On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat <thawan@fb.com> wrote:
> > > > >
> > > > > On the restore part, I think having a separate utility to
> > > > > manipulate the data/snap dir (by truncating the log/removing
> > > > > snapshots to a given zxid) would be easier than modifying the
> > > > > server.
> > > > >
> > > > >
> > > > > --
> > > > > Thawan Kooburat
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 7/8/13 6:34 PM, "kishore g" <g.kishore@gmail.com> wrote:
> > > > >
> > > > > >I think what we are looking at is point-in-time restore
> > > > > >functionality. How about adding a feature that says "go back to a
> > > > > >specific zxid/timestamp"? This way, before making any change to
> > > > > >zookeeper, simply note down the timestamp/zxid on the leader. If
> > > > > >things go wrong after making changes, bring down the zookeepers
> > > > > >and provide an additional zxid/timestamp parameter while
> > > > > >restarting. The server can go to the exact point and make it
> > > > > >current. The followers can be started blank.
> > > > > >
> > > > > >
> > > > > >
> > > > > >On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat <thawan@fb.com> wrote:
> > > > > >
> > > > > >> Just saw that this is the use case corresponding to the question
> > > > > >> posted on the dev list.
> > > > > >>
> > > > > >> In order to restore the data to a given point in time correctly,
> > > > > >> you need both the snapshot and the txnlog. This is because a
> > > > > >> zookeeper snapshot is fuzzy, and a snapshot alone may not
> > > > > >> represent a valid state of the server if there are in-flight
> > > > > >> requests.
> > > > > >>
> > > > > >> The 4lw command should cause the server to roll the log and take
> > > > > >> a snapshot, similar to the periodic snapshotting operation. Your
> > > > > >> backup script needs to grab the snapshot and the corresponding
> > > > > >> txnlog file from the data dir.
> > > > > >>
> > > > > >> To restore, just shut down all hosts, clear the data dir, copy
> > > > > >> over the snapshot and txnlog, and restart them.
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Thawan Kooburat
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 7/8/13 3:28 PM, "Sergey Maslyakov" <evolvah@gmail.com> wrote:
> > > > > >>
> > > > > >> >Thank you for your response, Flavio. I apologize, I did not
> > > > > >> >provide a clear explanation of the use case.
> > > > > >> >
> > > > > >> >This backup/restore is not intended to be tied to any write
> > > > > >> >event; instead, it is expected to run as a periodic (daily?)
> > > > > >> >cron job on one of the servers, which is not guaranteed to be
> > > > > >> >the leader of the ensemble. There is no expectation that all
> > > > > >> >recent changes are committed and persisted to disk. The system
> > > > > >> >can sustain the loss of several hours' worth of recent changes
> > > > > >> >in the event of a restore.
> > > > > >> >
> > > > > >> >As for finding the leader dynamically and performing the backup
> > > > > >> >on it, this approach could be more difficult, as the leader can
> > > > > >> >change from time to time and I still need to fetch the file to
> > > > > >> >store it in my designated backup location. Taking the backup on
> > > > > >> >one server and picking it up from a local file system looks
> > > > > >> >less error-prone. Even if I went the fancy route and had
> > > > > >> >Zookeeper send me the serialized DataTree in response to the
> > > > > >> >4lw, this approach would involve a lot of moving parts.
> > > > > >> >
> > > > > >> >I have already made a PoC for a new 4lw that invokes
> > > > > >> >takeSnapshot() and returns an absolute path to the snapshot it
> > > > > >> >drops on disk. I have already protected takeSnapshot() from
> > > > > >> >concurrent invocation, which is likely to corrupt the snapshot
> > > > > >> >file on disk. This approach works, but I'm thinking of taking
> > > > > >> >it one step further by providing the desired path name as an
> > > > > >> >argument to my new 4lw and having the Zookeeper server drop the
> > > > > >> >snapshot into the specified file and report success/failure
> > > > > >> >back. This way I can avoid cluttering the data directory and
> > > > > >> >interfering with what Zookeeper finds when it scans the data
> > > > > >> >directory.
> > > > > >> >
> > > > > >> >The approach of having an additional server that would take
> > > > > >> >the leadership and populate the ensemble is just a theory. I
> > > > > >> >don't see a clean way of making a quorum member the leader of
> > > > > >> >the quorum. Am I overlooking something simple?
> > > > > >> >
> > > > > >> >In backup and restore of an ensemble, the biggest unknown for
> > > > > >> >me remains populating the ensemble with the desired data. I can
> > > > > >> >think of two ways:
> > > > > >> >
> > > > > >> >1. Clear out all servers by stopping them, purge the version-2
> > > > > >> >directories, restore a snapshot file on one server that will be
> > > > > >> >brought up first, and then bring up the rest of the ensemble.
> > > > > >> >This way I somewhat force the first server to be the leader
> > > > > >> >because it has data and it will be the only member of a quorum
> > > > > >> >with data, given the way I start the ensemble. This looks like
> > > > > >> >a hack, though.
> > > > > >> >
> > > > > >> >2. Clear out the ensemble and reload it with a dedicated client
> > > > > >> >using the provided Zookeeper API.
> > > > > >> >
> > > > > >> >With the approach of backing up an actual snapshot file, option
> > > > > >> >#1 appears to be more practical.
> > > > > >> >
> > > > > >> >I wish I could start the ensemble with a designated leader that
> > > > > >> >would bootstrap the ensemble with data, and then the ensemble
> > > > > >> >would go into its normal business...
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira <fpjunqueira@yahoo.com> wrote:
> > > > > >> >
> > > > > >> >> One bit that is still a bit confusing to me in your use case
> > > > > >> >> is whether you need to take a snapshot right after some event
> > > > > >> >> in your application. Even if you're able to tell ZooKeeper to
> > > > > >> >> take a snapshot, there is no guarantee that it will happen at
> > > > > >> >> the exact point you want it if update operations keep coming.
> > > > > >> >>
> > > > > >> >> If you use your four-letter word approach, then would you
> > > > > >> >> search for the leader or would you simply take a snapshot at
> > > > > >> >> any server? If it has to go through the leader so that you
> > > > > >> >> make sure to have the most recent committed state, then it
> > > > > >> >> might not be a bad idea to have an API call that tells the
> > > > > >> >> leader to take a snapshot at some directory of your choice.
> > > > > >> >> Informing you of the name of the snapshot file so that you
> > > > > >> >> can copy it sounds like an option, but perhaps it is not as
> > > > > >> >> convenient.
> > > > > >> >>
> > > > > >> >> The approach of adding another server is not very clear. How
> > > > > >> >> do you force it to be the leader? Keep in mind that if it
> > > > > >> >> crashes, then it will lose leadership.
> > > > > >> >>
> > > > > >> >> -Flavio
> > > > > >> >>
> > > > > >> >> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <evolvah@gmail.com> wrote:
> > > > > >> >>
> > > > > >> >> > It looks like the "dev" mailing list is rather inactive.
> > > > > >> >> > Over the past few days I only saw several automated emails
> > > > > >> >> > from JIRA and this is pretty much it. Contrary to this, the
> > > > > >> >> > "user" mailing list seems to be more alive and more
> > > > > >> >> > populated.
> > > > > >> >> >
> > > > > >> >> > With this in mind, please allow me to cross-post here the
> > > > > >> >> > message I sent to the "dev" list a few days ago.
> > > > > >> >> >
> > > > > >> >> >
> > > > > >> >> > Regards,
> > > > > >> >> > /Sergey
> > > > > >> >> >
> > > > > >> >> > === forwarded message begins here ===
> > > > > >> >> >
> > > > > >> >> > Hi!
> > > > > >> >> >
> > > > > >> >> > I'm facing a problem that has been raised by multiple
> > > > > >> >> > people, but none of the discussion threads seem to provide
> > > > > >> >> > a good answer. I dug into the Zookeeper source code trying
> > > > > >> >> > to come up with some possible approaches and I would like
> > > > > >> >> > to get your input on those.
> > > > > >> >> >
> > > > > >> >> > Initial conditions:
> > > > > >> >> >
> > > > > >> >> > * I have an ensemble of five Zookeeper servers running
> > > > > >> >> >   v3.4.5 code.
> > > > > >> >> > * The size of a committed snapshot file is in the vicinity
> > > > > >> >> >   of 1GB.
> > > > > >> >> > * There are about 80 clients connected to the ensemble.
> > > > > >> >> > * Clients are heavily read-biased, i.e., they mostly read
> > > > > >> >> >   and rarely write. I would say less than 0.1% of queries
> > > > > >> >> >   modify the data.
> > > > > >> >> >
> > > > > >> >> > Problem statement:
> > > > > >> >> >
> > > > > >> >> > * Under certain conditions, I may need to revert the data
> > > > > >> >> >   stored in the ensemble to an earlier state. For example,
> > > > > >> >> >   one of the clients may ruin the application-level data
> > > > > >> >> >   integrity and I need to perform a disaster recovery.
> > > > > >> >> >
> > > > > >> >> > Things look nice and easy if I'm dealing with a single
> > > > > >> >> > Zookeeper server. A file-level copy of the data and dataLog
> > > > > >> >> > directories should allow me to recover later by stopping
> > > > > >> >> > Zookeeper, swapping the corrupted data and dataLog
> > > > > >> >> > directories with a backup, and firing Zookeeper back up.
> > > > > >> >> >
> > > > > >> >> > Now, the ensemble deployment and the leader election
> > > > > >> >> > algorithm in the quorum make things much more difficult. In
> > > > > >> >> > order to restore from a single file-level backup, I need to
> > > > > >> >> > take the whole ensemble down, wipe out the data and dataLog
> > > > > >> >> > directories on all servers, replace these directories with
> > > > > >> >> > backed-up content on one of the servers, bring this server
> > > > > >> >> > up first, and then bring up the rest of the ensemble. This
> > > > > >> >> > [somewhat] guarantees that the populated Zookeeper server
> > > > > >> >> > becomes a member of a majority and populates the ensemble.
> > > > > >> >> > This approach works, but it is very involved and, thus,
> > > > > >> >> > prone to human error.
> > > > > >> >> >
> > > > > >> >> > Based on a study of the Zookeeper source code, I am
> > > > > >> >> > considering the following alternatives, and I seek advice
> > > > > >> >> > from the Zookeeper development community as to which
> > > > > >> >> > approach looks more promising or if there is a better way.
> > > > > >> >> >
> > > > > >> >> > Approach #1:
> > > > > >> >> >
> > > > > >> >> > Develop a complementary pair of utilities for export and
> > > > > >> >> > import of the data. Both utilities will act as Zookeeper
> > > > > >> >> > clients and use the existing API. The "export" utility will
> > > > > >> >> > recursively retrieve data and store it in a file. The
> > > > > >> >> > "import" utility will first purge all data from the
> > > > > >> >> > ensemble and then reload it from the file.
> > > > > >> >> >
> > > > > >> >> > This approach seems to be the simplest and there are
> > > > > >> >> > similar tools developed already. For example, the Guano
> > > > > >> >> > Project:
> > > > > >> >> > https://github.com/d2fn/guano
> > > > > >> >> >
> > > > > >> >> > I don't like two things about it:
> > > > > >> >> > * Poor performance even on a backup for the data store of
> > > > > >> >> >   my size.
> > > > > >> >> > * Possible data consistency issues due to concurrent access
> > > > > >> >> >   by the export utility as well as other "normal" clients.
> > > > > >> >> >
> > > > > >> >> > Approach #2:
> > > > > >> >> >
> > > > > >> >> > Add another four-letter command that would force rolling up
> > > > > >> >> > the transactions and creating a snapshot. The result of
> > > > > >> >> > this command would be a new snapshot.XXXX file on disk and
> > > > > >> >> > the name of the file could be reported back to the client
> > > > > >> >> > as a response to the four-letter command. This way, I would
> > > > > >> >> > know which snapshot file to grab for a possible future
> > > > > >> >> > restore. But restoring from a snapshot file is almost as
> > > > > >> >> > involved as the error-prone sequence described in the
> > > > > >> >> > "Initial conditions" above.
> > > > > >> >> >
> > > > > >> >> > Approach #3:
> > > > > >> >> >
> > > > > >> >> > Come up with a way to temporarily add a new Zookeeper
> > > > > >> >> > server into a live ensemble, that would overtake (how?) the
> > > > > >> >> > leader role and push out the snapshot that it has into all
> > > > > >> >> > ensemble members upon restore. This approach could be
> > > > > >> >> > difficult and error-prone to implement because it will
> > > > > >> >> > require hacking the existing election algorithm to
> > > > > >> >> > designate a leader.
> > > > > >> >> >
> > > > > >> >> > So, which of the approaches do you think works best for an
> > > > > >> >> > ensemble and for the database size of about 1GB?
> > > > > >> >> >
> > > > > >> >> >
> > > > > >> >> > Any advice will be highly appreciated!
> > > > > >> >> > /Sergey
> > > > > >> >>
> > > > > >> >>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
