Subject: Re: Efficient backup and a reasonable restore of an ensemble
From: Sergey Maslyakov <evolvah@gmail.com>
To: user@zookeeper.apache.org
Date: Mon, 8 Jul 2013 23:42:16 -0500

Kishore,

This sounds like a very elaborate tool. I was trying to find a simplistic
approach, but what Thawan said about "fuzzy snapshots" makes me a little
afraid that there is no simple solution.


On Mon, Jul 8, 2013 at 11:05 PM, kishore g wrote:

> Agree, we already have such a tool. In fact, we use it to reconstruct the
> sequence of events that led to a failure, and actually restore the system
> to a previous stable point and replay the events. Unfortunately, this is
> tied closely to Helix, but it should be easy to make it a generic tool.
>
> Sergey, is this something that would be useful in your case?
>
> Thanks,
> Kishore G
>
>
> On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat wrote:
>
> > On the restore part, I think having a separate utility to manipulate the
> > data/snap dir (by truncating the log / removing snapshots down to a
> > given zxid) would be easier than modifying the server.
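> >
> > Roughly, the file-selection half of such a utility could just parse the
> > zxid suffixes of the files in the data dir. A minimal sketch (the class
> > and argument handling are my own illustration; truncating the last
> > surviving log segment at the target zxid is the harder part and is
> > omitted here):
> >
> >     import java.io.File;
> >
> >     // Drops snapshot.<zxid> and log.<zxid> files that begin past the
> >     // restore point. Run against <dataDir>/version-2 on a stopped server.
> >     public class TrimDataDir {
> >         public static void main(String[] args) {
> >             File dir = new File(args[0]);            // e.g. /data/zk/version-2
> >             long targetZxid = Long.parseLong(args[1], 16);
> >             File[] files = dir.listFiles();
> >             if (files == null) return;               // not a directory
> >             for (File f : files) {
> >                 String name = f.getName();
> >                 if (!name.startsWith("snapshot.") && !name.startsWith("log.")) {
> >                     continue;                        // skip unrelated files
> >                 }
> >                 long zxid = Long.parseLong(
> >                         name.substring(name.lastIndexOf('.') + 1), 16);
> >                 if (zxid > targetZxid) {
> >                     System.out.println("removing " + name);
> >                     f.delete();
> >                 }
> >             }
> >         }
> >     }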
> >
> > --
> > Thawan Kooburat
> >
> >
> > On 7/8/13 6:34 PM, "kishore g" wrote:
> >
> > >I think what we are looking at is point-in-time restore functionality.
> > >How about adding a feature that says "go back to a specific
> > >zxid/timestamp"? This way, before making any change to ZooKeeper, simply
> > >note down the timestamp/zxid on the leader. If things go wrong after
> > >making changes, bring down the ZooKeeper servers and provide an
> > >additional zxid/timestamp parameter while restarting. The server can go
> > >to that exact point and make it current. The followers can be started
> > >blank.
> > >
> > >
> > >On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat wrote:
> > >
> > >> Just saw that this is the corresponding use case to the question
> > >> posted on the dev list.
> > >>
> > >> In order to restore the data to a given point in time correctly, you
> > >> need both the snapshot and the txnlog. This is because a ZooKeeper
> > >> snapshot is fuzzy, and a snapshot alone may not represent a valid
> > >> state of the server if there are in-flight requests.
> > >>
> > >> The 4lw command should cause the server to roll the log and take a
> > >> snapshot, similar to the periodic snapshotting operation. Your backup
> > >> script needs to grab the snapshot and the corresponding txnlog file
> > >> from the data dir.
> > >>
> > >> To restore, just shut down all hosts, clear the data dir, copy over
> > >> the snapshot and txnlog, and restart them.
> > >>
> > >>
> > >> --
> > >> Thawan Kooburat
> > >>
> > >>
> > >> On 7/8/13 3:28 PM, "Sergey Maslyakov" wrote:
> > >>
> > >> >Thank you for your response, Flavio. I apologize; I did not provide
> > >> >a clear explanation of the use case.
> > >> >
> > >> >This backup/restore is not intended to be tied to any write event.
> > >> >Instead, it is expected to run as a periodic (daily?) cron job on one
> > >> >of the servers, which is not guaranteed to be the leader of the
> > >> >ensemble. There is no expectation that all recent changes are
> > >> >committed and persisted to disk. The system can sustain the loss of
> > >> >several hours' worth of recent changes in the event of a restore.
> > >> >
> > >> >As for finding the leader dynamically and performing the backup on
> > >> >it, this approach could be more difficult, as the leader can change
> > >> >from time to time and I still need to fetch the file to store it in
> > >> >my designated backup location. Taking the backup on one server and
> > >> >picking it up from a local file system looks less error-prone. Even
> > >> >if I went the fancy route and had ZooKeeper send me the serialized
> > >> >DataTree in response to the 4lw, this approach would involve a lot of
> > >> >moving parts.
> > >> >
> > >> >I have already made a PoC for a new 4lw that invokes takeSnapshot()
> > >> >and returns an absolute path to the snapshot it drops on disk. I have
> > >> >already protected takeSnapshot() from concurrent invocation, which is
> > >> >likely to corrupt the snapshot file on disk. This approach works, but
> > >> >I'm thinking of taking it one step further by providing the desired
> > >> >path name as an argument to my new 4lw and having the ZooKeeper
> > >> >server drop the snapshot into the specified file and report
> > >> >success/failure back. This way I can avoid cluttering the data
> > >> >directory and interfering with what ZooKeeper finds when it scans the
> > >> >data directory.
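> > >> >
> > >> >For what it's worth, the client side of this stays trivial:
> > >> >four-letter words are just plain text over the client port, so the
> > >> >backup script can drive the new command the same way it would drive
> > >> >"stat" or "ruok". A minimal sketch, where "snap" is only my
> > >> >placeholder name for the new 4lw and the host/port are examples:
> > >> >
> > >> >    import java.io.InputStream;
> > >> >    import java.net.Socket;
> > >> >    import java.nio.charset.StandardCharsets;
> > >> >
> > >> >    public class SnapCommandClient {
> > >> >        public static void main(String[] args) throws Exception {
> > >> >            try (Socket sock = new Socket("localhost", 2181)) {
> > >> >                // Send the (hypothetical) four-letter word.
> > >> >                sock.getOutputStream()
> > >> >                    .write("snap".getBytes(StandardCharsets.US_ASCII));
> > >> >                sock.shutdownOutput();
> > >> >                // Read the reply until the server closes the connection;
> > >> >                // the PoC answers with the absolute path of the snapshot.
> > >> >                InputStream in = sock.getInputStream();
> > >> >                StringBuilder reply = new StringBuilder();
> > >> >                byte[] buf = new byte[4096];
> > >> >                int n;
> > >> >                while ((n = in.read(buf)) > 0) {
> > >> >                    reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
> > >> >                }
> > >> >                System.out.println("snapshot file: " + reply.toString().trim());
> > >> >            }
> > >> >        }
> > >> >    }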
> > >> >
> > >> >The approach of having an additional server that would take over the
> > >> >leadership and populate the ensemble is just a theory. I don't see a
> > >> >clean way of making a particular quorum member the leader of the
> > >> >quorum. Am I overlooking something simple?
> > >> >
> > >> >In backing up and restoring an ensemble, the biggest unknown for me
> > >> >remains populating the ensemble with the desired data. I can think of
> > >> >two ways:
> > >> >
> > >> >1. Clear out all servers by stopping them, purge the version-2
> > >> >directories, restore a snapshot file on the one server that will be
> > >> >brought up first, and then bring up the rest of the ensemble. This
> > >> >way I somewhat force the first server to be the leader, because it
> > >> >has data and it will be the only member of the quorum with data,
> > >> >given the way I start the ensemble. This looks like a hack, though.
> > >> >
> > >> >2. Clear out the ensemble and reload it with a dedicated client using
> > >> >the provided ZooKeeper API.
> > >> >
> > >> >With the approach of backing up an actual snapshot file, option #1
> > >> >appears to be more practical.
> > >> >
> > >> >I wish I could start the ensemble with a designated leader that would
> > >> >bootstrap the ensemble with data, and then the ensemble would go
> > >> >about its normal business...
> > >> >
> > >> >
> > >> >On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira wrote:
> > >> >
> > >> >> One bit that is still a bit confusing to me in your use case is
> > >> >> whether you need to take a snapshot right after some event in your
> > >> >> application. Even if you're able to tell ZooKeeper to take a
> > >> >> snapshot, there is no guarantee that it will happen at the exact
> > >> >> point you want if update operations keep coming.
> > >> >>
> > >> >> If you use your four-letter-word approach, would you search for the
> > >> >> leader or would you simply take a snapshot at any server? If it has
> > >> >> to go through the leader so that you make sure to have the most
> > >> >> recent committed state, then it might not be a bad idea to have an
> > >> >> API call that tells the leader to take a snapshot in some directory
> > >> >> of your choice. Informing you of the name of the snapshot file so
> > >> >> that you can copy it sounds like an option, but perhaps it is not
> > >> >> as convenient.
> > >> >>
> > >> >> The approach of adding another server is not very clear. How do you
> > >> >> force it to be the leader? Keep in mind that if it crashes, it will
> > >> >> lose leadership.
> > >> >>
> > >> >> -Flavio
> > >> >>
> > >> >> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov wrote:
> > >> >>
> > >> >> > It looks like the "dev" mailing list is rather inactive. Over the
> > >> >> > past few days I only saw several automated emails from JIRA, and
> > >> >> > that is pretty much it. Contrary to this, the "user" mailing list
> > >> >> > seems to be more alive and more populated.
> > >> >> >
> > >> >> > With this in mind, please allow me to cross-post here the message
> > >> >> > I sent to the "dev" list a few days ago.
> > >> >> >
> > >> >> > Regards,
> > >> >> > /Sergey
> > >> >> >
> > >> >> > === forwarded message begins here ===
> > >> >> >
> > >> >> > Hi!
> > >> >> >
> > >> >> > I'm facing a problem that has been raised by multiple people, but
> > >> >> > none of the discussion threads seem to provide a good answer. I
> > >> >> > dug into the ZooKeeper source code trying to come up with some
> > >> >> > possible approaches, and I would like to get your input on them.
> > >> >> >
> > >> >> > Initial conditions:
> > >> >> >
> > >> >> > * I have an ensemble of five ZooKeeper servers running v3.4.5
> > >> >> > code.
> > >> >> > * The size of a committed snapshot file is in the vicinity of
> > >> >> > 1GB.
> > >> >> > * There are about 80 clients connected to the ensemble.
> > >> >> > * Clients are heavily read-biased, i.e., they mostly read and
> > >> >> > rarely write. I would say less than 0.1% of queries modify the
> > >> >> > data.
> > >> >> >
> > >> >> > Problem statement:
> > >> >> >
> > >> >> > * Under certain conditions, I may need to revert the data stored
> > >> >> > in the ensemble to an earlier state. For example, one of the
> > >> >> > clients may ruin the application-level data integrity, and I need
> > >> >> > to perform a disaster recovery.
> > >> >> >
> > >> >> > Things look nice and easy if I'm dealing with a single ZooKeeper
> > >> >> > server. A file-level copy of the data and dataLog directories
> > >> >> > should allow me to recover later by stopping ZooKeeper, swapping
> > >> >> > the corrupted data and dataLog directories with the backup, and
> > >> >> > firing ZooKeeper back up.
> > >> >> >
> > >> >> > Now, the ensemble deployment and the leader election algorithm in
> > >> >> > the quorum make things much more difficult. In order to restore
> > >> >> > from a single file-level backup, I need to take the whole
> > >> >> > ensemble down, wipe out the data and dataLog directories on all
> > >> >> > servers, replace these directories with the backed-up content on
> > >> >> > one of the servers, bring this server up first, and then bring up
> > >> >> > the rest of the ensemble. This [somewhat] guarantees that the
> > >> >> > populated ZooKeeper server becomes a member of a majority and
> > >> >> > populates the ensemble. This approach works, but it is very
> > >> >> > involved and, thus, error-prone due to human error.
> > >> >> >
> > >> >> > Based on a study of the ZooKeeper source code, I am considering
> > >> >> > the following alternatives, and I seek advice from the ZooKeeper
> > >> >> > development community as to which approach looks more promising,
> > >> >> > or whether there is a better way.
> > >> >> >
> > >> >> > Approach #1:
> > >> >> >
> > >> >> > Develop a complementary pair of utilities for export and import
> > >> >> > of the data. Both utilities will act as ZooKeeper clients and use
> > >> >> > the existing API. The "export" utility will recursively retrieve
> > >> >> > data and store it in a file. The "import" utility will first
> > >> >> > purge all data from the ensemble and then reload it from the
> > >> >> > file.
> > >> >> >
> > >> >> > This approach seems to be the simplest, and similar tools have
> > >> >> > been developed already; for example, the Guano Project:
> > >> >> > https://github.com/d2fn/guano
> > >> >> >
> > >> >> > I don't like two things about it:
> > >> >> > * Poor performance, even on a backup, for a data store of my
> > >> >> > size.
> > >> >> > * Possible data consistency issues due to concurrent access by
> > >> >> > the export utility as well as other "normal" clients.
> > >> >> >
> > >> >> > Approach #2:
> > >> >> >
> > >> >> > Add another four-letter command that would force rolling the
> > >> >> > transaction log and creating a snapshot. The result of this
> > >> >> > command would be a new snapshot.XXXX file on disk, and the name
> > >> >> > of the file could be reported back to the client as the response
> > >> >> > to the four-letter command. This way, I would know which snapshot
> > >> >> > file to grab for a possible future restore. But restoring from a
> > >> >> > snapshot file is almost as involved as the error-prone sequence
> > >> >> > described in the "Initial conditions" above.
> > >> >> >
> > >> >> > Approach #3:
> > >> >> >
> > >> >> > Come up with a way to temporarily add a new ZooKeeper server into
> > >> >> > a live ensemble that would take over (how?) the leader role and
> > >> >> > push the snapshot that it has out to all ensemble members upon
> > >> >> > restore. This approach could be difficult and error-prone to
> > >> >> > implement, because it would require hacking the existing election
> > >> >> > algorithm to designate a leader.
> > >> >> >
> > >> >> > So, which of these approaches do you think works best for an
> > >> >> > ensemble and for a database size of about 1GB?
> > >> >> >
> > >> >> > Any advice will be highly appreciated!
> > >> >> > /Sergey
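
A minimal sketch of the export half of Approach #1, using only the standard
ZooKeeper Java client (the connection string, session timeout, and output
format below are arbitrary illustrative choices, not part of any existing
tool):

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Walks the whole tree and prints each path with its data; an import
    // counterpart would read this back and recreate the znodes.
    public class ZkExport {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
            dump(zk, "/");
            zk.close();
        }

        static void dump(ZooKeeper zk, String path) throws Exception {
            Stat stat = new Stat();
            byte[] data = zk.getData(path, false, stat);
            System.out.println(path + " => "
                    + (data == null ? "" : new String(data, StandardCharsets.UTF_8)));
            for (String child : zk.getChildren(path, false)) {
                dump(zk, "/".equals(path) ? "/" + child : path + "/" + child);
            }
        }
    }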