From dev-return-11253-apmail-couchdb-dev-archive=couchdb.apache.org@couchdb.apache.org Wed Aug 11 17:05:51 2010
Mailing-List: contact dev-help@couchdb.apache.org; run by ezmlm
Reply-To: dev@couchdb.apache.org
MIME-Version: 1.0
Sender: fdmanana@gmail.com
In-Reply-To: <486C68C5-9623-46A2-9681-CBC2BEC56BDA@apache.org>
References: <1690416A-4C01-4756-9D3B-A256DC729813@apache.org>
 <154AD543-C787-441C-851B-D59CEA6765CC@apache.org>
 <5F47BBB4-9F58-4EFE-92C8-B0FEDA5B01B7@apache.org>
 <12229601-B7B8-4E98-931E-054DA00C5092@apache.org>
 <20100810130338.GA2584@two>
 <9A625192-F6F5-4AF4-A71E-BE0082789AA5@apache.org>
 <69F9CA20-2EE8-4AA0-9D4B-084EB994D920@apache.org>
 <594EF248-98DE-4F10-9C8F-2083EA2DEBE0@apache.org>
 <9A34A746-AED9-4FA5-A60E-A40877681C71@apache.org>
 <486C68C5-9623-46A2-9681-CBC2BEC56BDA@apache.org>
Date: Wed, 11 Aug 2010 18:05:20 +0100
Subject: Re: data recovery tool progress
From: Filipe David Manana
To: dev@couchdb.apache.org
Content-Type: multipart/alternative; boundary=001485ed57482fa3af048d8f41be

--001485ed57482fa3af048d8f41be
Content-Type: text/plain; charset=UTF-8

On Wed, Aug 11, 2010 at 3:52 AM, Adam Kocoloski wrote:

> Excellent, thanks for testing. I caught Jason Smith saying on IRC that he
> had packaged the whole thing up as an escript + some .beams. If we can get
> it down to a single file a la rebar that would be a pretty sweet way to
> deliver the repair tool in my opinion.
>

+1

>
> Adam
>
> On Aug 10, 2010, at 10:40 PM, Mikeal Rogers wrote:
>
> > Ok, latest code has been tested against every db that I have and it works
> > great.
> >
> > What are our next steps here?
> >
> > I'd like to get this out to all the people who didn't feel comfortable
> > sending me their db to test against before we release it more widely.
> >
> > -Mikeal
> >
> > On Tue, Aug 10, 2010 at 6:11 PM, Mikeal Rogers wrote:
> >
> >> Found one issue, we weren't picking up design docs because it didn't have
> >> admin privileges.
> >>
> >> Adam fixed it and pushed and I've verified that it works now.
> >>
> >> I wrote a little node script to show all recovered documents and expose any
> >> documents that didn't make it in to lost+found.
> >>
> >> http://github.com/mikeal/couchtest/blob/master/validate.js
> >>
> >> Requires request, `npm install request`.
> >>
> >> I'm now running recover on all the test db's I have and running the
> >> validation script against them.
> >>
> >> -Mikeal
> >>
> >>
> >> On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers wrote:
> >>
> >>> I have some timing numbers for the new code.
> >>>
> >>> multi_conflict has 200 lost documents and 201 documents total after
> >>> recovery.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
> >>> {25217069,ok}
> >>> 25 seconds
> >>>
> >>> Something funky is going on here. Investigating.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found,
> >>> ["multi_conflict_with_attach"]).
> >>> {654782,ok}
> >>> .6 seconds
> >>>
> >>> This db has 124969 documents in it.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
> >>> {1381969304,ok}
> >>> 23 minutes
> >>>
> >>> This database is about 500 megs, with 46660 documents before recovery and
> >>> 46801 after.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
> >>> {2329669113,ok}
> >>> 38.8 minutes
> >>>
> >>> -Mikeal
> >>>
> >>> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski wrote:
> >>>
> >>>> Good idea.
> >>>> Now we've got
> >>>>
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 1380102
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 331526
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526 bytes at 0
> >>>>> [info] [<0.33.0>] couch_db_repair writing 12 updates to lost+found/testwritesdb
> >>>>> [info] [<0.33.0>] couch_db_repair writing 9 updates to lost+found/testwritesdb
> >>>>> [info] [<0.33.0>] couch_db_repair writing 8 updates to lost+found/testwritesdb
> >>>>
> >>>> Adam
> >>>>
> >>>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
> >>>>
> >>>>> It took 20 minutes before the first 'update' line came out, but now
> >>>>> seems to be recovering smoothly. machine load is back down to sane
> >>>>> levels.
> >>>>>
> >>>>> Suggest feedback during the hunting phase.
> >>>>>
> >>>>> B.
> >>>>>
> >>>>> On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski wrote:
> >>>>>> Thanks for the crosscheck. I'm not aware of anything in the node finder
> >>>>>> that would cause it to struggle mightily with healthy DBs. It pretty
> >>>>>> much ignores the health of the DB, in fact. Would be interested to hear
> >>>>>> more.
> >>>>>>
> >>>>>> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
> >>>>>>
> >>>>>>> I verified the new code's ability to repair the testwritesdb. system
> >>>>>>> load was smooth from start to finish.
> >>>>>>>
> >>>>>>> I started a further test on a different (healthy) database and system
> >>>>>>> load was severe again, just collecting the roots (the lost+found db
> >>>>>>> was not yet created when I aborted the attempt). I suspect the fact
> >>>>>>> that it's healthy is the issue, so if I'm right, perhaps a warning is
> >>>>>>> useful.
> >>>>>>>
> >>>>>>> B.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
> >>>>>>>> Another update. This morning I took a different tack and, rather than
> >>>>>>>> try to find root nodes, I just looked for all kv_nodes in the file and
> >>>>>>>> treated each of those as a separate virtual DB to be replicated. This
> >>>>>>>> reduces the algorithmic complexity of the repair, and it looks like
> >>>>>>>> testwritesdb repairs in ~30 minutes or so. Also, this method results in
> >>>>>>>> the lost+found DB containing every document, not just the missing ones.
> >>>>>>>>
> >>>>>>>> My branch does not currently include Randall's parallelization of the
> >>>>>>>> replications. It's still CPU-limited, so that may be a worthwhile
> >>>>>>>> optimization. On the other hand, I think we may be reaching a stage at
> >>>>>>>> which performance for this repair tool is 'good enough', and pmaps can
> >>>>>>>> make error handling a bit dicey.
> >>>>>>>>
> >>>>>>>> In short, I think this tool is now in good shape.
> >>>>>>>>
> >>>>>>>> http://github.com/kocolosk/couchdb/tree/db_repair
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >

-- 
Filipe David Manana,
fdmanana@gmail.com, fdmanana@apache.org

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

--001485ed57482fa3af048d8f41be--