From: Adam Kocoloski
To: dev@couchdb.apache.org
Subject: Re: data recovery tool progress
Date: Tue, 10 Aug 2010 22:52:13 -0400
Message-Id: <486C68C5-9623-46A2-9681-CBC2BEC56BDA@apache.org>

Excellent, thanks for testing. I caught Jason Smith saying on IRC that he
had packaged the whole thing up as an escript + some .beams. If we can get
it down to a single file a la rebar, that would be a pretty sweet way to
deliver the repair tool, in my opinion.

Adam

On Aug 10, 2010, at 10:40 PM, Mikeal Rogers wrote:

> Ok, latest code has been tested against every db that I have and it works
> great.
>
> What are our next steps here?
>
> I'd like to get this out to all the people who didn't feel comfortable
> sending me their db to test against before we release it more widely.
>
> -Mikeal
>
> On Tue, Aug 10, 2010 at 6:11 PM, Mikeal Rogers wrote:
>
>> Found one issue: we weren't picking up design docs because it didn't have
>> admin privileges.
>>
>> Adam fixed it and pushed, and I've verified that it works now.
>>
>> I wrote a little node script to show all recovered documents and expose
>> any documents that didn't make it into lost+found.
>>
>> http://github.com/mikeal/couchtest/blob/master/validate.js
>>
>> Requires request, `npm install request`.
>>
>> I'm now running recover on all the test dbs I have and running the
>> validation script against them.
>>
>> -Mikeal
>>
>>
>> On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers wrote:
>>
>>> I have some timing numbers for the new code.
>>>
>>> multi_conflict has 200 lost documents and 201 documents total after
>>> recovery.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
>>> {25217069,ok}
>>> 25 seconds
>>>
>>> Something funky is going on here. Investigating.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found,
>>> ["multi_conflict_with_attach"]).
>>> {654782,ok}
>>> .6 seconds
>>>
>>> This db has 124969 documents in it.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
>>> {1381969304,ok}
>>> 23 minutes
>>>
>>> This database is about 500 MB and has 46660 documents before recovery
>>> and 46801 after.
>>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
>>> {2329669113,ok}
>>> 38.8 minutes
>>>
>>> -Mikeal
>>>
>>> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski wrote:
>>>
>>>> Good idea. Now we've got
>>>>
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 1380102
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 331526
>>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526 bytes at 0
>>>>> [info] [<0.33.0>] couch_db_repair writing 12 updates to lost+found/testwritesdb
>>>>> [info] [<0.33.0>] couch_db_repair writing 9 updates to lost+found/testwritesdb
>>>>> [info] [<0.33.0>] couch_db_repair writing 8 updates to lost+found/testwritesdb
>>>>
>>>> Adam
>>>>
>>>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
>>>>
>>>>> It took 20 minutes before the first 'update' line came out, but now
>>>>> it seems to be recovering smoothly. Machine load is back down to sane
>>>>> levels.
>>>>>
>>>>> Suggest feedback during the hunting phase.
>>>>>
>>>>> B.
>>>>>
>>>>> On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski wrote:
>>>>>> Thanks for the crosscheck. I'm not aware of anything in the node
>>>>>> finder that would cause it to struggle mightily with healthy DBs. It
>>>>>> pretty much ignores the health of the DB, in fact. Would be
>>>>>> interested to hear more.
>>>>>>
>>>>>> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
>>>>>>
>>>>>>> I verified the new code's ability to repair the testwritesdb. System
>>>>>>> load was smooth from start to finish.
>>>>>>>
>>>>>>> I started a further test on a different (healthy) database and system
>>>>>>> load was severe again, just collecting the roots (the lost+found db
>>>>>>> was not yet created when I aborted the attempt). I suspect the fact
>>>>>>> that it's healthy is the issue, so if I'm right, perhaps a warning is
>>>>>>> useful.
>>>>>>>
>>>>>>> B.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski wrote:
>>>>>>>> Another update. This morning I took a different tack and, rather
>>>>>>>> than try to find root nodes, I just looked for all kv_nodes in the
>>>>>>>> file and treated each of those as a separate virtual DB to be
>>>>>>>> replicated. This reduces the algorithmic complexity of the repair,
>>>>>>>> and it looks like testwritesdb repairs in ~30 minutes or so. Also,
>>>>>>>> this method results in the lost+found DB containing every document,
>>>>>>>> not just the missing ones.
>>>>>>>>
>>>>>>>> My branch does not currently include Randall's parallelization of
>>>>>>>> the replications. It's still CPU-limited, so that may be a
>>>>>>>> worthwhile optimization. On the other hand, I think we may be
>>>>>>>> reaching a stage at which performance for this repair tool is 'good
>>>>>>>> enough', and pmaps can make error handling a bit dicey.
>>>>>>>>
>>>>>>>> In short, I think this tool is now in good shape.
>>>>>>>>
>>>>>>>> http://github.com/kocolosk/couchdb/tree/db_repair
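
The timing figures quoted in the thread come from Erlang's `timer:tc`, which reports elapsed time in microseconds. As a quick cross-check of the hand-converted numbers, here is a minimal Python sketch. The helper name `format_duration` and the one-decimal rounding are assumptions made for illustration; nothing here is part of couch_db_repair.

```python
def format_duration(microseconds: int) -> str:
    """Render a timer:tc-style microsecond count as seconds or minutes."""
    seconds = microseconds / 1_000_000
    if seconds < 60:
        return f"{seconds:.1f} seconds"
    return f"{seconds / 60:.1f} minutes"


# Microsecond results copied from the timer:tc calls quoted in the thread.
timings = {
    "multi_conflict": 25217069,
    "multi_conflict_with_attach": 654782,
    "testwritesdb": 1381969304,
    "prod": 2329669113,
}

if __name__ == "__main__":
    for db_name, elapsed in timings.items():
        print(f"{db_name}: {format_duration(elapsed)}")
```

The results agree with the figures in the thread up to rounding (the thread truncates 0.65 s to ".6 seconds", while this sketch rounds it).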
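
The "scanning N bytes at OFFSET" log lines quoted in the thread are consistent with a scan that walks backwards through the file in 1048576-byte windows, with a shorter final window ending at offset 0. The following Python sketch reproduces just that offset arithmetic. It is inferred only from the three log lines: the function name `backward_windows` and the starting position 2428678 (the first logged offset plus one window) are assumptions, and this is not the couch_db_repair implementation.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 1048576  # window size seen in the quoted log lines


def backward_windows(end: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) windows covering [0, end), last window first."""
    while end > 0:
        start = max(0, end - chunk_size)
        yield start, end - start
        end = start


if __name__ == "__main__":
    # Hypothetical starting position: 1380102 + 1048576.
    for offset, length in backward_windows(2428678):
        print(f"scanning {length} bytes at {offset}")
```

Starting from that assumed position, the windows match the three offsets and lengths in the log excerpt.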