Date: Mon, 9 Aug 2010 18:50:26 -0700
Subject: Re: data recovery tool progress
From: Mikeal Rogers
To: dev@couchdb.apache.org

I pulled down the latest code from Adam's branch @
7080ff72baa329cf6c4be2a79e71a41f744ed93b.

Running timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
on a database with 200 lost updates spanning 200 restarts (
http://github.com/mikeal/couchtest/blob/master/multi_conflict.couch ) took
about 101 seconds.

I tried running against a larger database (
http://github.com/mikeal/couchtest/blob/master/testwritesdb.couch ) and I
got this exception:

http://gist.github.com/516491

-Mikeal

On Mon, Aug 9, 2010 at 6:09 PM, Randall Leeds wrote:

> Summing up what went on in IRC for those who were absent.
>
> The latest progress is on Adam's branch at
> http://github.com/kocolosk/couchdb/tree/db_repair
>
> couch_db_repair:make_lost_and_found/1 attempts to create a new
> lost+found/DbName database to which it merges all nodes not accessible
> from anywhere (any other node found in a full file scan or any header
> pointers).
>
> Currently, make_lost_and_found uses Volker's repair (from
> couch_db_repair_b module, also in Adam's branch).
> Adam found that the bottleneck was couch_file calls and that the
> repair process was taking a very long time so he added
> couch_db_repair:find_nodes_quickly/1 that reads 1MB chunks as binary
> and tries to process it to find nodes instead of scanning back one
> byte at a time. It is currently not hooked up to the repair mechanism.
>
> Making progress. Go team.
>
> On Mon, Aug 9, 2010 at 13:52, Mikeal Rogers wrote:
> > jchris suggested on IRC that I try a normal doc update and see if that
> > fixes it.
> >
> > It does. After a new doc was created the dbinfo doc count was back to
> > normal.
> >
> > -Mikeal
> >
> > On Mon, Aug 9, 2010 at 1:39 PM, Mikeal Rogers wrote:
> >
> >> Ok, I pulled down this code and tested against a database with a ton of
> >> missing writes right before a single restart.
> >>
> >> Before restart this was the database:
> >>
> >> {
> >>   db_name: "testwritesdb"
> >>   doc_count: 124969
> >>   doc_del_count: 0
> >>   update_seq: 124969
> >>   purge_seq: 0
> >>   compact_running: false
> >>   disk_size: 54857478
> >>   instance_start_time: "1281384140058211"
> >>   disk_format_version: 5
> >> }
> >>
> >> After restart it was this:
> >>
> >> {
> >>   db_name: "testwritesdb"
> >>   doc_count: 1
> >>   doc_del_count: 0
> >>   update_seq: 1
> >>   purge_seq: 0
> >>   compact_running: false
> >>   disk_size: 54857478
> >>   instance_start_time: "1281384593876026"
> >>   disk_format_version: 5
> >> }
> >>
> >> After repair, it's this:
> >>
> >> {
> >>   db_name: "testwritesdb"
> >>   doc_count: 1
> >>   doc_del_count: 0
> >>   update_seq: 124969
> >>   purge_seq: 0
> >>   compact_running: false
> >>   disk_size: 54857820
> >>   instance_start_time: "1281385990193289"
> >>   disk_format_version: 5
> >>   committed_update_seq: 124969
> >> }
> >>
> >> All the sequences are there and hitting _all_docs shows all the documents
> >> so why is the doc_count only 1 in the dbinfo?
> >>
> >> -Mikeal
> >>
> >> On Mon, Aug 9, 2010 at 11:53 AM, Filipe David Manana <fdmanana@apache.org> wrote:
> >>
> >>> For the record (and people not on IRC), the code at:
> >>>
> >>> http://github.com/fdmanana/couchdb/commits/db_repair
> >>>
> >>> is working for at least simple cases. Use
> >>> couch_db_repair:repair(DbNameAsString).
> >>> There's one TODO: update the reduce values for the by_seq and by_id
> >>> BTrees.
> >>>
> >>> If anyone wants to give some help on this, you're welcome.
> >>>
> >>> On Mon, Aug 9, 2010 at 6:12 PM, Mikeal Rogers wrote:
> >>>
> >>> > I'm starting to create a bunch of test db files that expose this bug
> >>> > under different conditions like multiple restarts, across compaction,
> >>> > variances in updates that might cause conflict, etc.
> >>> >
> >>> > http://github.com/mikeal/couchtest
> >>> >
> >>> > The README outlines what was done to the db's and what needs to be
> >>> > recovered.
> >>> >
> >>> > -Mikeal
> >>> >
> >>> > On Mon, Aug 9, 2010 at 9:33 AM, Filipe David Manana <fdmanana@apache.org> wrote:
> >>> >
> >>> > > On Mon, Aug 9, 2010 at 5:22 PM, Robert Newson <robert.newson@gmail.com> wrote:
> >>> > >
> >>> > > > Doesn't this bit;
> >>> > > >
> >>> > > > - Db#db{waiting_delayed_commit=nil};
> >>> > > > + Db;
> >>> > > > + % Db#db{waiting_delayed_commit=nil};
> >>> > > >
> >>> > > > revert the bug fix?
> >>> > > >
> >>> > >
> >>> > > That's intentional, for my local testing.
> >>> > > That patch obviously isn't anything close to final; it's still too
> >>> > > experimental.
> >>> > >
> >>> > > >
> >>> > > > B.
> >>> > > >
> >>> > > > On Mon, Aug 9, 2010 at 5:09 PM, Jan Lehnardt wrote:
> >>> > > > > Hi All,
> >>> > > > >
> >>> > > > > Filipe jumped in to start working on the recovery tool, but he
> >>> > > > > isn't done yet.
> >>> > > > >
> >>> > > > > Here's the current patch:
> >>> > > > >
> >>> > > > > http://www.friendpaste.com/4uMngrym4r7Zz4R0ThSHbz
> >>> > > > >
> >>> > > > > it is not done and very early, but any help on this is greatly
> >>> > > > > appreciated.
> >>> > > > >
> >>> > > > > The current state is (in Filipe's words):
> >>> > > > > - i can detect that a file needs repair
> >>> > > > > - and get the last btree roots from it
> >>> > > > > - "only" missing: get last db seq num
> >>> > > > > - write new header
> >>> > > > > - and deal with the local docs btree (if exists)
> >>> > > > >
> >>> > > > > Thanks!
> >>> > > > > Jan
> >>> > > > > --
> >>> > > > >
> >>> > >
> >>> > > --
> >>> > > Filipe David Manana,
> >>> > > fdmanana@apache.org
> >>> > >
> >>> > > "Reasonable men adapt themselves to the world.
> >>> > > Unreasonable men adapt the world to themselves.
> >>> > > That's why all progress depends on unreasonable men."
> >>> > >
> >>>
> >>> --
> >>> Filipe David Manana,
> >>> fdmanana@apache.org
> >>>
> >>> "Reasonable men adapt themselves to the world.
> >>> Unreasonable men adapt the world to themselves.
> >>> That's why all progress depends on unreasonable men."
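
The quoted summary says couch_db_repair:find_nodes_quickly/1 reads 1MB chunks as binary instead of scanning back one byte at a time. The general idea can be sketched as below. This is only an illustration written in Python, not the Erlang code on Adam's branch: the function name find_marker_offsets, the b"NODE" marker and the forward scan direction are all assumptions made for the example, and CouchDB's real on-disk node format is not modeled.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size mentioned in the thread


def find_marker_offsets(stream, marker, chunk_size=CHUNK_SIZE):
    """Return every absolute offset at which `marker` occurs in `stream`.

    Hypothetical helper: the stream is read in large chunks, and the last
    len(marker) - 1 bytes of each window are carried over so that a marker
    straddling a chunk boundary is still found.
    """
    if not marker:
        raise ValueError("marker must be non-empty")
    offsets = []
    carry = b""  # tail of the previous window
    base = 0     # absolute offset of the first byte of `carry`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = carry + chunk
        pos = window.find(marker)
        while pos != -1:
            offsets.append(base + pos)
            pos = window.find(marker, pos + 1)
        keep = len(marker) - 1
        carry = window[-keep:] if keep else b""
        base += len(window) - len(carry)
    return offsets


# Made-up data; a tiny chunk size forces markers to straddle chunk boundaries.
data = b"ab" + b"NODE" + b"cdef" + b"NODE" + b"g"
print(find_marker_offsets(io.BytesIO(data), b"NODE", chunk_size=3))  # prints [2, 10]
```

The point of the sketch is that each read covers many candidate positions at once, so the number of file reads drops from one per byte to one per chunk. The carried-over tail is shorter than the marker, so no match is ever reported twice.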