From: Randall Leeds
To: dev@couchdb.apache.org
Date: Mon, 9 Aug 2010 18:09:15 -0700
Subject: Re: data recovery tool progress

Summing up what went on in IRC for those who were absent.

The latest progress is on Adam's branch at
http://github.com/kocolosk/couchdb/tree/db_repair

couch_db_repair:make_lost_and_found/1 attempts to create a new
lost+found/DbName database into which it merges all nodes not accessible
from anywhere else (that is, from any other node found in a full file scan
or from any header pointer).

Currently, make_lost_and_found uses Volker's repair (from the
couch_db_repair_b module, also in Adam's branch).

Adam found that the bottleneck was couch_file calls and that the repair
process was taking a very long time, so he added
couch_db_repair:find_nodes_quickly/1, which reads 1MB chunks as binary and
tries to process them to find nodes instead of scanning back one byte at a
time. It is currently not hooked up to the repair mechanism.

Making progress. Go team.

On Mon, Aug 9, 2010 at 13:52, Mikeal Rogers wrote:

> jchris suggested on IRC that I try a normal doc update and see if that
> fixes it.
>
> It does. After a new doc was created the dbinfo doc count was back to
> normal.
>
> -Mikeal
>
> On Mon, Aug 9, 2010 at 1:39 PM, Mikeal Rogers wrote:
>
>> Ok, I pulled down this code and tested against a database with a ton of
>> missing writes right before a single restart.
>>
>> Before restart this was the database:
>>
>>   {
>>     db_name: "testwritesdb"
>>     doc_count: 124969
>>     doc_del_count: 0
>>     update_seq: 124969
>>     purge_seq: 0
>>     compact_running: false
>>     disk_size: 54857478
>>     instance_start_time: "1281384140058211"
>>     disk_format_version: 5
>>   }
>>
>> After restart it was this:
>>
>>   {
>>     db_name: "testwritesdb"
>>     doc_count: 1
>>     doc_del_count: 0
>>     update_seq: 1
>>     purge_seq: 0
>>     compact_running: false
>>     disk_size: 54857478
>>     instance_start_time: "1281384593876026"
>>     disk_format_version: 5
>>   }
>>
>> After repair, it's this:
>>
>> {
>>   db_name: "testwritesdb"
>>   doc_count: 1
>>   doc_del_count: 0
>>   update_seq: 124969
>>   purge_seq: 0
>>   compact_running: false
>>   disk_size: 54857820
>>   instance_start_time: "1281385990193289"
>>   disk_format_version: 5
>>   committed_update_seq: 124969
>> }
>>
>> All the sequences are there and hitting _all_docs shows all the
>> documents, so why is the doc_count only 1 in the dbinfo?
>>
>> -Mikeal
>>
>> On Mon, Aug 9, 2010 at 11:53 AM, Filipe David Manana wrote:
>>
>>> For the record (and people not on IRC), the code at:
>>>
>>> http://github.com/fdmanana/couchdb/commits/db_repair
>>>
>>> is working for at least simple cases. Use
>>> couch_db_repair:repair(DbNameAsString).
>>> There's one TODO: update the reduce values for the by_seq and by_id
>>> BTrees.
>>>
>>> If anyone wants to give some help on this, you're welcome.
>>>
>>> On Mon, Aug 9, 2010 at 6:12 PM, Mikeal Rogers wrote:
>>>
>>> > I'm starting to create a bunch of test db files that expose this bug
>>> > under different conditions like multiple restarts, across compaction,
>>> > variances in updates that might cause conflict, etc.
>>> >
>>> > http://github.com/mikeal/couchtest
>>> >
>>> > The README outlines what was done to the db's and what needs to be
>>> > recovered.
>>> >
>>> > -Mikeal
>>> >
>>> > On Mon, Aug 9, 2010 at 9:33 AM, Filipe David Manana
>>> > <fdmanana@apache.org> wrote:
>>> >
>>> > > On Mon, Aug 9, 2010 at 5:22 PM, Robert Newson
>>> > > <robert.newson@gmail.com> wrote:
>>> > >
>>> > > > Doesn't this bit;
>>> > > >
>>> > > > -        Db#db{waiting_delayed_commit=nil};
>>> > > > +        Db;
>>> > > > +        % Db#db{waiting_delayed_commit=nil};
>>> > > >
>>> > > > revert the bug fix?
>>> > > >
>>> > >
>>> > > That's intentional, for my local testing.
>>> > > That patch obviously isn't anything close to final; it's still too
>>> > > experimental.
>>> > >
>>> > > >
>>> > > > B.
>>> > > >
>>> > > > On Mon, Aug 9, 2010 at 5:09 PM, Jan Lehnardt wrote:
>>> > > > > Hi All,
>>> > > > >
>>> > > > > Filipe jumped in to start working on the recovery tool, but he
>>> > > > > isn't done yet.
>>> > > > >
>>> > > > > Here's the current patch:
>>> > > > >
>>> > > > > http://www.friendpaste.com/4uMngrym4r7Zz4R0ThSHbz
>>> > > > >
>>> > > > > it is not done and very early, but any help on this is greatly
>>> > > > > appreciated.
>>> > > > >
>>> > > > > The current state is (in Filipe's words):
>>> > > > >  - i can detect that a file needs repair
>>> > > > >  - and get the last btree roots from it
>>> > > > >  - "only" missing: get last db seq num
>>> > > > >  - write new header
>>> > > > >  - and deal with the local docs btree (if exists)
>>> > > > >
>>> > > > > Thanks!
>>> > > > > Jan
>>> > > > >
>>> > >
>>> > > --
>>> > > Filipe David Manana,
>>> > > fdmanana@apache.org
>>> > >
>>> > > "Reasonable men adapt themselves to the world.
>>> > >  Unreasonable men adapt the world to themselves.
>>> > >  That's why all progress depends on unreasonable men."
>>> > >
>>>
>>> --
>>> Filipe David Manana,
>>> fdmanana@apache.org
>>>
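
For readers who were not on IRC, the idea behind
couch_db_repair:find_nodes_quickly/1 (read the file in 1MB chunks and search
each chunk in memory, rather than issuing one file call per byte) can be
illustrated roughly as follows. This is only a sketch in Python, not the
Erlang code from Adam's branch. The function name find_marker_offsets and the
byte pattern passed as `marker` are hypothetical stand-ins; the real node
detection logic and CouchDB's on-disk format are not shown here. The sketch
also scans forward, whereas the summary above describes scanning backward.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size mentioned in the summary


def find_marker_offsets(fileobj, marker, chunk_size=CHUNK_SIZE):
    """Return the absolute offsets of every occurrence of `marker`.

    `marker` is a hypothetical byte pattern standing in for whatever
    identifies a candidate node; it is NOT CouchDB's on-disk format.
    """
    if not marker:
        raise ValueError("marker must be non-empty")

    offsets = []
    overlap = len(marker) - 1  # bytes carried over between chunks
    tail = b""                 # end of the previous window
    base = 0                   # absolute file offset of window[0]

    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break

        # Prepend the carried-over bytes so that a marker straddling a
        # chunk boundary is still found.
        window = tail + chunk

        start = 0
        while True:
            idx = window.find(marker, start)
            if idx == -1:
                break
            offsets.append(base + idx)
            start = idx + 1  # allow overlapping matches

        # The tail is shorter than the marker, so no match can be
        # reported twice.
        tail = window[-overlap:] if overlap else b""
        base += len(window) - len(tail)

    return offsets


if __name__ == "__main__":
    # Tiny in-memory demonstration with a deliberately small chunk size
    # so that one marker straddles a chunk boundary.
    data = io.BytesIO(b"xxABCyyABCz")
    print(find_marker_offsets(data, b"ABC", chunk_size=4))
```

The point of the design is that the number of file reads drops from one per
byte to one per chunk; the small overlap carried between chunks is what keeps
a pattern that spans two chunks from being missed.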