couchdb-dev mailing list archives

From Randall Leeds <randall.le...@gmail.com>
Subject Re: data recovery tool progress
Date Tue, 10 Aug 2010 01:09:15 GMT
Summing up what went on in IRC for those who were absent.

The latest progress is on Adam's branch at
http://github.com/kocolosk/couchdb/tree/db_repair

couch_db_repair:make_lost_and_found/1 attempts to create a new
lost+found/DbName database into which it merges every node that is not
reachable from anywhere else, i.e. not referenced by any other node
found in a full file scan or by any header pointer.
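
To make the selection rule concrete, here is a rough sketch in Python
(not the Erlang code from the branch). It assumes the file scan has
already produced a map from each node's offset to the offsets it points
at; the function name, the map and the header pointer list are all
hypothetical and only illustrate the idea.

```python
def find_unreferenced_nodes(children_by_offset, header_pointers):
    """Return offsets of nodes that nothing else points to.

    children_by_offset: hypothetical dict {node_offset: [child_offset, ...]}
    built from a full file scan.
    header_pointers: hypothetical list of root offsets taken from headers.
    """
    # Everything a header or another node points at counts as reachable.
    referenced = set(header_pointers)
    for children in children_by_offset.values():
        referenced.update(children)
    # What is left are orphaned roots: candidates for lost+found.
    return sorted(off for off in children_by_offset if off not in referenced)


# Node 500 is a header root; node 400 is a root that no header records.
scan = {100: [], 200: [100], 300: [], 400: [300], 500: [200]}
lost = find_unreferenced_nodes(scan, header_pointers=[500])
print(lost)
```

Node 300 is not reported on its own because 400 points to it, so only
the orphaned root would be merged.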

Currently, make_lost_and_found uses Volker's repair (from the
couch_db_repair_b module, also in Adam's branch).
Adam found that the repair process was taking a very long time and
that the bottleneck was couch_file calls, so he added
couch_db_repair:find_nodes_quickly/1, which reads 1MB chunks as binary
and tries to process them to find nodes instead of scanning back one
byte at a time. It is currently not hooked up to the repair mechanism.
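
For those who want a picture of the chunked approach, here is a minimal
sketch in Python (again, not the Erlang from the branch). It assumes
nodes can be spotted by some byte marker; the marker, the function name
and the parameters are hypothetical. The one subtle part is carrying a
small tail between chunks so a marker that straddles a chunk boundary
is not missed.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1MB, as in the description above


def find_marker_offsets(stream, marker, chunk_size=CHUNK_SIZE):
    """Return file offsets where `marker` occurs, reading in large chunks."""
    if not marker:
        raise ValueError("marker must not be empty")
    offsets = []
    tail = b""  # last len(marker) - 1 bytes of the previous window
    base = 0    # file offset of the first byte of the current window
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        pos = window.find(marker)
        while pos != -1:
            offsets.append(base + pos)
            pos = window.find(marker, pos + 1)
        # Keep too few bytes to hold a whole marker, so no match is
        # reported twice, but enough to catch one split across chunks.
        keep = len(marker) - 1
        tail = window[-keep:] if keep > 0 else b""
        base += len(window) - len(tail)
    return offsets


data = io.BytesIO(b"xxABCyyABCzz")
print(find_marker_offsets(data, b"ABC", chunk_size=4))
```

Each read costs one call per chunk rather than one per byte, which is
the point of the change; turning a candidate offset into a real node is
left out of the sketch.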

Making progress. Go team.

On Mon, Aug 9, 2010 at 13:52, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:
> jchris suggested on IRC that I try a normal doc update and see if that fixes
> it.
>
> It does. After a new doc was created the dbinfo doc count was back to
> normal.
>
> -Mikeal
>
> On Mon, Aug 9, 2010 at 1:39 PM, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:
>
>> Ok, I pulled down this code and tested against a database with a ton of
>> missing writes right before a single restart.
>>
>> Before restart this was the database:
>>
>>   {
>>     db_name: "testwritesdb"
>>     doc_count: 124969
>>     doc_del_count: 0
>>     update_seq: 124969
>>     purge_seq: 0
>>     compact_running: false
>>     disk_size: 54857478
>>     instance_start_time: "1281384140058211"
>>     disk_format_version: 5
>>   }
>>
>> After restart it was this:
>>
>>   {
>>     db_name: "testwritesdb"
>>     doc_count: 1
>>     doc_del_count: 0
>>     update_seq: 1
>>     purge_seq: 0
>>     compact_running: false
>>     disk_size: 54857478
>>     instance_start_time: "1281384593876026"
>>     disk_format_version: 5
>>   }
>>
>> After repair, it's this:
>>
>> {
>>   db_name: "testwritesdb"
>>   doc_count: 1
>>   doc_del_count: 0
>>   update_seq: 124969
>>   purge_seq: 0
>>   compact_running: false
>>   disk_size: 54857820
>>   instance_start_time: "1281385990193289"
>>   disk_format_version: 5
>>   committed_update_seq: 124969
>> }
>>
>> All the sequences are there, and hitting _all_docs shows all the documents,
>> so why is the doc_count only 1 in the dbinfo?
>>
>> -Mikeal
>>
>> On Mon, Aug 9, 2010 at 11:53 AM, Filipe David Manana <fdmanana@apache.org> wrote:
>>
>>> For the record (and people not on IRC), the code at:
>>>
>>> http://github.com/fdmanana/couchdb/commits/db_repair
>>>
>>> is working for at least simple cases. Use
>>> couch_db_repair:repair(DbNameAsString).
>>> There's one TODO:  update the reduce values for the by_seq and by_id
>>> BTrees.
>>>
>>> If anyone wants to give some help on this, you're welcome.
>>>
>>> On Mon, Aug 9, 2010 at 6:12 PM, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:
>>>
>>> > I'm starting to create a bunch of test db files that expose this bug under
>>> > different conditions like multiple restarts, across compaction, variances in
>>> > updates that might cause conflict, etc.
>>> >
>>> > http://github.com/mikeal/couchtest
>>> >
>>> > The README outlines what was done to the db's and what needs to be
>>> > recovered.
>>> >
>>> > -Mikeal
>>> >
>>> > On Mon, Aug 9, 2010 at 9:33 AM, Filipe David Manana <fdmanana@apache.org> wrote:
>>> >
>>> > > On Mon, Aug 9, 2010 at 5:22 PM, Robert Newson <robert.newson@gmail.com> wrote:
>>> > >
>>> > > > Doesn't this bit;
>>> > > >
>>> > > > -        Db#db{waiting_delayed_commit=nil};
>>> > > > +        Db;
>>> > > > +        % Db#db{waiting_delayed_commit=nil};
>>> > > >
>>> > > > revert the bug fix?
>>> > > >
>>> > >
>>> > > That's intentional, for my local testing.
>>> > > That patch obviously isn't anything close to final; it's still too
>>> > > experimental.
>>> > >
>>> > > >
>>> > > > B.
>>> > > >
>>> > > > On Mon, Aug 9, 2010 at 5:09 PM, Jan Lehnardt <jan@apache.org> wrote:
>>> > > > > Hi All,
>>> > > > >
>>> > > > > Filipe jumped in to start working on the recovery tool, but he isn't
>>> > > > > done yet.
>>> > > > >
>>> > > > > Here's the current patch:
>>> > > > >
>>> > > > > http://www.friendpaste.com/4uMngrym4r7Zz4R0ThSHbz
>>> > > > >
>>> > > > > it is not done and very early, but any help on this is greatly
>>> > > > > appreciated.
>>> > > > >
>>> > > > > The current state is (in Filipe's words):
>>> > > > >  - i can detect that a file needs repair
>>> > > > >  - and get the last btree roots from it
>>> > > > >  - "only" missing: get last db seq num
>>> > > > >  - write new header
>>> > > > >  - and deal with the local docs btree (if exists)
>>> > > > >
>>> > > > > Thanks!
>>> > > > > Jan
>>> > > > > --
>>> > > > >
>>> > > > >
>>> > > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Filipe David Manana,
>>> > > fdmanana@apache.org
>>> > >
>>> > > "Reasonable men adapt themselves to the world.
>>> > >  Unreasonable men adapt the world to themselves.
>>> > >  That's why all progress depends on unreasonable men."
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>> Filipe David Manana,
>>> fdmanana@apache.org
>>>
>>> "Reasonable men adapt themselves to the world.
>>>  Unreasonable men adapt the world to themselves.
>>>  That's why all progress depends on unreasonable men."
>>>
>>
>>
>
