couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: data recovery tool progress
Date Tue, 10 Aug 2010 02:10:10 GMT
Right, make_lost_and_found still relies on code that reads through couch_file one byte at
a time; that's the cause of the slowness.  The newer scanner will improve that pretty dramatically,
and we can tune it further by increasing the length of the pattern we match when looking
for kp/kv_node terms in the files, at the expense of some extra complexity dealing with the
block prefixes (currently it does a 1-byte match, which as I understand it cannot be split
across blocks).
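
To make that concrete, here is a rough sketch (mine, not the actual scanner) of matching the
full serialized atom for kp_node/kv_node inside a chunk. The module and function names are
made up, it assumes term_to_binary/1 on the scanning node encodes atoms the same way the file
was written, and it ignores the block prefixes entirely, which is exactly the part that needs
the extra care described above.

    -module(node_candidates).
    -export([find/1]).

    %% Return the offsets in Chunk where the serialized atom kp_node or
    %% kv_node occurs -- a multi-byte pattern rather than a 1-byte match.
    %% A candidate that straddles a block prefix will not match here.
    find(Chunk) when is_binary(Chunk) ->
        lists:usort(
          [Pos || Atom <- [kp_node, kv_node],
                  {Pos, _Len} <- binary:matches(Chunk, atom_pattern(Atom))]).

    %% term_to_binary(Atom) is <<131, EncodedAtom/binary>>; drop the
    %% version byte and use the encoded atom as the search pattern.
    atom_pattern(Atom) ->
        <<131, Enc/binary>> = term_to_binary(Atom),
        Enc.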

Regarding the file_corruption error on the larger file, I think this is something we will
just naturally trigger when we take a guess that random positions in a file are actually the
beginning of a term.  I think our best recourse here is to return {error, file_corruption}
from couch_file but leave the gen_server up and running instead of terminating it.  That way
the repair code can ignore the error and keep moving without having to reopen the file.
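
In code, the repair side of that contract might look something like the sketch below. It
assumes the proposed behaviour (couch_file:pread_term/2 returning {error, file_corruption}
and leaving the server running) rather than what the tree does today, and
collect_candidate_terms/2 is a made-up helper, not something in the branch.

    %% Try each guessed offset; keep terms that decode, skip bad guesses,
    %% without the couch_file process going down in between.
    collect_candidate_terms(Fd, Offsets) ->
        lists:foldl(
          fun(Pos, Acc) ->
                  case couch_file:pread_term(Fd, Pos) of
                      {ok, Term} ->
                          [{Pos, Term} | Acc];
                      {error, file_corruption} ->
                          %% Wrong guess at a term boundary; ignore it.
                          Acc
                  end
          end, [], lists:sort(Offsets)).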

Next steps as I understand them - Randall is working on integrating the in-memory scanner
into Volker's code that finds all the dangling by_id nodes.  I'm working on making sure that
the scanner identifies bt node candidates which span block prefixes, and on improving its
pattern-matching.

Adam

On Aug 9, 2010, at 9:50 PM, Mikeal Rogers wrote:

> I pulled down the latest code from Adam's branch @
> 7080ff72baa329cf6c4be2a79e71a41f744ed93b.
> 
> Running timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
> on a database with 200 lost updates spanning 200 restarts (
> http://github.com/mikeal/couchtest/blob/master/multi_conflict.couch ) took
> about 101 seconds.
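
For anyone reproducing this: timer:tc/3 returns {Microseconds, Result}, so the 101 seconds
comes straight out of the first element, e.g.:

    {Micros, _Result} = timer:tc(couch_db_repair, make_lost_and_found,
                                 ["multi_conflict"]),
    io:format("repair took ~.1f s~n", [Micros / 1000000]).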
> 
> I tried running against a larger database (
> http://github.com/mikeal/couchtest/blob/master/testwritesdb.couch ) and I
> got this exception:
> 
> http://gist.github.com/516491
> 
> -Mikeal
> 
> 
> 
> On Mon, Aug 9, 2010 at 6:09 PM, Randall Leeds <randall.leeds@gmail.com>wrote:
> 
>> Summing up what went on in IRC for those who were absent.
>> 
>> The latest progress is on Adam's branch at
>> http://github.com/kocolosk/couchdb/tree/db_repair
>> 
>> couch_db_repair:make_lost_and_found/1 attempts to create a new
>> lost+found/DbName database into which it merges all nodes that are not
>> reachable from anywhere (that is, from any other node found in a full
>> file scan, or from any header pointer).
>> 
>> Currently, make_lost_and_found uses Volker's repair (from
>> couch_db_repair_b module, also in Adam's branch).
>> Adam found that the bottleneck was the couch_file calls and that the
>> repair process was taking a very long time, so he added
>> couch_db_repair:find_nodes_quickly/1, which reads 1MB chunks as binary
>> and processes them to find nodes instead of scanning back one byte at
>> a time. It is currently not hooked up to the repair mechanism.
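
For anyone who hasn't read the branch yet, the rough shape of that chunked scan is something
like the sketch below -- a standalone approximation, not the actual find_nodes_quickly/1,
which works against couch_file and still has to cope with matches that straddle chunk and
block boundaries (this sketch simply misses those).

    -module(chunk_scan_sketch).
    -export([scan_file/2]).

    -define(CHUNK, 1024 * 1024).  %% 1MB reads, as described above

    %% Return the absolute offsets of every match of Pattern in the file,
    %% reading it in 1MB chunks with plain file I/O.
    scan_file(Path, Pattern) ->
        {ok, Fd} = file:open(Path, [read, raw, binary]),
        try scan_chunks(Fd, Pattern, 0, [])
        after file:close(Fd)
        end.

    scan_chunks(Fd, Pattern, Offset, Acc) ->
        case file:pread(Fd, Offset, ?CHUNK) of
            {ok, Chunk} ->
                Found = [Offset + Pos
                         || {Pos, _Len} <- binary:matches(Chunk, Pattern)],
                scan_chunks(Fd, Pattern, Offset + byte_size(Chunk),
                            lists:reverse(Found, Acc));
            eof ->
                lists:reverse(Acc)
        end.

Something like chunk_scan_sketch:scan_file("testwritesdb.couch", <<"kv_node">>) would then
list the raw byte positions where the kv_node atom name appears, which is the kind of
candidate list the repair code needs to check.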
>> 
>> Making progress. Go team.
>> 
>> On Mon, Aug 9, 2010 at 13:52, Mikeal Rogers <mikeal.rogers@gmail.com>
>> wrote:
>>> jchris suggested on IRC that I try a normal doc update and see if that
>>> fixes it.
>>> 
>>> It does. After a new doc was created the dbinfo doc count was back to
>>> normal.
>>> 
>>> -Mikeal
>>> 
>>> On Mon, Aug 9, 2010 at 1:39 PM, Mikeal Rogers <mikeal.rogers@gmail.com
>>> wrote:
>>> 
>>>> Ok, I pulled down this code and tested against a database with a ton of
>>>> missing writes right before a single restart.
>>>> 
>>>> Before restart this was the database:
>>>> 
>>>>  {
>>>>    db_name: "testwritesdb"
>>>>    doc_count: 124969
>>>>    doc_del_count: 0
>>>>    update_seq: 124969
>>>>    purge_seq: 0
>>>>    compact_running: false
>>>>    disk_size: 54857478
>>>>    instance_start_time: "1281384140058211"
>>>>    disk_format_version: 5
>>>>  }
>>>> 
>>>> After restart it was this:
>>>> 
>>>>  {
>>>>    db_name: "testwritesdb"
>>>>    doc_count: 1
>>>>    doc_del_count: 0
>>>>    update_seq: 1
>>>>    purge_seq: 0
>>>>    compact_running: false
>>>>    disk_size: 54857478
>>>>    instance_start_time: "1281384593876026"
>>>>    disk_format_version: 5
>>>>  }
>>>> 
>>>> After repair, it's this:
>>>> 
>>>> {
>>>>  db_name: "testwritesdb"
>>>>  doc_count: 1
>>>>  doc_del_count: 0
>>>>  update_seq: 124969
>>>>  purge_seq: 0
>>>>  compact_running: false
>>>>  disk_size: 54857820
>>>>  instance_start_time: "1281385990193289"
>>>>  disk_format_version: 5
>>>>  committed_update_seq: 124969
>>>> }
>>>> 
>>>> All the sequences are there and hitting _all_docs shows all the
>>>> documents, so why is the doc_count only 1 in the dbinfo?
>>>> 
>>>> -Mikeal
>>>> 
>>>> On Mon, Aug 9, 2010 at 11:53 AM, Filipe David Manana <
>> fdmanana@apache.org>wrote:
>>>> 
>>>>> For the record (and people not on IRC), the code at:
>>>>> 
>>>>> http://github.com/fdmanana/couchdb/commits/db_repair
>>>>> 
>>>>> is working for at least simple cases. Use
>>>>> couch_db_repair:repair(DbNameAsString).
>>>>> There's one TODO:  update the reduce values for the by_seq and by_id
>>>>> BTrees.
>>>>> 
>>>>> If anyone wants to give some help on this, you're welcome to.
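
For anyone following along without the branch checked out, usage is just a call from a shell
on a node running Filipe's code, with the database name given as a string (presumably without
the .couch extension, the same convention as the make_lost_and_found runs elsewhere in the
thread):

    1> couch_db_repair:repair("testwritesdb").

That reduce-value TODO also matters when reading the dbinfo output quoted further up the
thread: doc_count comes from the by_id tree's reduction, so until those reduce values are
rebuilt (or a later update refreshes them, as jchris suggested) the count can look wrong
even though _all_docs shows every document.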
>>>>> 
>>>>> On Mon, Aug 9, 2010 at 6:12 PM, Mikeal Rogers <mikeal.rogers@gmail.com
>>>>>> wrote:
>>>>> 
>>>>>> I'm starting to create a bunch of test db files that expose this bug
>>>>>> under different conditions like multiple restarts, across compaction,
>>>>>> variances in updates that might cause conflict, etc.
>>>>>> 
>>>>>> http://github.com/mikeal/couchtest
>>>>>> 
>>>>>> The README outlines what was done to the db's and what needs to be
>>>>>> recovered.
>>>>>> 
>>>>>> -Mikeal
>>>>>> 
>>>>>> On Mon, Aug 9, 2010 at 9:33 AM, Filipe David Manana <
>>>>> fdmanana@apache.org
>>>>>>> wrote:
>>>>>> 
>>>>>>> On Mon, Aug 9, 2010 at 5:22 PM, Robert Newson <
>>>>> robert.newson@gmail.com
>>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Doesn't this bit;
>>>>>>>> 
>>>>>>>> -        Db#db{waiting_delayed_commit=nil};
>>>>>>>> +        Db;
>>>>>>>> +        % Db#db{waiting_delayed_commit=nil};
>>>>>>>> 
>>>>>>>> revert the bug fix?
>>>>>>>> 
>>>>>>> 
>>>>>>> That's intentional, for my local testing.
>>>>>>> That patch obviously isn't anything close to final; it's still too
>>>>>>> experimental.
>>>>>>> 
>>>>>>>> 
>>>>>>>> B.
>>>>>>>> 
>>>>>>>> On Mon, Aug 9, 2010 at 5:09 PM, Jan Lehnardt <jan@apache.org>
>>>>> wrote:
>>>>>>>>> Hi All,
>>>>>>>>> 
>>>>>>>>> Filipe jumped in to start working on the recovery tool, but he
>>>>>>>>> isn't done yet.
>>>>>>>>> 
>>>>>>>>> Here's the current patch:
>>>>>>>>> 
>>>>>>>>> http://www.friendpaste.com/4uMngrym4r7Zz4R0ThSHbz
>>>>>>>>> 
>>>>>>>>> it is not done and very early, but any help on this is greatly
>>>>>>>>> appreciated.
>>>>>>>>> 
>>>>>>>>> The current state is (in Filipe's words):
>>>>>>>>> - i can detect that a file needs repair
>>>>>>>>> - and get the last btree roots from it
>>>>>>>>> - "only" missing: get last db seq num
>>>>>>>>> - write new header
>>>>>>>>> - and deal with the local docs btree (if exists)
>>>>>>>>> 
>>>>>>>>> Thanks!
>>>>>>>>> Jan
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Filipe David Manana,
>>>>>>> fdmanana@apache.org
>>>>>>> 
>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>> That's why all progress depends on unreasonable men."
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Filipe David Manana,
>>>>> fdmanana@apache.org
>>>>> 
>>>>> "Reasonable men adapt themselves to the world.
>>>>> Unreasonable men adapt the world to themselves.
>>>>> That's why all progress depends on unreasonable men."
>>>>> 
>>>> 
>>>> 
>>> 
>> 

