Subject: Re: data recovery tool progress
From: Jan Lehnardt
Date: Tue, 10 Aug 2010 12:27:10 +0200
Message-Id: <12229601-B7B8-4E98-931E-054DA00C5092@apache.org>
To: dev@couchdb.apache.org

On 10 Aug 2010, at 10:55, Robert Newson wrote:

> I ran the db_repair code on a healthy database produced with
> delayed_commits=true.
>
> The source db had 3218 docs. db_repair recovered 3120 and then returned
> with ok.

This looks like we are recovering nodes that don't need recovering,
because on a healthy db produced with delayed_commits=true we should
not have any orphans at all; the lost and found db should be empty.

> I'm redoing that test, but this indicates we're not finding all roots.
>
> I note that the output file was 36 times the input file, which is a
> consequence of folding all possible roots. I think that needs to be in
> the release notes for the repair tool if that behavior remains when it
> ships.
>
> B.
>
> On Tue, Aug 10, 2010 at 9:09 AM, Mikeal Rogers wrote:
>> I think I found a bug in the current lost+found repair.
>>
>> I've been running it against the testwritesdb and it's in a state that
>> never finishes.
>>
>> It's still spitting out these lines:
>>
>> [info] [<0.32.0>] writing 1001 updates to lost+found/testwritesdb
>>
>> Most are 1001, but there are also other random variances: 452, 866, etc.
>>
>> But the file size and dbinfo haven't budged in over 30 minutes. The
>> size is stuck at 34300002 with the original db file being 54857478.
>>
>> This database only has one document in it that isn't "lost", so if
>> it's finding *any* new docs it should be writing them.
>>
>> I also started another job to recover a production db that is quite
>> large, 500 megs, with the missing data a week or so back. This has
>> been running for 2 hours and has still not output anything or created
>> the lost and found db, so I can only assume that it is in the same
>> state.
>>
>> Both machines are still churning 100% CPU.
>>
>> -Mikeal
>>
>> On Mon, Aug 9, 2010 at 11:26 PM, Adam Kocoloski wrote:
>>
>>> With Randall's help we hooked the new node scanner up to the
>>> lost+found DB generator. It seems to work well enough for small DBs;
>>> for large DBs with lots of missing nodes the O(N^2) complexity of the
>>> problem catches up to the code and generating the lost+found DB takes
>>> quite some time. Mikeal is running tests tonight. The algo appears
>>> pretty CPU-limited, so a little parallelization may be warranted.
>>>
>>> http://github.com/kocolosk/couchdb/tree/db_repair
>>>
>>> Adam
>>>
>>> (I sent this previous update to myself instead of the list, so I'll
>>> forward it here ...)
>>>
>>> On Aug 10, 2010, at 12:01 AM, Adam Kocoloski wrote:
>>>
>>>> On Aug 9, 2010, at 10:10 PM, Adam Kocoloski wrote:
>>>>
>>>>> Right, make_lost_and_found still relies on code which reads through
>>>>> couch_file one byte at a time; that's the cause of the slowness.
>>>>> The newer scanner will improve that pretty dramatically, and we can
>>>>> tune it further by increasing the length of the pattern that we
>>>>> match when looking for kp/kv_node terms in the files, at the
>>>>> expense of some extra complexity dealing with the block prefixes
>>>>> (currently it does a 1-byte match, which as I understand it cannot
>>>>> be split across blocks).
>>>>
>>>> The scanner now looks for a 7-byte match, unless it is within 6
>>>> bytes of a block boundary, in which case it looks for the longest
>>>> possible match at that position. The more specific match condition
>>>> greatly reduces the number of calls to couch_file, and thus boosts
>>>> the throughput. On my laptop it can scan the testwritesdb.couch from
>>>> Mikeal's couchtest repo (52 MB) in 18 seconds.
>>>>
>>>>> Regarding the file_corruption error on the larger file, I think
>>>>> this is something we will just naturally trigger when we take a
>>>>> guess that random positions in a file are actually the beginning of
>>>>> a term. I think our best recourse here is to return {error,
>>>>> file_corruption} from couch_file but leave the gen_server up and
>>>>> running instead of terminating it. That way the repair code can
>>>>> ignore the error and keep moving without having to reopen the file.
>>>>
>>>> I committed this change (to my db_repair branch) after consulting
>>>> with Chris. The longer match condition makes these spurious
>>>> file_corruption triggers much less likely, but I think it's still a
>>>> good thing not to crash the server when they happen.
>>>>
>>>>> Next steps as I understand them: Randall is working on integrating
>>>>> the in-memory scanner into Volker's code that finds all the
>>>>> dangling by_id nodes. I'm working on making sure that the scanner
>>>>> identifies bt node candidates which span block prefixes, and on
>>>>> improving its pattern-matching.
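For illustration, a minimal sketch of the scanner idea described
above. This is not the code in the db_repair branch: it assumes btree
nodes were appended via term_to_binary/1 as {kv_node, _} or
{kp_node, _} tuples, so each one starts with the fixed external term
format bytes 131,104,2,100,0,7 followed by the atom name (atoms encode
as ATOM_EXT in the OTP of this era), and it ignores the couch_file
block prefixes the real scanner has to deal with:

-module(repair_sketch).
-export([find_candidates/1]).

%% Sketch only -- not couch_db_repair. Returns offsets in Bin that
%% look like the start of a serialized btree node.
find_candidates(Bin) ->
    find_candidates(Bin, 0, []).

%% 131 = version, 104 = small tuple, 2 = arity, 100 = atom, 0,7 = length.
find_candidates(<<131,104,2,100,0,7,Tail/binary>> = Bin, Off, Acc0) ->
    Acc = case Tail of
              <<"kv_node", _/binary>> -> [Off | Acc0];
              <<"kp_node", _/binary>> -> [Off | Acc0];
              _ -> Acc0
          end,
    <<_, Rest/binary>> = Bin,
    find_candidates(Rest, Off + 1, Acc);
find_candidates(<<_, Rest/binary>>, Off, Acc) ->
    find_candidates(Rest, Off + 1, Acc);
find_candidates(<<>>, _Off, Acc) ->
    lists:reverse(Acc).

Each candidate offset would then be handed to something like
couch_file:pread_term/2; the positions that deserialize cleanly are
real nodes, the rest are the spurious matches Adam mentions.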
>>>>
>>>> Latest from my end:
>>>> http://github.com/kocolosk/couchdb/tree/db_repair
>>>>
>>>>> Adam
>>>>>
>>>>> On Aug 9, 2010, at 9:50 PM, Mikeal Rogers wrote:
>>>>>
>>>>>> I pulled down the latest code from Adam's branch @
>>>>>> 7080ff72baa329cf6c4be2a79e71a41f744ed93b.
>>>>>>
>>>>>> Running timer:tc(couch_db_repair, make_lost_and_found,
>>>>>> ["multi_conflict"]). on a database with 200 lost updates spanning
>>>>>> 200 restarts (
>>>>>> http://github.com/mikeal/couchtest/blob/master/multi_conflict.couch )
>>>>>> took about 101 seconds.
>>>>>>
>>>>>> I tried running against a larger database (
>>>>>> http://github.com/mikeal/couchtest/blob/master/testwritesdb.couch )
>>>>>> and I got this exception:
>>>>>>
>>>>>> http://gist.github.com/516491
>>>>>>
>>>>>> -Mikeal
>>>>>>
>>>>>> On Mon, Aug 9, 2010 at 6:09 PM, Randall Leeds wrote:
>>>>>>
>>>>>>> Summing up what went on in IRC for those who were absent.
>>>>>>>
>>>>>>> The latest progress is on Adam's branch at
>>>>>>> http://github.com/kocolosk/couchdb/tree/db_repair
>>>>>>>
>>>>>>> couch_db_repair:make_lost_and_found/1 attempts to create a new
>>>>>>> lost+found/DbName database to which it merges all nodes not
>>>>>>> accessible from anywhere (any other node found in a full file
>>>>>>> scan, or any header pointers).
>>>>>>>
>>>>>>> Currently, make_lost_and_found uses Volker's repair (from the
>>>>>>> couch_db_repair_b module, also in Adam's branch).
>>>>>>> Adam found that the bottleneck was couch_file calls and that the
>>>>>>> repair process was taking a very long time, so he added
>>>>>>> couch_db_repair:find_nodes_quickly/1, which reads 1MB chunks as
>>>>>>> binaries and processes them to find nodes, instead of scanning
>>>>>>> back one byte at a time. It is currently not hooked up to the
>>>>>>> repair mechanism.
>>>>>>>
>>>>>>> Making progress. Go team.
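A sketch of that chunked scan, with hypothetical names (this is not
find_nodes_quickly itself): read the file in 1 MB slabs via
file:pread/3, keep a small overlap between slabs so a node prefix that
straddles a slab boundary isn't missed, and feed each slab to the
find_candidates/1 matcher sketched earlier in this mail:

%% Continues the hypothetical repair_sketch module from above.
-define(CHUNK, 1024*1024).
-define(OVERLAP, 16). %% longer than any prefix we match on

scan_file(Path) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try scan_file(Fd, 0, []) after file:close(Fd) end.

scan_file(Fd, Pos, Acc) ->
    case file:pread(Fd, Pos, ?CHUNK + ?OVERLAP) of
        {ok, Bin} ->
            %% Offsets falling in the overlap belong to the next slab,
            %% which re-reads those bytes and finds them itself.
            Offs = [Pos + O || O <- find_candidates(Bin), O < ?CHUNK],
            case byte_size(Bin) > ?CHUNK of
                true  -> scan_file(Fd, Pos + ?CHUNK, Offs ++ Acc);
                false -> lists:sort(Offs ++ Acc)  %% short read: EOF
            end;
        eof ->
            lists:sort(Acc)
    end.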
>>>>>>>
>>>>>>> On Mon, Aug 9, 2010 at 13:52, Mikeal Rogers wrote:
>>>>>>>
>>>>>>>> jchris suggested on IRC that I try a normal doc update and see
>>>>>>>> if that fixes it.
>>>>>>>>
>>>>>>>> It does. After a new doc was created the dbinfo doc count was
>>>>>>>> back to normal.
>>>>>>>>
>>>>>>>> -Mikeal
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2010 at 1:39 PM, Mikeal Rogers
>>>>>>>> <mikeal.rogers@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ok, I pulled down this code and tested against a database with
>>>>>>>>> a ton of missing writes right before a single restart.
>>>>>>>>>
>>>>>>>>> Before restart this was the database:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>> db_name: "testwritesdb"
>>>>>>>>> doc_count: 124969
>>>>>>>>> doc_del_count: 0
>>>>>>>>> update_seq: 124969
>>>>>>>>> purge_seq: 0
>>>>>>>>> compact_running: false
>>>>>>>>> disk_size: 54857478
>>>>>>>>> instance_start_time: "1281384140058211"
>>>>>>>>> disk_format_version: 5
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> After restart it was this:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>> db_name: "testwritesdb"
>>>>>>>>> doc_count: 1
>>>>>>>>> doc_del_count: 0
>>>>>>>>> update_seq: 1
>>>>>>>>> purge_seq: 0
>>>>>>>>> compact_running: false
>>>>>>>>> disk_size: 54857478
>>>>>>>>> instance_start_time: "1281384593876026"
>>>>>>>>> disk_format_version: 5
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> After repair, it's this:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>> db_name: "testwritesdb"
>>>>>>>>> doc_count: 1
>>>>>>>>> doc_del_count: 0
>>>>>>>>> update_seq: 124969
>>>>>>>>> purge_seq: 0
>>>>>>>>> compact_running: false
>>>>>>>>> disk_size: 54857820
>>>>>>>>> instance_start_time: "1281385990193289"
>>>>>>>>> disk_format_version: 5
>>>>>>>>> committed_update_seq: 124969
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> All the sequences are there and hitting _all_docs shows all the
>>>>>>>>> documents, so why is the doc_count only 1 in the dbinfo?
>>>>>>>>>
>>>>>>>>> -Mikeal
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2010 at 11:53 AM, Filipe David Manana
>>>>>>>>> <fdmanana@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> For the record (and people not on IRC), the code at:
>>>>>>>>>>
>>>>>>>>>> http://github.com/fdmanana/couchdb/commits/db_repair
>>>>>>>>>>
>>>>>>>>>> is working for at least simple cases. Use
>>>>>>>>>> couch_db_repair:repair(DbNameAsString).
>>>>>>>>>> There's one TODO: update the reduce values for the by_seq and
>>>>>>>>>> by_id BTrees.
>>>>>>>>>>
>>>>>>>>>> If anyone wants to give some help on this, you're welcome.
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2010 at 6:12 PM, Mikeal Rogers
>>>>>>>>>> <mikeal.rogers@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm starting to create a bunch of test db files that expose
>>>>>>>>>>> this bug under different conditions, like multiple restarts,
>>>>>>>>>>> across compaction, variances in updates that might cause
>>>>>>>>>>> conflicts, etc.
>>>>>>>>>>>
>>>>>>>>>>> http://github.com/mikeal/couchtest
>>>>>>>>>>>
>>>>>>>>>>> The README outlines what was done to the db's and what needs
>>>>>>>>>>> to be recovered.
>>>>>>>>>>>
>>>>>>>>>>> -Mikeal
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2010 at 9:33 AM, Filipe David Manana
>>>>>>>>>>> <fdmanana@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2010 at 5:22 PM, Robert Newson
>>>>>>>>>>>> <robert.newson@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Doesn't this bit:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    Db#db{waiting_delayed_commit=nil};
>>>>>>>>>>>>> +    Db;
>>>>>>>>>>>>> +    % Db#db{waiting_delayed_commit=nil};
>>>>>>>>>>>>>
>>>>>>>>>>>>> revert the bug fix?
>>>>>>>>>>>>
>>>>>>>>>>>> That's intentional, for my local testing.
>>>>>>>>>>>> That patch obviously isn't anything close to final; it's
>>>>>>>>>>>> still too experimental.
>>>>>>>>>>>>
>>>>>>>>>>>>> B.
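The doc_count puzzle above fits Filipe's open TODO about reduce
values: dbinfo answers doc_count from the reduction cached in the
btree's interior nodes, while _all_docs walks the leaves, so the two
can disagree when a tree is stitched back together without recomputing
the cached values. A toy model of the situation (this is not
couch_btree, and the real by_id reduction tracks more than a count):

-module(toy_btree).
-export([cached_count/1, true_count/1]).

%% Toy btree: kp_nodes cache the doc count of their subtree.
%% A cheap query (think dbinfo's doc_count) trusts the cache:
cached_count({kp_node, _Children, CachedCount}) -> CachedCount;
cached_count({kv_node, KVs}) -> length(KVs).

%% A full walk (think _all_docs) ignores the caches:
true_count({kp_node, Children, _MaybeStale}) ->
    lists:sum([true_count(C) || C <- Children]);
true_count({kv_node, KVs}) ->
    length(KVs).

%% Example of a repaired root with a stale cache:
%%   Root = {kp_node, [{kv_node, [a, b, c]}], 1}.
%%   cached_count(Root) -> 1, but true_count(Root) -> 3.

Fixing the TODO amounts to recomputing every cached value bottom-up.
This would also explain why a single ordinary doc update appeared to
repair Mikeal's count: the write path rewrites the nodes on the branch
it touches, recomputing their reductions as it goes.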
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 9, 2010 at 5:09 PM, Jan Lehnardt wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Filipe jumped in to start working on the recovery tool,
>>>>>>>>>>>>>> but he isn't done yet.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's the current patch:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://www.friendpaste.com/4uMngrym4r7Zz4R0ThSHbz
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is not done and very early, but any help on this is
>>>>>>>>>>>>>> greatly appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The current state is (in Filipe's words):
>>>>>>>>>>>>>> - I can detect that a file needs repair
>>>>>>>>>>>>>> - and get the last btree roots from it
>>>>>>>>>>>>>> - "only" missing: get last db seq num
>>>>>>>>>>>>>> - write new header
>>>>>>>>>>>>>> - and deal with the local docs btree (if it exists)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Filipe David Manana,
>>>>>>>>>>>> fdmanana@apache.org
>>>>>>>>>>>>
>>>>>>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>>>>>>> That's why all progress depends on unreasonable men."
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Filipe David Manana,
>>>>>>>>>> fdmanana@apache.org
>>>>>>>>>>
>>>>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>>>>> That's why all progress depends on unreasonable men."
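A closing illustration for the first item on Filipe's list. Under the
couch_file layout of this era -- the file is divided into 4096-byte
blocks, and a block that begins a header has 1 as its first byte --
finding the last committed header can be sketched like this
(hypothetical names, not the patch's code):

%% Continues the hypothetical repair_sketch module from above.
-define(SIZE_BLOCK, 4096).

%% Walk backward block by block looking for the last block flagged as
%% a header. The real code must also deserialize the header and verify
%% its checksum; "needs repair" then means the file holds valid data
%% appended after that last good header.
find_last_header_block(Fd, FileSize) when FileSize > 0 ->
    LastBlock = ((FileSize - 1) div ?SIZE_BLOCK) * ?SIZE_BLOCK,
    walk_back(Fd, LastBlock).

walk_back(_Fd, Pos) when Pos < 0 ->
    no_valid_header;
walk_back(Fd, Pos) ->
    case file:pread(Fd, Pos, 1) of
        {ok, <<1>>} -> {ok, Pos};
        _ -> walk_back(Fd, Pos - ?SIZE_BLOCK)
    end.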