couchdb-dev mailing list archives

From Mikeal Rogers <mikeal.rog...@gmail.com>
Subject Re: data recovery tool progress
Date Wed, 11 Aug 2010 01:11:17 GMT
Found one issue: we weren't picking up design docs because the tool didn't
have admin privileges.

Adam fixed it and pushed and I've verified that it works now.

I wrote a little node script to show all recovered documents and expose any
documents that didn't make it into lost+found.

http://github.com/mikeal/couchtest/blob/master/validate.js

Requires the request module: `npm install request`.
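The heart of the check is a set difference: any doc ID present in the source
DB but absent from lost+found is still missing. A rough sketch of just that
logic (the missingDocs helper and the example IDs are made up; validate.js
does the real thing over HTTP with request):

```javascript
// Sketch: compare doc IDs from the original DB against the IDs that
// made it into lost+found. Any id present in the source but absent
// from lost+found is still lost. Names here are hypothetical.
function missingDocs(sourceIds, recoveredIds) {
  var recovered = {};
  recoveredIds.forEach(function (id) { recovered[id] = true; });
  return sourceIds.filter(function (id) { return !recovered[id]; });
}

// Example: two docs recovered, one design doc still missing.
var lost = missingDocs(['a', 'b', '_design/app'], ['a', 'b']);
console.log(lost); // -> ['_design/app']
```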

I'm now running recovery on all the test DBs I have and running the
validation script against them.

-Mikeal

On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:

> I have some timing numbers for the new code.
>
> multi_conflict has 200 lost documents and 201 documents total after
> recovery.
> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
> {25217069,ok}
> 25 seconds
>
> Something funky is going on here. Investigating.
> 1> timer:tc(couch_db_repair, make_lost_and_found,
> ["multi_conflict_with_attach"]).
> {654782,ok}
> .6 seconds
>
> This db has 124969 documents in it.
> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
> {1381969304,ok}
> 23 minutes
>
> This database is about 500 MB, with 46660 documents before recovery and 46801 after.
> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
> {2329669113,ok}
> 38.8 minutes
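For anyone eyeballing these tuples: timer:tc returns {Microseconds, Result},
so the first element is a microsecond count. A throwaway converter
(JavaScript just for illustration; usToHuman is a made-up helper):

```javascript
// timer:tc returns {Microseconds, Result}; convert the first element
// to readable units. The inputs below are the figures from this thread.
function usToHuman(us) {
  var seconds = us / 1e6;
  if (seconds < 60) return seconds.toFixed(1) + ' s';
  return (seconds / 60).toFixed(1) + ' min';
}

console.log(usToHuman(25217069));   // "25.2 s"
console.log(usToHuman(654782));     // "0.7 s"
console.log(usToHuman(1381969304)); // "23.0 min"
console.log(usToHuman(2329669113)); // "38.8 min"
```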
>
> -Mikeal
>
> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>
>> Good idea.  Now we've got
>>
>> > [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576
>> bytes at 1380102
>> > [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576
>> bytes at 331526
>> > [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526
>> bytes at 0
>> > [info] [<0.33.0>] couch_db_repair writing 12 updates to
>> lost+found/testwritesdb
>> > [info] [<0.33.0>] couch_db_repair writing 9 updates to
>> lost+found/testwritesdb
>> > [info] [<0.33.0>] couch_db_repair writing 8 updates to
>> lost+found/testwritesdb
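Those scan lines are a fixed 1 MB window walked from the end of the file
back toward offset 0. A sketch of just the chunking arithmetic (scanWindows
is a made-up helper; the sizes are taken from the testwritesdb log above):

```javascript
// Sketch of the chunked backward scan the log lines describe: fixed
// 1 MB (1048576-byte) windows from the end of the file down to offset 0,
// with a final short window covering the remainder.
function scanWindows(fileSize, chunkSize) {
  var windows = [];
  var pos = fileSize;
  while (pos > 0) {
    var start = Math.max(0, pos - chunkSize);
    windows.push({ bytes: pos - start, at: start });
    pos = start;
  }
  return windows;
}

// File size chosen to reproduce the log: 1380102 + 1048576 bytes.
console.log(scanWindows(2428678, 1048576));
// -> [ { bytes: 1048576, at: 1380102 },
//      { bytes: 1048576, at: 331526 },
//      { bytes: 331526, at: 0 } ]
```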
>>
>> Adam
>>
>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
>>
>> > It took 20 minutes before the first 'update' line came out, but now
>> > it seems to be recovering smoothly. Machine load is back down to sane
>> > levels.
>> >
>> > Suggest feedback during the hunting phase.
>> >
>> > B.
>> >
>> > On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski <kocolosk@apache.org>
>> wrote:
>> >> Thanks for the crosscheck.  I'm not aware of anything in the node
>> finder that would cause it to struggle mightily with healthy DBs.  It pretty
>> much ignores the health of the DB, in fact.  Would be interested to hear
>> more.
>> >>
>> >> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
>> >>
>> >>> I verified the new code's ability to repair the testwritesdb. System
>> >>> load was smooth from start to finish.
>> >>>
>> >>> I started a further test on a different (healthy) database and system
>> >>> load was severe again, just collecting the roots (the lost+found db
>> >>> was not yet created when I aborted the attempt). I suspect the fact
>> >>> that it's healthy is the issue, so if I'm right, perhaps a warning is
>> >>> useful.
>> >>>
>> >>> B.
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski <kocolosk@apache.org>
>> wrote:
>> >>>> Another update.  This morning I took a different tack and, rather
>> than try to find root nodes, I just looked for all kv_nodes in the file and
>> treated each of those as a separate virtual DB to be replicated.  This
>> reduces the algorithmic complexity of the repair, and it looks like
>> testwritesdb repairs in ~30 minutes or so.  Also, this method results in the
>> lost+found DB containing every document, not just the missing ones.
>> >>>>
>> >>>> My branch does not currently include Randall's parallelization of
>> the replications.  It's still CPU-limited, so that may be a worthwhile
>> optimization.  On the other hand, I think we may be reaching a stage at
>> which performance for this repair tool is 'good enough', and pmaps can make
>> error handling a bit dicey.
>> >>>>
>> >>>> In short, I think this tool is now in good shape.
>> >>>>
>> >>>> http://github.com/kocolosk/couchdb/tree/db_repair
>> >>>>
>> >>
>> >>
>>
>>
>
