couchdb-dev mailing list archives

From Mikeal Rogers <mikeal.rog...@gmail.com>
Subject Re: data recovery tool progress
Date Wed, 11 Aug 2010 02:40:27 GMT
Ok, latest code has been tested against every db that I have and it works
great.

What are our next steps here?

I'd like to get this out to all the people who didn't feel comfortable
sending me their db to test against, before we release it more widely.

-Mikeal

On Tue, Aug 10, 2010 at 6:11 PM, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:

> Found one issue: we weren't picking up design docs because the tool
> didn't have admin privileges.
>
> Adam fixed it and pushed and I've verified that it works now.
>
> I wrote a little node script to show all recovered documents and expose
> any documents that didn't make it into lost+found.
>
> http://github.com/mikeal/couchtest/blob/master/validate.js
>
> Requires request, `npm install request`.
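>
> The core of the check is just a set difference over doc ids pulled from
> _all_docs on the original db and on lost+found. A minimal Erlang sketch
> of that comparison (hypothetical function name, not the actual
> validate.js, which is JavaScript):
>
>     %% Ids present in the original db but missing from lost+found.
>     missing_ids(OrigIds, LostFoundIds) ->
>         Found = sets:from_list(LostFoundIds),
>         [Id || Id <- OrigIds, not sets:is_element(Id, Found)].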
>
> I'm now running recover on all the test db's I have and running the
> validation script against them.
>
> -Mikeal
>
>
> On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:
>
>> I have some timing numbers for the new code.
>>
>> multi_conflict has 200 lost documents and 201 documents total after
>> recovery.
>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
>> {25217069,ok}
>> 25 seconds
>>
>> Something funky is going on here. Investigating.
>> 1> timer:tc(couch_db_repair, make_lost_and_found,
>> ["multi_conflict_with_attach"]).
>> {654782,ok}
>> 0.6 seconds
>>
>> This db has 124969 documents in it.
>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
>> {1381969304,ok}
>> 23 minutes
>>
>> This database is about 500 MB, with 46660 documents before recovery
>> and 46801 after.
>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
>> {2329669113,ok}
>> 38.8 minutes
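>>
>> For reference, timer:tc/3 returns {Microseconds, Result}, so the
>> conversions above are e.g.:
>>
>>     {Micros, ok} = timer:tc(couch_db_repair, make_lost_and_found, ["prod"]),
>>     io:format("~.1f minutes~n", [Micros / 60000000]).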
>>
>> -Mikeal
>>
>> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>
>>> Good idea.  Now we've got
>>>
>>> > [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 1380102
>>> > [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 331526
>>> > [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526 bytes at 0
>>> > [info] [<0.33.0>] couch_db_repair writing 12 updates to lost+found/testwritesdb
>>> > [info] [<0.33.0>] couch_db_repair writing 9 updates to lost+found/testwritesdb
>>> > [info] [<0.33.0>] couch_db_repair writing 8 updates to lost+found/testwritesdb
>>>
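>>> Worth noting from those offsets: the scan walks backwards from the end
>>> of the file in 1 MiB windows (1380102 - 1048576 = 331526, then a final
>>> 331526-byte window at offset 0). A hedged sketch of that windowing
>>> (hypothetical scan_window/3; not the actual couch_db_repair code):
>>>
>>>     scan(Fd, Eof) when Eof =< 1048576 ->
>>>         scan_window(Fd, 0, Eof);
>>>     scan(Fd, Eof) ->
>>>         Pos = Eof - 1048576,
>>>         scan_window(Fd, Pos, 1048576),
>>>         scan(Fd, Pos).
>>>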
>>> Adam
>>>
>>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
>>>
>>> > It took 20 minutes before the first 'update' line came out, but now
>>> > seems to be recovering smoothly. Machine load is back down to sane
>>> > levels.
>>> >
>>> > Suggest feedback during the hunting phase.
>>> >
>>> > B.
>>> >
>>> > On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>> >> Thanks for the crosscheck.  I'm not aware of anything in the node
>>> >> finder that would cause it to struggle mightily with healthy DBs.  It
>>> >> pretty much ignores the health of the DB, in fact.  Would be
>>> >> interested to hear more.
>>> >>
>>> >> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
>>> >>
>>> >>> I verified the new code's ability to repair the testwritesdb. System
>>> >>> load was smooth from start to finish.
>>> >>>
>>> >>> I started a further test on a different (healthy) database and system
>>> >>> load was severe again, just collecting the roots (the lost+found db
>>> >>> was not yet created when I aborted the attempt). I suspect the fact
>>> >>> that it's healthy is the issue, so if I'm right, perhaps a warning is
>>> >>> useful.
>>> >>>
>>> >>> B.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>>> >>>> Another update.  This morning I took a different tack and, rather
>>> >>>> than try to find root nodes, I just looked for all kv_nodes in the
>>> >>>> file and treated each of those as a separate virtual DB to be
>>> >>>> replicated.  This reduces the algorithmic complexity of the repair,
>>> >>>> and it looks like testwritesdb repairs in ~30 minutes or so.  Also,
>>> >>>> this method results in the lost+found DB containing every document,
>>> >>>> not just the missing ones.
>>> >>>>
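>>> >>>> The shape of that scan, as a hedged sketch (hypothetical
>>> >>>> try_read_node/2 and naive byte-by-byte stepping; the real branch
>>> >>>> presumably reads through couch_file): walk the file, keep every
>>> >>>> offset that parses as a kv_node, then replicate from each candidate
>>> >>>> as its own virtual DB.
>>> >>>>
>>> >>>>     %% Collect offsets of everything that looks like a kv_node.
>>> >>>>     find_kv_nodes(_Fd, Pos, Eof, Acc) when Pos >= Eof ->
>>> >>>>         lists:reverse(Acc);
>>> >>>>     find_kv_nodes(Fd, Pos, Eof, Acc) ->
>>> >>>>         case try_read_node(Fd, Pos) of
>>> >>>>             {ok, {kv_node, _KVs}} ->
>>> >>>>                 find_kv_nodes(Fd, Pos + 1, Eof, [Pos | Acc]);
>>> >>>>             _ ->
>>> >>>>                 find_kv_nodes(Fd, Pos + 1, Eof, Acc)
>>> >>>>         end.
>>> >>>>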
>>> >>>> My branch does not currently include Randall's parallelization of
>>> >>>> the replications.  It's still CPU-limited, so that may be a
>>> >>>> worthwhile optimization.  On the other hand, I think we may be
>>> >>>> reaching a stage at which performance for this repair tool is 'good
>>> >>>> enough', and pmaps can make error handling a bit dicey.
>>> >>>>
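>>> >>>> To illustrate the dicey part, here is the classic minimal pmap (a
>>> >>>> sketch, not Randall's code): because of spawn_link, a crash in any
>>> >>>> worker takes the caller down with it unless the caller traps exits.
>>> >>>>
>>> >>>>     pmap(F, List) ->
>>> >>>>         Parent = self(),
>>> >>>>         Refs = [begin
>>> >>>>                     Ref = make_ref(),
>>> >>>>                     spawn_link(fun() -> Parent ! {Ref, F(X)} end),
>>> >>>>                     Ref
>>> >>>>                 end || X <- List],
>>> >>>>         [receive {Ref, Result} -> Result end || Ref <- Refs].
>>> >>>>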
>>> >>>> In short, I think this tool is now in good shape.
>>> >>>>
>>> >>>> http://github.com/kocolosk/couchdb/tree/db_repair
>>> >>>>
>>> >>
>>> >>
>>>
>>>
>>
>
