couchdb-dev mailing list archives

From Filipe David Manana <fdman...@apache.org>
Subject Re: data recovery tool progress
Date Wed, 11 Aug 2010 17:05:20 GMT
On Wed, Aug 11, 2010 at 3:52 AM, Adam Kocoloski <kocolosk@apache.org> wrote:

> Excellent, thanks for testing.  I caught Jason Smith saying on IRC that he
> had packaged the whole thing up as an escript + some .beams.  If we can get
> it down to a single file a la rebar that would be a pretty sweet way to
> deliver the repair tool in my opinion.
>

+1


>
> Adam
>
> On Aug 10, 2010, at 10:40 PM, Mikeal Rogers wrote:
>
> > Ok, latest code has been tested against every db that I have and it works
> > great.
> >
> > What are our next steps here?
> >
> > I'd like to get this out to all the people who didn't feel comfortable
> > sending me their db to test against before we release it more widely.
> >
> > -Mikeal
> >
> > On Tue, Aug 10, 2010 at 6:11 PM, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:
> >
> >> Found one issue: we weren't picking up design docs because it didn't
> >> have admin privileges.
> >>
> >> Adam fixed it and pushed and I've verified that it works now.
> >>
> >> I wrote a little node script to show all recovered documents and expose
> >> any documents that didn't make it into lost+found.
> >>
> >> http://github.com/mikeal/couchtest/blob/master/validate.js
> >>
> >> Requires request, `npm install request`.
> >>
> >> I'm now running recover on all the test dbs I have and running the
> >> validation script against them.
> >>
> >> -Mikeal
> >>
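The check described above (list what was recovered and flag anything that did not reach lost+found) comes down to a set difference over document IDs. The following is a minimal sketch in plain Node, not the code in validate.js: it assumes the comparison is between the source database and lost+found, the name `findMissingIds` and the sample IDs are hypothetical, and fetching the ID lists from CouchDB is left out.

```javascript
// Hypothetical helper, not taken from validate.js.
// Returns the IDs present in the source list but absent from lost+found.
function findMissingIds(sourceIds, recoveredIds) {
  const recovered = new Set(recoveredIds);
  return sourceIds.filter((id) => !recovered.has(id));
}

// Illustrative data only: a design doc that was not recovered.
const sourceIds = ['doc-a', 'doc-b', '_design/app'];
const recoveredIds = ['doc-a', 'doc-b'];
console.log(findMissingIds(sourceIds, recoveredIds));
```

A missing design doc, as in the privileges issue above, would show up in the returned list.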
> >>
> >> On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers <mikeal.rogers@gmail.com> wrote:
> >>
> >>> I have some timing numbers for the new code.
> >>>
> >>> multi_conflict has 200 lost documents and 201 documents total after
> >>> recovery.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
> >>> {25217069,ok}
> >>> 25 seconds
> >>>
> >>> Something funky is going on here. Investigating.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found,
> >>> ["multi_conflict_with_attach"]).
> >>> {654782,ok}
> >>> .6 seconds
> >>>
> >>> This db has 124969 documents in it.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
> >>> {1381969304,ok}
> >>> 23 minutes
> >>>
> >>> This database is about 500 MB, with 46660 documents before recovery and
> >>> 46801 after.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
> >>> {2329669113,ok}
> >>> 38.8 minutes
> >>>
> >>> -Mikeal
> >>>
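Erlang's timer:tc returns a {Time, Value} tuple with Time in microseconds, which is how the raw figures above map to the quoted durations. A small sketch of that conversion in Node follows; the function name `formatMicros` and the one-decimal rounding are illustrative choices, not part of the repair tool.

```javascript
// Convert the microsecond count from a timer:tc result into a readable duration.
// formatMicros is a hypothetical helper name.
function formatMicros(us) {
  const seconds = us / 1e6;
  if (seconds < 60) {
    return seconds.toFixed(1) + ' seconds';
  }
  return (seconds / 60).toFixed(1) + ' minutes';
}

// The four measurements quoted in the message above.
for (const us of [25217069, 654782, 1381969304, 2329669113]) {
  console.log(us, '->', formatMicros(us));
}
```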
> >>> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
> >>>
> >>>> Good idea.  Now we've got
> >>>>
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 1380102
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 331526
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526 bytes at 0
> >>>>> [info] [<0.33.0>] couch_db_repair writing 12 updates to lost+found/testwritesdb
> >>>>> [info] [<0.33.0>] couch_db_repair writing 9 updates to lost+found/testwritesdb
> >>>>> [info] [<0.33.0>] couch_db_repair writing 8 updates to lost+found/testwritesdb
> >>>>
> >>>> Adam
> >>>>
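The offsets in the log excerpt above are consistent with a scan that walks backward through the file in 1048576-byte (1 MiB) chunks and finishes with a shorter chunk ending at offset 0. That reading is inferred from the three log lines alone, not from the couch_db_repair source, so the sketch below is only an illustration: `backwardChunks` is a hypothetical name, and the end offset 2428678 is simply 1380102 + 1048576, the end of the first chunk shown.

```javascript
// Hypothetical sketch: walk backward from endOffset in fixed-size chunks,
// ending with a possibly shorter chunk that starts at offset 0.
function backwardChunks(endOffset, chunkSize) {
  const chunks = [];
  let end = endOffset;
  while (end > 0) {
    const offset = Math.max(0, end - chunkSize);
    chunks.push({ offset, length: end - offset });
    end = offset;
  }
  return chunks;
}

// 2428678 = 1380102 + 1048576, the end of the first chunk in the excerpt.
for (const { offset, length } of backwardChunks(2428678, 1048576)) {
  console.log(`scanning ${length} bytes at ${offset}`);
}
```

Run with these inputs, the loop yields the same three offset and length pairs as the scanning lines in the excerpt.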
> >>>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
> >>>>
> >>>>> It took 20 minutes before the first 'update' line came out, but now
> >>>>> seems to be recovering smoothly. Machine load is back down to sane
> >>>>> levels.
> >>>>>
> >>>>> Suggest feedback during the hunting phase.
> >>>>>
> >>>>> B.
> >>>>>
> >>>>> On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
> >>>>>> Thanks for the crosscheck.  I'm not aware of anything in the node
> >>>>>> finder that would cause it to struggle mightily with healthy DBs.  It
> >>>>>> pretty much ignores the health of the DB, in fact.  Would be interested
> >>>>>> to hear more.
> >>>>>>
> >>>>>> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
> >>>>>>
> >>>>>>> I verified the new code's ability to repair the testwritesdb. System
> >>>>>>> load was smooth from start to finish.
> >>>>>>>
> >>>>>>> I started a further test on a different (healthy) database and system
> >>>>>>> load was severe again, just collecting the roots (the lost+found db
> >>>>>>> was not yet created when I aborted the attempt). I suspect the fact
> >>>>>>> that it's healthy is the issue, so if I'm right, perhaps a warning is
> >>>>>>> useful.
> >>>>>>>
> >>>>>>> B.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
> >>>>>>>> Another update.  This morning I took a different tack and, rather
> >>>>>>>> than try to find root nodes, I just looked for all kv_nodes in the
> >>>>>>>> file and treated each of those as a separate virtual DB to be
> >>>>>>>> replicated.  This reduces the algorithmic complexity of the repair,
> >>>>>>>> and it looks like testwritesdb repairs in ~30 minutes or so.  Also,
> >>>>>>>> this method results in the lost+found DB containing every document,
> >>>>>>>> not just the missing ones.
> >>>>>>>>
> >>>>>>>> My branch does not currently include Randall's parallelization of
> >>>>>>>> the replications.  It's still CPU-limited, so that may be a worthwhile
> >>>>>>>> optimization.  On the other hand, I think we may be reaching a stage
> >>>>>>>> at which performance for this repair tool is 'good enough', and pmaps
> >>>>>>>> can make error handling a bit dicey.
> >>>>>>>>
> >>>>>>>> In short, I think this tool is now in good shape.
> >>>>>>>>
> >>>>>>>> http://github.com/kocolosk/couchdb/tree/db_repair
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
>
>


-- 
Filipe David Manana,
fdmanana@gmail.com, fdmanana@apache.org

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
