incubator-couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Damien Katz <dam...@apache.org>
Subject Re: Data loss
Date Sun, 08 Aug 2010 01:28:58 GMT
Thanks.

Anyone up to create a repair tool for when this happens? It should be possible to find the
previous header, then find the most recent btree roots, find the high seq and apply them to
the header and commit. I'm thinking this would be a one time server upgrade script.

-Damien


On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:

> Committed to trunk and 1.0.x.
> 
> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
> 
>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>> 
>> Test and fix in separate commits at the end of that branch, based off
>> current trunk.
>> Would appreciate verification that the test is initially broken but
>> fixed by the patch.
>> 
>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <damien@apache.org> wrote:
>>> I reproduced this manually:
>>> 
>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec,
say 2 secs).
>>> 
>>> Attempt to create document "x" again, get conflict error.
>>> 
>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>> 
>>> Now create document "y".
>>> 
>>> Wait at least 2 secs because the delayed commit should happen
>>> 
>>> Restart server.
>>> 
>>> Document "y" is now missing.
>>> 
>>> The last delayed commit isn't happening. From then on out, no docs updated with
delayed commit with be available after a restart.
>>> 
>>> -Damien
>>> 
>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>> 
>>>> I believe it's a single delayed conflict write attempt and no successes in
that same interval.
>>>> 
>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>> 
>>>>> Looks like all that's necessary is a single delayed conflict write attempt,
and all subsequent delayed commits won't be commit, the header never gets written.
>>>>> 
>>>>> 1.0 loses data. This is ridiculously bad.
>>>>> 
>>>>> We need a test to reproduce this and fix.
>>>>> 
>>>>> -Damien
>>>>> 
>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>> 
>>>>>> Good sleuthing guys, and my apologies for letting this through. 
Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused
this serious bug.
>>>>>> 
>>>>>> With respect to that gist 513282, I think it would be better to return
Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer()
command as you did.  After all, we did perform the check here -- it was just that nothing
needed to be committed.
>>>>>> 
>>>>>> Adam
>>>>>> 
>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>> 
>>>>>>> Yes, I think it requires 2 conflicting writes in row, because
it needs to trigger the delayed_commit timer without actually having anything to commit, so
the header never changes.
>>>>>>> 
>>>>>>> Try to reproduce this and add a test case.
>>>>>>> 
>>>>>>> -Damien
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>> 
>>>>>>>> I think you may be right, Damien.
>>>>>>>> If ever a write happens that only contains conflicts while
waiting for
>>>>>>>> a delayed commit message we might still be cancelling the
timer. Is
>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>> http://gist.github.com/513282
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <damien@apache.org>
wrote:
>>>>>>>>> I think the problem might be that 2 conflicting write
attempts in row can leave the #db.waiting_delayed_commit set but the timer has been cancelled.
One that happens, the header may never be written, as it always thinks a delayed commit will
fire soon.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <randall.leeds@gmail.com>
wrote:
>>>>>>>>>>> I agree completely! I immediately thought of
this because I wrote that
>>>>>>>>>>> change. I spent a while staring at it last night
but still can't
>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <damien@apache.org>
wrote:
>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging
further.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Damien
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I still want to stare at r954043, but it looks to
me like there's at
>>>>>>>>>> least one situation where we do not commit data correctly
during
>>>>>>>>>> compaction. This has to do with the way we now use
the path to sync
>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 


Mime
View raw message