incubator-couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <kocol...@apache.org>
Subject Re: Data loss
Date Sun, 08 Aug 2010 00:47:49 GMT
Committed to trunk and 1.0.x.

On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:

> http://github.com/tilgovi/couchdb/tree/fixlostcommits
> 
> Test and fix in separate commits at the end of that branch, based off
> current trunk.
> Would appreciate verification that the test is initially broken but
> fixed by the patch.
> 
> On Sat, Aug 7, 2010 at 17:16, Damien Katz <damien@apache.org> wrote:
>> I reproduced this manually:
>> 
>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say
2 secs).
>> 
>> Attempt to create document "x" again, get conflict error.
>> 
>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>> 
>> Now create document "y".
>> 
>> Wait at least 2 secs because the delayed commit should happen
>> 
>> Restart server.
>> 
>> Document "y" is now missing.
>> 
>> The last delayed commit isn't happening. From then on out, no docs updated with delayed
commit with be available after a restart.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>> 
>>> I believe it's a single delayed conflict write attempt and no successes in that
same interval.
>>> 
>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>> 
>>>> Looks like all that's necessary is a single delayed conflict write attempt,
and all subsequent delayed commits won't be commit, the header never gets written.
>>>> 
>>>> 1.0 loses data. This is ridiculously bad.
>>>> 
>>>> We need a test to reproduce this and fix.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>> 
>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall,
your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious
bug.
>>>>> 
>>>>> With respect to that gist 513282, I think it would be better to return
Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer()
command as you did.  After all, we did perform the check here -- it was just that nothing
needed to be committed.
>>>>> 
>>>>> Adam
>>>>> 
>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>> 
>>>>>> Yes, I think it requires 2 conflicting writes in row, because it
needs to trigger the delayed_commit timer without actually having anything to commit, so the
header never changes.
>>>>>> 
>>>>>> Try to reproduce this and add a test case.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>> 
>>>>>>> I think you may be right, Damien.
>>>>>>> If ever a write happens that only contains conflicts while waiting
for
>>>>>>> a delayed commit message we might still be cancelling the timer.
Is
>>>>>>> this what you're thinking? This would be the fix:
>>>>>>> http://gist.github.com/513282
>>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <damien@apache.org>
wrote:
>>>>>>>> I think the problem might be that 2 conflicting write attempts
in row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. One
that happens, the header may never be written, as it always thinks a delayed commit will fire
soon.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <randall.leeds@gmail.com>
wrote:
>>>>>>>>>> I agree completely! I immediately thought of this
because I wrote that
>>>>>>>>>> change. I spent a while staring at it last night
but still can't
>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <damien@apache.org>
wrote:
>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging
further.
>>>>>>>>>>> 
>>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I still want to stare at r954043, but it looks to me
like there's at
>>>>>>>>> least one situation where we do not commit data correctly
during
>>>>>>>>> compaction. This has to do with the way we now use the
path to sync
>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>> http://gist.github.com/513081
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 


Mime
View raw message