incubator-couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Damien Katz <dam...@apache.org>
Subject Re: Data loss
Date Sun, 08 Aug 2010 00:16:33 GMT
I reproduced this manually:

Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).

Attempt to create document "x" again, get conflict error.

Wait at least 2 secs to ensure the delayed commit attempt happens.

Now create document "y".

Wait at least 2 secs because the delayed commit should happen

Restart server.

Document "y" is now missing.

The last delayed commit isn't happening. From then on out, no docs updated with delayed commit
with be available after a restart.

-Damien

On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:

> I believe it's a single delayed conflict write attempt and no successes in that same
interval.
> 
> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
> 
>> Looks like all that's necessary is a single delayed conflict write attempt, and all
subsequent delayed commits won't be commit, the header never gets written.
>> 
>> 1.0 loses data. This is ridiculously bad.
>> 
>> We need a test to reproduce this and fix.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>> 
>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your
patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious
bug.
>>> 
>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil}
when the headers match instead of moving the cancel_timer() command as you did.  After all,
we did perform the check here -- it was just that nothing needed to be committed.
>>> 
>>> Adam
>>> 
>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>> 
>>>> Yes, I think it requires 2 conflicting writes in row, because it needs to
trigger the delayed_commit timer without actually having anything to commit, so the header
never changes.
>>>> 
>>>> Try to reproduce this and add a test case.
>>>> 
>>>> -Damien
>>>> 
>>>> 
>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>> 
>>>>> I think you may be right, Damien.
>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>> this what you're thinking? This would be the fix:
>>>>> http://gist.github.com/513282
>>>>> 
>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <damien@apache.org> wrote:
>>>>>> I think the problem might be that 2 conflicting write attempts in
row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. One that
happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <randall.leeds@gmail.com>
wrote:
>>>>>>>> I agree completely! I immediately thought of this because
I wrote that
>>>>>>>> change. I spent a while staring at it last night but still
can't
>>>>>>>> imagine how it's a problem.
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <damien@apache.org>
wrote:
>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>> 
>>>>>>> 
>>>>>>> I still want to stare at r954043, but it looks to me like there's
at
>>>>>>> least one situation where we do not commit data correctly during
>>>>>>> compaction. This has to do with the way we now use the path to
sync
>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>> http://gist.github.com/513081
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
> 


Mime
View raw message