incubator-couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Carlson <e...@ericcarlson.co.uk>
Subject Re: Data loss
Date Sun, 08 Aug 2010 01:17:44 GMT
 Just tried it, and I can confirm that the new futon test fails without
the patch but succeeds with it. Also, Damien's manual method of
reproducing the problem loses data without the patch but everything
seems to work correctly with the patch.

-Eric

On 08/08/10 01:33, Randall Leeds wrote:
> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>
> Test and fix in separate commits at the end of that branch, based off
> current trunk.
> Would appreciate verification that the test is initially broken but
> fixed by the patch.
>
> On Sat, Aug 7, 2010 at 17:16, Damien Katz <damien@apache.org> wrote:
>> I reproduced this manually:
>>
>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say
2 secs).
>>
>> Attempt to create document "x" again, get conflict error.
>>
>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>
>> Now create document "y".
>>
>> Wait at least 2 secs because the delayed commit should happen
>>
>> Restart server.
>>
>> Document "y" is now missing.
>>
>> The last delayed commit isn't happening. From then on out, no docs updated with delayed
commit with be available after a restart.
>>
>> -Damien
>>
>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>
>>> I believe it's a single delayed conflict write attempt and no successes in that
same interval.
>>>
>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>
>>>> Looks like all that's necessary is a single delayed conflict write attempt,
and all subsequent delayed commits won't be commit, the header never gets written.
>>>>
>>>> 1.0 loses data. This is ridiculously bad.
>>>>
>>>> We need a test to reproduce this and fix.
>>>>
>>>> -Damien
>>>>
>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>
>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall,
your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious
bug.
>>>>>
>>>>> With respect to that gist 513282, I think it would be better to return
Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer()
command as you did.  After all, we did perform the check here -- it was just that nothing
needed to be committed.
>>>>>
>>>>> Adam
>>>>>
>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>
>>>>>> Yes, I think it requires 2 conflicting writes in row, because it
needs to trigger the delayed_commit timer without actually having anything to commit, so the
header never changes.
>>>>>>
>>>>>> Try to reproduce this and add a test case.
>>>>>>
>>>>>> -Damien
>>>>>>
>>>>>>
>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>
>>>>>>> I think you may be right, Damien.
>>>>>>> If ever a write happens that only contains conflicts while waiting
for
>>>>>>> a delayed commit message we might still be cancelling the timer.
Is
>>>>>>> this what you're thinking? This would be the fix:
>>>>>>> http://gist.github.com/513282
>>>>>>>
>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <damien@apache.org>
wrote:
>>>>>>>> I think the problem might be that 2 conflicting write attempts
in row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. One
that happens, the header may never be written, as it always thinks a delayed commit will fire
soon.
>>>>>>>>
>>>>>>>> -Damien
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <randall.leeds@gmail.com>
wrote:
>>>>>>>>>> I agree completely! I immediately thought of this
because I wrote that
>>>>>>>>>> change. I spent a while staring at it last night
but still can't
>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <damien@apache.org>
wrote:
>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging
further.
>>>>>>>>>>>
>>>>>>>>>>> -Damien
>>>>>>>>> I still want to stare at r954043, but it looks to me
like there's at
>>>>>>>>> least one situation where we do not commit data correctly
during
>>>>>>>>> compaction. This has to do with the way we now use the
path to sync
>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>> http://gist.github.com/513081
>>>>>>>>
>>

Mime
View raw message