incubator-couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Noah Slater <nsla...@apache.org>
Subject Re: Data loss
Date Sun, 08 Aug 2010 03:03:30 GMT
Time to abort the vote then?

I'd like to get this fix into 1.0.1 if possible.

On 8 Aug 2010, at 02:28, Damien Katz wrote:

> Thanks.
> 
> Anyone up to create a repair tool for when this happens? It should be possible to find
the previous header, then find the most recent btree roots, find the high seq and apply them
to the header and commit. I'm thinking this would be a one time server upgrade script.
> 
> -Damien
> 
> 
> On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:
> 
>> Committed to trunk and 1.0.x.
>> 
>> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>> 
>>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>> 
>>> Test and fix in separate commits at the end of that branch, based off
>>> current trunk.
>>> Would appreciate verification that the test is initially broken but
>>> fixed by the patch.
>>> 
>>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <damien@apache.org> wrote:
>>>> I reproduced this manually:
>>>> 
>>>> Create document with id "x", ensure full commit (simply wait longer than
1 sec, say 2 secs).
>>>> 
>>>> Attempt to create document "x" again, get conflict error.
>>>> 
>>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>> 
>>>> Now create document "y".
>>>> 
>>>> Wait at least 2 secs because the delayed commit should happen
>>>> 
>>>> Restart server.
>>>> 
>>>> Document "y" is now missing.
>>>> 
>>>> The last delayed commit isn't happening. From then on out, no docs updated
with delayed commit with be available after a restart.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>> 
>>>>> I believe it's a single delayed conflict write attempt and no successes
in that same interval.
>>>>> 
>>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>> 
>>>>>> Looks like all that's necessary is a single delayed conflict write
attempt, and all subsequent delayed commits won't be commit, the header never gets written.
>>>>>> 
>>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>> 
>>>>>> We need a test to reproduce this and fix.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>> 
>>>>>>> Good sleuthing guys, and my apologies for letting this through.
 Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused
this serious bug.
>>>>>>> 
>>>>>>> With respect to that gist 513282, I think it would be better
to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer()
command as you did.  After all, we did perform the check here -- it was just that nothing
needed to be committed.
>>>>>>> 
>>>>>>> Adam
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>> 
>>>>>>>> Yes, I think it requires 2 conflicting writes in row, because
it needs to trigger the delayed_commit timer without actually having anything to commit, so
the header never changes.
>>>>>>>> 
>>>>>>>> Try to reproduce this and add a test case.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> I think you may be right, Damien.
>>>>>>>>> If ever a write happens that only contains conflicts
while waiting for
>>>>>>>>> a delayed commit message we might still be cancelling
the timer. Is
>>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>>> http://gist.github.com/513282
>>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <damien@apache.org>
wrote:
>>>>>>>>>> I think the problem might be that 2 conflicting write
attempts in row can leave the #db.waiting_delayed_commit set but the timer has been cancelled.
One that happens, the header may never be written, as it always thinks a delayed commit will
fire soon.
>>>>>>>>>> 
>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <randall.leeds@gmail.com>
wrote:
>>>>>>>>>>>> I agree completely! I immediately thought
of this because I wrote that
>>>>>>>>>>>> change. I spent a while staring at it last
night but still can't
>>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz
<damien@apache.org> wrote:
>>>>>>>>>>>>> SVN commit r954043 looks suspicious.
Digging further.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Damien
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I still want to stare at r954043, but it looks
to me like there's at
>>>>>>>>>>> least one situation where we do not commit data
correctly during
>>>>>>>>>>> compaction. This has to do with the way we now
use the path to sync
>>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
> 


Mime
View raw message