Mailing-List: contact dev-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@couchdb.apache.org
Received-SPF: pass (nike.apache.org: domain of adam.kocoloski@gmail.com
 designates 209.85.216.173 as permitted sender)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Apple Message framework v1081)
Subject: Re: Data loss
From: Adam Kocoloski <kocolosk@apache.org>
In-Reply-To: <2CC2D0A5-C3D4-4535-857E-13279727F5F5@gmail.com>
Date: Sat, 7 Aug 2010 21:01:13 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <BC8C740C-6050-4773-AA80-AE33FF945F20@apache.org>
References: <AANLkTiku23xASr7UJZ2LQDSFwCbNH5gh5gZuUp8rZpT2@mail.gmail.com>
 <AANLkTinkeDKMQD3tcC0mi7LYKZr5589KDLTtqSaOF2v4@mail.gmail.com>
 <770C713F-BBA2-4E0C-B9BE-9441A053BCA4@apache.org>
 <AEC2839B-3A89-49C8-928C-D0D773D216B1@geek-it.de>
 <7C48F227-12CA-477E-9581-8E87EE4C1610@apache.org>
 <5A99E2CB-F53E-4435-8225-578946239068@apache.org>
 <AANLkTi==UuLebS1usBbyYe5v5xZea5Bn+YwEY4D+riuL@mail.gmail.com>
 <AANLkTinhd8koektKPxOzTQYXQkDnX6rtqwwB++41VWtv@mail.gmail.com>
 <E7016A59-CF52-4E29-A5BA-3100473F8FDF@apache.org>
 <AANLkTikDvN4wvC5Kh1P2-ufPWg=_B0WA-0zYV9YgTGBx@mail.gmail.com>
 <22319C7F-909D-47AD-94E3-F8C9C3369A9F@apache.org>
 <47434ED7-7E89-46E9-BF74-F4F9DFBF43AD@apache.org>
 <5711F16A-A8BB-499A-8DBA-AA02AF6E0BDC@apache.org>
 <C108FB0A-99A5-4249-A515-8BB14C39229F@apache.org>
 <874784AD-0EFB-4E9B-AAB9-D265B2D06D8F@apache.org>
 <AANLkTik03NmsWfFdCQuvmX9hD=5sH6+gT-HeQdypbkNS@mail.gmail.com>
 <36AA1959-96D2-42CF-8342-D7CD5D65206E@apache.org>
 <2CC2D0A5-C3D4-4535-857E-13279727F5F5@gmail.com>
To: dev@couchdb.apache.org

POSTing to /db/_ensure_full_commit will still cause a header to be written.

Switching to delayed_commits = false and then writing a document will cause a header to be written for that DB.

POSTing to /db/_ensure_full_commit for each DB and then flipping delayed_commits to false will put a 1.0.0 server into a safe state with all data saved.
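
For anyone who wants to script those two steps, here is a minimal sketch in Python, not a vetted tool. It assumes the server listens on http://127.0.0.1:5984, that no admin credentials are needed, that GET /_all_dbs returns the database names, and that the setting lives under the [couchdb] section so it can be changed with PUT /_config/couchdb/delayed_commits. The function names (remedy_requests, send, make_safe) are made up for illustration.

```python
import json
import urllib.request
from urllib.parse import quote


def remedy_requests(db_names):
    """Build the (method, path, body) requests: one full commit per DB,
    then a config change that turns delayed commits off."""
    plan = [
        # Database names may contain "/", so percent-encode the whole name.
        ("POST", "/" + quote(name, safe="") + "/_ensure_full_commit", None)
        for name in db_names
    ]
    # Assumption: config values are sent as JSON strings, hence '"false"'.
    plan.append(("PUT", "/_config/couchdb/delayed_commits", json.dumps("false")))
    return plan


def send(base_url, method, path, body=None):
    """Send one JSON request and return the decoded JSON response."""
    data = None if method == "GET" else (body or "").encode("utf-8")
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def make_safe(base_url="http://127.0.0.1:5984"):
    """Force a header write for every DB, then disable delayed commits."""
    db_names = send(base_url, "GET", "/_all_dbs")
    for method, path, body in remedy_requests(db_names):
        send(base_url, method, path, body)
```

Calling make_safe() against a running server performs the requests in order. remedy_requests() can be inspected on its own first to see exactly what would be sent.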

Adam

On Aug 7, 2010, at 8:57 PM, Chris Anderson wrote:

> Will switching a running 1.0 server to delayed_commits=true cause the noncommitted headers to be written? Are there other remedies for folks with critical data in 1.0 who want to ensure they are safe?
>
> Chris
>
> Typed on glass.
>
> On Aug 7, 2010, at 5:47 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>
>> Committed to trunk and 1.0.x.
>>
>> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>>
>>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>>
>>> Test and fix in separate commits at the end of that branch, based off
>>> current trunk.
>>> Would appreciate verification that the test is initially broken but
>>> fixed by the patch.
>>>
>>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <damien@apache.org> wrote:
>>>> I reproduced this manually:
>>>>
>>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>>>
>>>> Attempt to create document "x" again, get conflict error.
>>>>
>>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>>
>>>> Now create document "y".
>>>>
>>>> Wait at least 2 secs because the delayed commit should happen.
>>>>
>>>> Restart server.
>>>>
>>>> Document "y" is now missing.
>>>>
>>>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>>>>
>>>> -Damien
>>>>
>>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>>
>>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>>>
>>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>>
>>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>>>
>>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>>
>>>>>> We need a test to reproduce this and fix.
>>>>>>
>>>>>> -Damien
>>>>>>
>>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>>
>>>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>>>
>>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>>>
>>>>>>> Adam
>>>>>>>
>>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>>
>>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>>>
>>>>>>>> Try to reproduce this and add a test case.
>>>>>>>>
>>>>>>>> -Damien
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>>
>>>>>>>>> I think you may be right, Damien.
>>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>>> http://gist.github.com/513282
>>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <damien@apache.org> wrote:
>>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave #db.waiting_delayed_commit set even though the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>>>
>>>>>>>>>> -Damien
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <randall.leeds@gmail.com> wrote:
>>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <damien@apache.org> wrote:
>>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Damien
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>