Subject: Re: Data loss
From: Adam Kocoloski
To: dev@couchdb.apache.org
Date: Sat, 7 Aug 2010 21:01:13 -0400

POSTing to /db/_ensure_full_commit will still cause a header to be written.

Switching to delayed_commits = false and then writing a document will cause a header to be written for that DB.

POSTing to /_ensure_full_commit for each DB and then flipping the delayed_commits to false will put a 1.0.0 server into a safe state with all data saved.

Adam

On Aug 7, 2010, at 8:57 PM, Chris Anderson wrote:

> Will switching a running 1.0 server to delayed_commits=true cause the noncommitted headers to be written? Are there other remedies for folks with critical data in 1.0 who want to ensure they are safe?
>
> Chris
>
> Typed on glass.
>
> On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:
>
>> Committed to trunk and 1.0.x.
>>
>> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>>
>>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>>
>>> Test and fix in separate commits at the end of that branch, based off
>>> current trunk.
>>> Would appreciate verification that the test is initially broken but
>>> fixed by the patch.
>>>
>>> On Sat, Aug 7, 2010 at 17:16, Damien Katz wrote:
>>>> I reproduced this manually:
>>>>
>>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>>>
>>>> Attempt to create document "x" again, get conflict error.
>>>>
>>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>>
>>>> Now create document "y".
>>>>
>>>> Wait at least 2 secs because the delayed commit should happen.
>>>>
>>>> Restart server.
>>>>
>>>> Document "y" is now missing.
>>>>
>>>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>>>>
>>>> -Damien
>>>>
>>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>>
>>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>>>
>>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>>
>>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>>>
>>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>>
>>>>>> We need a test to reproduce this and fix.
>>>>>>
>>>>>> -Damien
>>>>>>
>>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>>
>>>>>>> Good sleuthing guys, and my apologies for letting this through. Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>>>
>>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did. After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>>>
>>>>>>> Adam
>>>>>>>
>>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>>
>>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>>>
>>>>>>>> Try to reproduce this and add a test case.
>>>>>>>>
>>>>>>>> -Damien
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>>
>>>>>>>>> I think you may be right, Damien.
>>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>>> http://gist.github.com/513282
>>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz wrote:
>>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>>>
>>>>>>>>>> -Damien
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds wrote:
>>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz wrote:
>>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Damien
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>>> http://gist.github.com/513081
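
For anyone wanting to script the remedy from the top of this message (POST to _ensure_full_commit for each DB, then flip delayed_commits to false), here is a rough sketch. It is not from the thread. The function names are made up for illustration, and two things are assumptions on my part: that the list of databases comes from GET /_all_dbs, and that the setting lives at /_config/couchdb/delayed_commits. The thread only says "flipping the delayed_commits to false" and does not say how. Admin credentials are not handled.

```python
import json
import urllib.parse
import urllib.request


def plan_safe_state(db_names):
    """Build the ordered (method, path, body) requests for the remedy.

    Hypothetical helper: it only plans the requests, it sends nothing.
    """
    plan = []
    for name in db_names:
        # Force a header write for each database first.
        # Database names may contain "/", so percent-encode the whole name.
        path = "/" + urllib.parse.quote(name, safe="") + "/_ensure_full_commit"
        plan.append(("POST", path, None))
    # Only after every DB is committed, turn delayed commits off.
    # ASSUMPTION: config path and JSON-string value, not stated in the thread.
    plan.append(("PUT", "/_config/couchdb/delayed_commits", json.dumps("false")))
    return plan


def list_databases(base_url):
    """Fetch database names (ASSUMPTION: via GET /_all_dbs)."""
    with urllib.request.urlopen(base_url + "/_all_dbs") as resp:
        return json.loads(resp.read().decode("utf-8"))


def run(base_url, plan):
    """Send each planned request in order; raises on the first HTTP error."""
    for method, path, body in plan:
        data = body.encode("utf-8") if body is not None else None
        req = urllib.request.Request(
            base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

Against a local server this would be something like run(url, plan_safe_state(list_databases(url))) with url set to your own CouchDB address. The order matters: the commits go first so that every DB has its header written before the setting changes.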