From: Chris Anderson
To: "dev@couchdb.apache.org"
Cc: "dev@couchdb.apache.org"
Subject: Re: Data loss
Date: Sat, 7 Aug 2010 17:57:23 -0700
Message-Id: <2CC2D0A5-C3D4-4535-857E-13279727F5F5@gmail.com>
In-Reply-To: <36AA1959-96D2-42CF-8342-D7CD5D65206E@apache.org>

Will switching a running 1.0 server to delayed_commits=true cause the noncommitted headers to be written? Are there other remedies for folks with critical data in 1.0 who want to ensure they are safe?

Chris

Typed on glass.

On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:

> Committed to trunk and 1.0.x.
>
> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>
>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>
>> Test and fix in separate commits at the end of that branch, based off
>> current trunk.
>> Would appreciate verification that the test is initially broken but
>> fixed by the patch.
>>
>> On Sat, Aug 7, 2010 at 17:16, Damien Katz wrote:
>>> I reproduced this manually:
>>>
>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>>
>>> Attempt to create document "x" again, get conflict error.
>>>
>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>
>>> Now create document "y".
>>>
>>> Wait at least 2 secs, because the delayed commit should happen.
>>>
>>> Restart server.
>>>
>>> Document "y" is now missing.
>>>
>>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>>>
>>> -Damien
>>>
>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>
>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>>
>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>
>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>>
>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>
>>>>> We need a test to reproduce this and fix.
>>>>>
>>>>> -Damien
>>>>>
>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>
>>>>>> Good sleuthing guys, and my apologies for letting this through. Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>>
>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did. After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>>
>>>>>> Adam
>>>>>>
>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>
>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>>
>>>>>>> Try to reproduce this and add a test case.
>>>>>>>
>>>>>>> -Damien
>>>>>>>
>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>
>>>>>>>> I think you may be right, Damien.
>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>> http://gist.github.com/513282
>>>>>>>>
>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz wrote:
>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>>
>>>>>>>>> -Damien
>>>>>>>>>
>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds wrote:
>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz wrote:
>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>>
>>>>>>>>>>>> -Damien
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>> http://gist.github.com/513081
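
The failure mode discussed in this thread (a pending-commit flag that stays set after its timer has already fired with nothing to commit) can be illustrated with a toy model. The Python sketch below is not CouchDB code. ToyDb, write, fire_timer, restart and reproduce are hypothetical names, and the rule that any write attempt, including a conflict, arms the timer is a simplifying assumption taken from the descriptions in the thread. The reset_flag_when_unchanged switch stands in for the suggestion to return Db#db{waiting_delayed_commit=nil} when the headers match.

```python
class ToyDb:
    """Toy model of the delayed-commit state described in the thread.

    Hypothetical and heavily simplified; it only tracks which doc ids
    exist in memory versus which were committed to "disk".
    """

    def __init__(self, reset_flag_when_unchanged):
        self.docs = set()        # in-memory state
        self.committed = set()   # state that survives a restart
        self.waiting_delayed_commit = False  # "a commit is pending" flag
        self.timer_armed = False             # whether a timer really exists
        self.reset_flag_when_unchanged = reset_flag_when_unchanged

    def write(self, doc_id):
        """Attempt to create doc_id. Returns False on conflict."""
        ok = doc_id not in self.docs
        if ok:
            self.docs.add(doc_id)
        # Assumption: every write attempt arms the timer unless the flag
        # claims that a delayed commit is already pending.
        if not self.waiting_delayed_commit:
            self.waiting_delayed_commit = True
            self.timer_armed = True
        return ok

    def fire_timer(self):
        """Simulate the delayed-commit timer going off, if one is armed."""
        if not self.timer_armed:
            return
        self.timer_armed = False
        if self.docs == self.committed:
            # Nothing to commit. Without the reset, the flag stays set
            # although no timer exists any more.
            if self.reset_flag_when_unchanged:
                self.waiting_delayed_commit = False
            return
        self.committed = set(self.docs)
        self.waiting_delayed_commit = False

    def restart(self):
        """Simulate a server restart: only committed state survives."""
        self.docs = set(self.committed)
        self.waiting_delayed_commit = False
        self.timer_armed = False


def reproduce(fixed):
    """Follow the manual reproduction steps; True if "y" survives."""
    db = ToyDb(reset_flag_when_unchanged=fixed)
    db.write("x")
    db.fire_timer()      # "x" is fully committed
    db.write("x")        # conflict, nothing new to commit
    db.fire_timer()      # delayed commit attempt with unchanged state
    db.write("y")
    db.fire_timer()      # should commit "y"
    db.restart()
    return "y" in db.docs
```

In this model, reproduce(False) follows the reported sequence and loses "y" after the restart, while reproduce(True) keeps it. It says nothing about the separate compaction concern raised at the start of the thread.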