Subject: Re: Data loss
From: Damien Katz
Date: Sat, 7 Aug 2010 17:16:33 -0700
To: user@couchdb.apache.org
Cc: dev@couchdb.apache.org
Message-Id: <874784AD-0EFB-4E9B-AAB9-D265B2D06D8F@apache.org>

I reproduced this manually:

Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).

Attempt to create document "x" again, get conflict error.

Wait at least 2 secs to ensure the delayed commit attempt happens.

Now create document "y".

Wait at least 2 secs because the delayed commit should happen.

Restart server.

Document "y" is now missing.

The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.

-Damien

On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:

> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>
> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>
>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>
>> 1.0 loses data. This is ridiculously bad.
>>
>> We need a test to reproduce this and fix.
>>
>> -Damien
>>
>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>
>>> Good sleuthing guys, and my apologies for letting this through.
>>> Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>
>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did. After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>
>>> Adam
>>>
>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>
>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>
>>>> Try to reproduce this and add a test case.
>>>>
>>>> -Damien
>>>>
>>>>
>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>
>>>>> I think you may be right, Damien.
>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>> this what you're thinking? This would be the fix:
>>>>> http://gist.github.com/513282
>>>>>
>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz wrote:
>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>
>>>>>> -Damien
>>>>>>
>>>>>>
>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>
>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds wrote:
>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>> imagine how it's a problem.
>>>>>>>>
>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz wrote:
>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>
>>>>>>>>> -Damien
>>>>>>>>
>>>>>>>
>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>> least one situation where we do not commit data correctly during
>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>> http://gist.github.com/513081
>>>>>>
>>>>>>
>>>>
>>>
>>
>
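
The failure discussed in this thread can be illustrated with a toy model. The sketch below is plain Python, not CouchDB's Erlang code: the class ToyUpdater, its fields, and the fixed switch are hypothetical names invented for illustration. It assumes, as the thread suggests, that a conflict-only write still arms the delayed-commit timer, and that the buggy path leaves the waiting flag set when the timer fires and the header has not changed. The fixed path follows the suggestion to clear the flag when the headers match.

```python
class ToyUpdater:
    """Toy model of a delayed-commit updater (hypothetical, not CouchDB code)."""

    def __init__(self, fixed):
        self.fixed = fixed        # True: clear the flag when nothing needs committing
        self.memory = set()       # doc ids known to the running server
        self.disk = set()         # doc ids covered by the last written header
        self.waiting = False      # stands in for #db.waiting_delayed_commit
        self.timer_armed = False  # whether a timer message is really pending

    def create(self, doc_id):
        conflict = doc_id in self.memory
        if not conflict:
            self.memory.add(doc_id)
        # Assumption from the thread: even a conflict-only write schedules
        # a delayed commit if none is believed to be pending.
        if not self.waiting:
            self.waiting = True
            self.timer_armed = True
        return "conflict" if conflict else "ok"

    def wait(self):
        """Let the delayed-commit timer fire, if one is actually armed."""
        if not self.timer_armed:
            return
        self.timer_armed = False
        if self.memory != self.disk:
            # Header changed: write it and clear the flag.
            self.disk = set(self.memory)
            self.waiting = False
        elif self.fixed:
            # Nothing to commit, but the check was performed: clear the flag.
            self.waiting = False
        # Buggy path: flag stays set although no timer is pending any more.

    def restart(self):
        """Drop everything that was never written to disk."""
        self.memory = set(self.disk)
        self.waiting = False
        self.timer_armed = False


def y_survives_restart(fixed):
    """Replay the manual reproduction steps and report whether "y" survives."""
    db = ToyUpdater(fixed)
    db.create("x")   # succeeds
    db.wait()        # delayed commit writes the header
    db.create("x")   # conflict, still arms the timer
    db.wait()        # timer fires, header unchanged
    db.create("y")   # succeeds
    db.wait()        # buggy model: no timer was armed for this write
    db.restart()
    return "y" in db.memory
```

Calling y_survives_restart(False) replays the sequence against the buggy model and y_survives_restart(True) against the model with the flag reset, which makes the difference between the two behaviours easy to compare.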