Message-Id: <0269BADB-01E8-43FA-A5E2-FC2958A60134@gmail.com>
From: Paul J Davis
To: "dev@couchdb.apache.org"
Subject: Re: silent view index file corruption
Date: Tue, 6 Apr 2010 23:26:18 -0400
References: <95914055-5A4F-40A8-A2B4-5E576C774811@apache.org> <12613985-1D41-4D59-A2A1-F1D3D9A6F29A@gmail.com>

On Apr 6, 2010, at 11:20 PM, Adam Kocoloski wrote:

> On Apr 6, 2010, at 10:50 PM, Paul J Davis wrote:
>
>> This corruption was quite odd in that there wasn't a conspicuous reason for it. I didn't dive too deep into the whole thing, so it's possible I missed something obvious.
>
> The instance was unresponsive to ssh for 12 hours. The report from AWS Support was merely a "problem with the underlying host" followed by a recommendation to "launch a replacement at your earliest convenience". I don't know what the gremlins were doing behind the scenes, but I'm not surprised the files are corrupted :)
>

Yeah I don't think that we should worry about high energy particles flipping bits too much here.

>> There are two things at play here.
>> How proactive should we be in provoking these errors, and how much should we check for situations where our data file got trounced?
>>
>> The extreme proactive position would be equivalent to a full table scan per write, which is out of the question. So to some extent we won't be able to detect some errors until read time, which is an unknowable interval.
>
> I'm totally comfortable with only detecting them at read-time.
>
>> The other aspect is how rigorously we should check reads. This extreme would basically require a sha1 for every read or write no matter how small, not to mention the storage overhead. This part I'm not sure about. There's probably middle ground with crc sums and what not, but I don't see a clear answer.
>
> We currently store MD5 checksums with document bodies and validate them on reads. It hasn't proven to be an undue burden.
>

We do that for every doc body? Did not know that. Perhaps general append_term_md5 usage wouldn't be as big of a deal as I feared.

> Best, Adam
>
>> Basically, the question is how much should we attempt to detect when hardware lies. I reckon that there's probably a middle ground between reporting when an assumption is violated and full on table scans. Ideally such things would be fairly configurable but I sure don't see an obvious answer.
>>
>>
>> On Apr 6, 2010, at 10:06 PM, Randall Leeds wrote:
>>
>>> I immediately want to say 'ini file option' but I'm not sure whether to err
>>> on safety or speed.
>>>
>>> Maybe this is a good candidate for merkle trees or something else we can do
>>> throughout the view tree that might have less overhead than md5 summing all the
>>> nodes? After all, most inner nodes shouldn't change most of the time. Some
>>> incremental, cheap checksum might be a worthwhile *option*.
>>>
>>> On Apr 6, 2010 6:04 PM, "Adam Kocoloski" wrote:
>>>
>>> Hi all, we recently had an EC2 node go AWOL for about 12 hours.
>>> When it came back, we noticed after a few days that a number of the view indexes
>>> stored on that node were not updating. I did some digging into the error
>>> logs and with Paul's help pieced together what was going on. I won't bother
>>> you with all the gory details unless you ask for them, but the gist of it is
>>> that those files are corrupted.
>>>
>>> The troubling thing for me is that we only discovered the corruption when it
>>> completely broke the index updates. In one case, it did this by rearranging
>>> the bits so that couch_file thought that the btree node it was reading from
>>> disk had an associated MD5 checksum. It didn't (no btree nodes do), and so
>>> couch_file threw a file_corruption exception. But if the corruption had
>>> shown up in another part of the file I might never have known. In fact,
>>> some of the other indices on that node probably are silently corrupted.
>>>
>>> You might wonder how likely it is that a file becomes corrupted but still
>>> appears to be functioning. I checked the last modified timestamps for three
>>> broken files. One was last modified when the node went down, but the other
>>> two had timestamps in between the node's recovery and now. To me, that
>>> means that the view indexer was able to update those files for quite a while
>>> (~2 days) before it bumped into a part of the btree that was corrupted.
>>>
>>> I wonder what we should do about this. My first thought is to make it
>>> optional to write btree nodes (possibly only for view index files?) using
>>> append_term_md5 instead of append_term. It seems like a simple patch, but I
>>> don't know a priori what the performance hit would be. Other thoughts?
>>>
>>> Best, Adam
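
For anyone following the append_term vs. append_term_md5 question, here is a rough sketch of the trade-off in Python rather than Erlang. It is not couch_file's code or on-disk format: the record layout (one flag byte, a 4-byte length, an optional 16-byte MD5 digest, then a JSON body), the Python functions append_term and read_term, and the CorruptionError class are all assumptions made up for illustration. It only shows the general idea of storing a digest next to a term and validating it at read time.

```python
import hashlib
import io
import json
import struct


class CorruptionError(Exception):
    """Raised when a stored digest does not match the bytes read back."""


def append_term(f, term, with_md5=False):
    """Append a serialized term to the end of f and return its offset.

    Hypothetical layout: 1 flag byte, 4-byte big-endian body length,
    an optional 16-byte MD5 digest, then the JSON-encoded body.
    """
    body = json.dumps(term).encode("utf-8")
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    f.write(struct.pack(">BI", 1 if with_md5 else 0, len(body)))
    if with_md5:
        f.write(hashlib.md5(body).digest())
    f.write(body)
    return pos


def read_term(f, pos):
    """Read the term stored at pos, validating its digest if one is present."""
    f.seek(pos)
    flag, length = struct.unpack(">BI", f.read(5))
    if flag == 1:
        expected = f.read(16)
        body = f.read(length)
        if hashlib.md5(body).digest() != expected:
            raise CorruptionError("MD5 mismatch at offset %d" % pos)
    else:
        # No digest stored: a flipped bit in the body goes unnoticed
        # unless it happens to break deserialization.
        body = f.read(length)
    return json.loads(body.decode("utf-8"))
```

With this layout, flipping a single bit inside the body of a record written without a digest can hand back a different but perfectly parseable value, which is the silent case described above. The same flip in a record written with with_md5=True makes read_term raise CorruptionError. The cost is 16 extra bytes per record plus one hash per read and write.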