couchdb-dev mailing list archives

From Paul J Davis <>
Subject Re: silent view index file corruption
Date Wed, 07 Apr 2010 03:26:18 GMT

On Apr 6, 2010, at 11:20 PM, Adam Kocoloski <> wrote:

> On Apr 6, 2010, at 10:50 PM, Paul J Davis wrote:
>> This corruption was quite odd in that there wasn't a conspicuous reason for it. 
I didn't dive too deep into the whole thing so it's possible I missed something obvious.
> The instance was unresponsive to ssh for 12 hours.  The report from AWS Support was merely
a "problem with the underlying host" followed by a recommendation to "launch a replacement
at your earliest convenience".  I don't know what the gremlins were doing behind the scenes,
but I'm not surprised the files are corrupted :)

Yeah, I don't think that we should worry too much about high energy particles flipping bits.

>> There are two things at play here.  How proactive should we be in provoking these
errors and how much should we check for situations where our data file got trounced.
>> The extreme proactive position would be equivalent to a full table scan per write
which is out of the question. So to some extent we won't be able to detect some errors until
read time which is an unknowable interval.
> I'm totally comfortable with only detecting them at read-time.
>> The other aspect is how rigorous should we check reads? This extreme would basically
require a sha1 for every read or write no matter how small, not to mention the storage overhead.
This part I'm not sure about. There's probably middle ground with CRC sums and whatnot, but
I don't see a clear answer.
> We currently store MD5 checksums with document bodies and validate them on reads.  It
hasn't proven to be an undue burden.

We do that for every doc body? Did not know that. Perhaps general append_term_md5 usage wouldn't
be as big of a deal as I feared.
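
For illustration only, here is a minimal Python sketch of what storing a checksum with each appended term and validating it at read time could look like. CouchDB is written in Erlang and this is not its on-disk format: the record layout (length prefix, JSON payload, MD5 digest), the CorruptionError class, and the pread_term_md5 helper are all assumptions made up for the example. Only the append_term_md5 name is borrowed from the thread.

```python
import hashlib
import io
import json
import struct


class CorruptionError(Exception):
    """Raised when a record read from disk fails its checksum."""


def append_term_md5(f, term):
    """Append a record: 4-byte length, JSON payload, 16-byte MD5 digest.

    Hypothetical layout for illustration. Returns the record's start offset.
    """
    payload = json.dumps(term).encode("utf-8")
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    f.write(struct.pack(">I", len(payload)))
    f.write(payload)
    f.write(hashlib.md5(payload).digest())
    return pos


def pread_term_md5(f, pos):
    """Read the record at pos and verify its MD5 before decoding it."""
    f.seek(pos)
    header = f.read(4)
    if len(header) != 4:
        raise CorruptionError("truncated length header")
    (length,) = struct.unpack(">I", header)
    payload = f.read(length)
    digest = f.read(16)
    if len(payload) != length or len(digest) != 16:
        raise CorruptionError("truncated record")
    if hashlib.md5(payload).digest() != digest:
        raise CorruptionError("checksum mismatch at offset %d" % pos)
    return json.loads(payload.decode("utf-8"))
```

The cost is one digest computation per read and per write plus 16 bytes per record, which is the overhead being weighed in this thread. Corruption is only noticed when the damaged record is actually read.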

> Best, Adam
>> Basically, the question is how much should we attempt to detect when hardware lies.
 I reckon that there's probably a middle ground between reporting when an assumption is violated and
full-on table scans. Ideally such things would be fairly configurable but I sure don't see
an obvious answer.
>> On Apr 6, 2010, at 10:06 PM, Randall Leeds <> wrote:
>>> I immediately want to say 'ini file option' but I'm not sure whether to err
>>> on safety or speed.
>>> Maybe this is a good candidate for merkle trees or something else we can do
>>> throughout the view tree that might have less overhead than md5 summing all the
>>> nodes? After all, most inner nodes shouldn't change most of the time. Some
>>> incremental, cheap checksum might be a worthwhile *option*.
>>> On Apr 6, 2010 6:04 PM, "Adam Kocoloski" <> wrote:
>>> Hi all, we recently had an EC2 node go AWOL for about 12 hours.  When it
>>> came back, we noticed after a few days that a number of the view indexes
>>> stored on that node were not updating.  I did some digging into the error
>>> logs and with Paul's help pieced together what was going on.  I won't bother
>>> you with all the gory details unless you ask for them, but the gist of it is
>>> that those files are corrupted.
>>> The troubling thing for me is that we only discovered the corruption when it
>>> completely broke the index updates.  In one case, it did this by rearranging
>>> the bits so that couch_file thought that the btree node it was reading from
>>> disk had an associated MD5 checksum. It didn't (no btree nodes do), and so
>>> couch_file threw a file_corruption exception.  But if the corruption had
>>> shown up in another part of the file I might never have known.  In fact,
>>> some of the other indices on that node probably are silently corrupted.
>>> You might wonder how likely it is that a file becomes corrupted but still
>>> appears to be functioning.  I checked the last modified timestamps for three
>>> broken files.  One was last modified when the node went down, but the other
>>> two had timestamps in between the node's recovery and now.  To me, that
>>> means that the view indexer was able to update those files for quite a while
>>> (~2 days) before it bumped into a part of the btree that was corrupted.
>>> I wonder what we should do about this.  My first thought is to make it
>>> optional to write  btree nodes (possibly only for view index files?) using
>>> append_term_md5 instead of append_term.  It seems like a simple patch, but I
>>> don't know a priori what the performance hit would be.  Other thoughts?
>>> Best, Adam
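
As a rough illustration of the merkle tree idea raised above, the following Python sketch gives every tree node a digest that covers its own keys and the digests of its children. This is an assumption-laden toy, not CouchDB's btree: the Node class, compute_digest, and verify are hypothetical names, and SHA-1 is chosen only because it was mentioned in the thread.

```python
import hashlib


def compute_digest(node):
    """Hash a node's keys together with the stored digests of its children."""
    h = hashlib.sha1()
    for key in node.keys:
        data = repr(key).encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))  # length prefix avoids ambiguity
        h.update(data)
    for child in node.children:
        h.update(child.digest)
    return h.digest()


class Node:
    """Toy tree node (hypothetical); the digest is fixed at construction."""

    def __init__(self, keys, children=()):
        self.keys = list(keys)
        self.children = list(children)
        self.digest = compute_digest(self)


def verify(node):
    """Return True if this node and every descendant match their digests."""
    return all(verify(c) for c in node.children) and (
        node.digest == compute_digest(node)
    )
```

With this layout, an update to one leaf only requires recomputing digests along the path from that leaf to the root, since untouched subtrees keep their stored digests. A full verify still walks the whole tree, so it would be an occasional check rather than something done per write.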
