couchdb-dev mailing list archives

From "Robert Newson (JIRA)" <>
Subject [jira] [Commented] (COUCHDB-2040) Compaction fails when copying attachment
Date Wed, 29 Jan 2014 11:00:10 GMT


Robert Newson commented on COUCHDB-2040:

Ah, that's good news! The 12-doc difference: is that design docs, by any chance? You need admin
rights on the target to write ddocs, so a replication without that level of auth will silently
omit them. Otherwise, we should track that down. Is the target database missing the doc with the
corrupted attachment? Does it shrink much when you compact it?
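A quick way to check is to compare `doc_count` on source and target and list the design docs on each side. A sketch (the database name `ecrepo` comes from the report below; the URLs and the JSON values here are placeholders):

```shell
# Stand-in for the db-info JSON that GET /<db> returns
# (in practice: curl -s http://localhost:5984/ecrepo).
info='{"db_name":"ecrepo","doc_count":3988,"doc_del_count":12}'
echo "$info" | grep -o '"doc_count":[0-9]*'   # → "doc_count":3988

# Design doc ids sort between "_design/" and "_design0", so this key range
# lists only the ddocs; run it against both source and target and diff:
#   curl -s 'http://localhost:5984/ecrepo/_all_docs?startkey="_design/"&endkey="_design0"'
```

If the counts differ by exactly the number of missing ddocs, re-replicating with admin credentials on the target should close the gap.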

We earlier determined that your initial bug report was either a mismatch on MD5 or on attachment
chunk length, and now we know which one.
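To see which on a given attachment, you can recompute the digest yourself: CouchDB records an attachment digest as `md5-` followed by the base64 of the raw MD5 bytes. A sketch using a local stand-in file in place of a real downloaded attachment (the `att.bin` name and the URL are hypothetical):

```shell
# In practice you would first fetch the attachment, e.g.:
#   curl -s http://localhost:5984/ecrepo/doc/att.bin -o att.bin
printf 'hello' > att.bin   # stand-in for the downloaded attachment

# Rebuild CouchDB's digest format: "md5-" + base64(raw MD5 bytes).
local_digest="md5-$(openssl dgst -md5 -binary att.bin | base64)"
echo "$local_digest"   # → md5-XUFAKrxLKna5cZ2REBfFkg==

# Compare against the "digest" field in the doc's _attachments stanza;
# a mismatch means the bytes on disk no longer match what was written.
```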

This is probably a random bit flip on disk (something that RAID-5 cannot detect, but RAID-6
can).

Assuming your newly replicated database is working well, I think we're done here?

> Compaction fails when copying attachment
> ----------------------------------------
>                 Key: COUCHDB-2040
>                 URL:
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>            Reporter: Igor Klimer
> Original discussion from the user mailing list:
> Digest:
> During database compaction, the process fails at about 50% with the following error (CouchDB 1.2.0, Windows Server 2008 R2 Enterprise):
> After upgrading the server and CouchDB, the error is still the same (CouchDB 1.5.0, Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-33-generic x86_64)):
> There was one prior attempt at compaction that failed because of insufficient disk space:
> After this initial failure, I've made sure that there's sufficient disk space for the .compact file.
> The .compact file was always removed before trying compaction again.
> At the request of Robert Samuel Newson, I've also tried with an empty .compact file; the results were the same:
> Our I/O subsystem consists of some RAID5 arrays; the admins claim that they've been running error-free since inception ;) We have yet to run a parity check, since that would require taking the array offline and I'd rather not do that without exhausting other options.
> Config files from the 1.2.0/Windows server (since that's where the fault must have occurred):
> default.ini:
> local.ini:
> Other than delayed_commits being left at its default of true, there are no options set that could affect fsync()ing and such.
> I've run:
> curl localhost:5984/ecrepo/_changes?include_docs=true
> curl localhost:5984/ecrepo/_all_docs?include_docs=true
> and both calls succeeded, which would suggest that a faulty attachment (incorrect checksum or length) is at fault somewhere.

This message was sent by Atlassian JIRA
