couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Corrupted database example file
Date Fri, 19 Apr 2013 20:20:31 GMT

Victor,

I finally remembered to ask a few of the ops guys I work with, while
they were online, about things to run to check for faulty hardware. The
general suggestions for detecting disk errors are first to check dmesg
and /var/log/messages for anything that looks amiss, and then to run
fsck to check the filesystem integrity and smartctl, which will let you
know if the disk thinks it's broken.
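
As a rough illustration of the first step (skimming dmesg and
/var/log/messages for anything amiss), here is a minimal Python sketch.
The pattern list is an assumption chosen for illustration, not an
exhaustive or authoritative set of disk-error messages, and the helper
names are hypothetical. fsck and smartctl still need to be run by hand.

```python
import re
import subprocess
from pathlib import Path

# Illustrative patterns only (an assumption): adjust them to whatever
# your kernel and disk driver actually log.
SUSPICIOUS = re.compile(
    r"I/O error|EXT4-fs error|medium error|bad sector|"
    r"failed command|remounting filesystem read-only",
    re.IGNORECASE,
)


def find_suspicious_lines(log_text):
    """Return the log lines that match any of the patterns above."""
    return [line for line in log_text.splitlines() if SUSPICIOUS.search(line)]


def read_kernel_log():
    """Return dmesg output, or an empty string if dmesg cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    sources = {"dmesg": read_kernel_log()}
    messages = Path("/var/log/messages")
    if messages.is_file():
        try:
            sources[str(messages)] = messages.read_text(errors="replace")
        except OSError:
            pass  # usually needs root; skip the file if it is unreadable
    for name, text in sources.items():
        for line in find_suspicious_lines(text):
            print(f"{name}: {line}")
```

An empty result does not prove the disk is healthy; it only means none
of the assumed patterns showed up in the logs that were readable.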

You may also want to run a RAM test on the machine. I'm told that most
BIOSes should have a utility for doing that these days. Otherwise
there's memtest86+, which is a downloadable ISO. They say that if you
can, just let that run overnight, and if the machine is frozen in the
morning you've found the issue.

HTH,
Paul Davis

On Thu, Apr 18, 2013 at 5:07 PM, Victor Nicollet <vnicollet@runorg.com> wrote:
> Replying to my own mail, hoping it will end up in the same thread (I was
> not fully subscribed when I posted this, but I still read the archives).
>
> Answers to the questions you asked :
>
>  - I have no idea when the issue happened. I will try to track it down in
> the logs. I'm afraid I don't have time to filter out all customer
> information from the logs and provide them to you, though I can certainly
> grep for error dumps if you want me to. I have never seen disk-related
> errors in the log.
>  - I am running Debian x86_64 GNU/Linux, with erlang 1:15.b.1-d
>  - There are no unusual CouchDB configuration options ; the only change I
> performed was to disable reduce_limit. A perhaps notable usage aspect : all
> the databases are compacted hourly.
>  - It's not NFS. From /etc/fstab :
>
> /dev/sda1       /       ext4    errors=remount-ro       0       1
> /dev/sda2       /home   ext4    defaults                0       2
>
> The dual-partition setup is a silly default from OVH (my dedicated server
> host), so I have /var/lib/couchdb as a symlink to /home/couchdb/lib, from
> sda1 to sda2.
>
> - I can't rule out a disk issue, because I don't have a lot of experience
> with those... any obvious diagnosis command you would like me to run ? I am
> certain that I have not run out of disk space, though (still around 1TB
> free on that drive).
>
> Thank you for your patience.
>
> On 18 April 2013 14:17, Victor Nicollet <vnicollet@runorg.com> wrote:
>
>> Hello,
>>
>> The @CouchDB twitter account thought you might find this information
>> helpful.
>>
>> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
>> having database corruption issues with version 1.2.0 : every few weeks, one
>> of our databases becomes corrupted, which has several negative consequences
>> (among others) :
>>
>>    - Replication of that database fails (it does not even start).
>>    - Compaction of that database fails and *freezes* the server.
>>    - Several documents in the database become inaccessible through either
>>    direct access or through _all_docs.
>>
>>  The latest affected database does not contain any information about our
>> customers, so I am allowed to release it publicly :
>>
>> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>>
>> This database contains 325 irretrievable documents between identifiers
>> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
>> I hope this helps,
>>
>> --
>> Victor Nicollet, CTO, www.runorg.com
>>
>
>
>
> --
> Victor Nicollet, Directeur Technique, www.runorg.com
