couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Corrupted database example file
Date Fri, 19 Apr 2013 19:40:37 GMT
It's doubtful that delayed commits would cause this. This isn't a matter
of reordered writes or some writes not making it to disk. The binary
would've been pushed towards disk in a single write request, and the
corruption appears to be in the middle of valid data, which is a bit
weird.

My guess is that this was either corrupted in RAM somehow before it hit
disk, or the disk is returning bad reads. I've seen similar things
before that ended up preceding disk death, but I'm also running a
comparatively older code base (most importantly, no snappy).
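
One rough way to look for unstable reads is to hash the same database file
several times and compare the digests. This is only a sketch: the helper
names and the number of passes are made up for illustration, and because the
OS page cache can serve repeated reads from memory, matching digests don't
prove the disk is healthy. A mismatch is the interesting result.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reads_are_stable(path, passes=3):
    """Hash the file `passes` times; True if every pass gave the same digest.

    Hypothetical helper, not part of CouchDB. Run it against a copy of the
    .couch file taken while nothing is writing to it.
    """
    digests = {file_digest(path) for _ in range(passes)}
    return len(digests) == 1
```

Comparing `file_digest` of the same file copied to a second machine would
similarly show whether the two copies actually hold the same bytes.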

On Fri, Apr 19, 2013 at 11:42 AM, Wendall Cada <wendallc@apache.org> wrote:
> If using the defaults, isn't this still set to delayed_commits = true? Can't
> this lead to just this type of data corruption? I'd like to see
> delayed_commits = false and see if this is still happening.
>
> I'd also be keen on seeing this data replicated to a different piece of
> hardware with the same compaction schedule and see if the issue persists.
> I'm inclined to point the finger at a hard disk issue, but would like to see
> some confirmation that this can be reproduced with the same exact code on
> different hardware.
>
> I've run this same version heavily in production on several different
> systems doing essentially the same thing and have never seen a data
> corruption. The main difference is I always use delayed_commits = false.
>
> Wendall
>
>
> On 04/19/2013 01:31 AM, Dave Cottlehuber wrote:
>>
>> On 19 April 2013 00:41, Victor Nicollet <vnicollet@runorg.com> wrote:
>>>
>>> I searched the logs for any signs of error. The operations performed on the
>>> prod-folder database in the two hours before the first crash were:
>>>
>>> https://gist.github.com/VictorNicollet/878d0176960cc71d9ac1
>>>
>>> The compact at 10:54:08 finished without a hitch.
>>> The compact at 11:54:07 finished with :
>>>
>>> https://gist.github.com/VictorNicollet/4d6ccd60bec2ae922a32
>>>
>> Hi Victor,
>>
>> thanks for that information.
>>
>> Can we get a working copy of the database, so we can compare the
>> corrupt compressed documents with the working ones and see if there's
>> any pattern?
>>
>> I recommend you assume there's some storage system issue and:
>>
>> - check dmesg / syslog for disk related errors
>> - fsck the filesystem where the couches are
>> - if this is a managed / hosted server you might want to get the
>> supplier to check if there are any disk / storage issues
>> - if it's not virtualised hardware, see if smartmontools tells you
>> anything useful
>>
>> If you wish, you can encrypt files using my public key,
>> http://www.apache.org/dist/couchdb/KEYS dch@ apache.org.
>>
>> A+
>> Dave
>
>
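
The dmesg / syslog check suggested in the quoted checklist can be roughed out
as a small filter. This is a sketch under assumptions: the patterns below are
a few common Linux kernel message shapes chosen for illustration, not an
exhaustive or authoritative list, and `find_disk_errors` is a made-up name.

```python
import re

# Illustrative patterns only; real kernel messages vary by driver and version.
DISK_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"medium error", re.IGNORECASE),
    re.compile(r"\bata\d+(\.\d+)?: .*(error|failed)", re.IGNORECASE),
    re.compile(r"EXT[234]-fs error", re.IGNORECASE),
]


def find_disk_errors(lines):
    """Return the log lines that match any of the disk-error patterns."""
    hits = []
    for line in lines:
        if any(p.search(line) for p in DISK_ERROR_PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits
```

Feeding it the lines of a saved `dmesg` dump or a syslog file (the exact log
location depends on the distribution) gives a quick shortlist to read by hand;
an empty result does not rule out a storage problem.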
