couchdb-dev mailing list archives

From Wendall Cada <wenda...@apache.org>
Subject Re: Corrupted database example file
Date Fri, 19 Apr 2013 20:07:23 GMT
Thanks for the feedback on this, Paul. This is outside my area of
expertise, so it's nice to know that this isn't symptomatic of using
delayed_commits = true.

I also agree that this appears to be a hardware issue, and the only way 
to confirm would be to mirror the setup on some separate hardware and 
see if the issue persists.

Wendall
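
For anyone who wants to confirm which delayed_commits value a node is
actually running with, here is a minimal Python sketch. It assumes the
CouchDB 1.x ini layout, where the option lives in the [couchdb] section
and later files (local.ini) override earlier ones (default.ini). The
function name is made up for illustration, and the fallback of true is
an assumption based on the stock default.ini.

```python
import configparser


def delayed_commits_enabled(*ini_texts):
    """Return the effective delayed_commits value from ini file contents.

    ini_texts are the contents of the ini files in load order
    (for example default.ini first, then local.ini). Later texts
    override earlier ones.
    """
    parser = configparser.ConfigParser(
        inline_comment_prefixes=(";",),  # CouchDB ini files use ';' comments
        interpolation=None,              # do not treat '%' specially
        strict=False,                    # tolerate repeated sections/options
    )
    for text in ini_texts:
        parser.read_string(text)
    # Assumption: when the option is absent, fall back to true,
    # which is what the stock 1.x default.ini sets.
    return parser.getboolean("couchdb", "delayed_commits", fallback=True)
```

Pass the contents of default.ini first and local.ini second; the last
file that sets the option wins.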

On 04/19/2013 12:40 PM, Paul Davis wrote:
> Doubtful that delayed commits would cause this. This isn't a matter of
> reordered writes or some writes not making it to disk. The binary
> would've been pushed towards disk in a single write request and the
> corruption appears to be in the middle of valid data which is a bit
> weird.
>
> My guess is this was either corrupted in RAM somehow before it hit
> disk or somehow the disk is returning bad reads. I've seen similar
> things before that end up preceding disk death but I'm also running a
> comparatively older code base (most importantly, no snappy).
>
> On Fri, Apr 19, 2013 at 11:42 AM, Wendall Cada <wendallc@apache.org> wrote:
>> If using the defaults, isn't this still set to delayed_commits = true? Can't
>> this lead to just this type of data corruption? I'd like to see
>> delayed_commits = false and see if this is still happening.
>>
>> I'd also be keen on seeing this data replicated to a different piece of
>> hardware with the same compaction schedule and see if the issue persists.
>> I'm inclined to point the finger at a hard disk issue, but would like to see
>> some confirmation that this can be reproduced with the same exact code on
>> different hardware.
>>
>> I've run this same version heavily in production on several different
>> systems doing essentially the same thing and have never seen data
>> corruption. The main difference is that I always use delayed_commits = false.
>>
>> Wendall
>>
>>
>> On 04/19/2013 01:31 AM, Dave Cottlehuber wrote:
>>> On 19 April 2013 00:41, Victor Nicollet <vnicollet@runorg.com> wrote:
>>>> I searched the logs for any signs of error. The operations performed on
>>>> the prod-folder database in the two hours before the first crash were:
>>>>
>>>> https://gist.github.com/VictorNicollet/878d0176960cc71d9ac1
>>>>
>>>> The compact at 10:54:08 finished without a hitch.
>>>> The compact at 11:54:07 finished with:
>>>>
>>>> https://gist.github.com/VictorNicollet/4d6ccd60bec2ae922a32
>>>>
>>> Hi Victor,
>>>
>>> thanks for that information.
>>>
>>> Can we get a working copy of the database, so we can compare the
>>> corrupt compressed documents with the working ones and see if there's
>>> any pattern?
>>>
>>> I recommend you assume there's some storage system issue and:
>>>
>>> - check dmesg / syslog for disk related errors
>>> - fsck the filesystem where the couches are
>>> - if this is a managed / hosted server you might want to get the
>>> supplier to check if there are any disk / storage issues
>>> - if it's not virtualised hardware, see if smartmontools tells you
>>> anything useful
>>>
>>> If you wish, you can encrypt files using my public key,
>>> http://www.apache.org/dist/couchdb/KEYS dch@ apache.org.
>>>
>>> A+
>>> Dave
>>
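
Dave's first suggestion (checking dmesg / syslog for disk related
errors) can be partly automated. Below is a rough Python sketch that
filters log lines against a few patterns. The patterns are illustrative
guesses at common Linux kernel messages, not an exhaustive or
authoritative list, and the function name is hypothetical.

```python
import re

# Assumed patterns for disk-related kernel messages; extend as needed.
DISK_ERROR_PATTERNS = [
    r"I/O error",
    r"medium error",
    r"EXT[234]-fs error",
    r"\bata\d+(\.\d+)?: .*(error|failed)",
]

_DISK_ERROR_RE = re.compile(
    "|".join(f"(?:{p})" for p in DISK_ERROR_PATTERNS),
    re.IGNORECASE,
)


def find_disk_errors(lines):
    """Return the log lines that match any of the assumed patterns."""
    return [line.rstrip("\n") for line in lines if _DISK_ERROR_RE.search(line)]
```

Feed it the lines of dmesg output or of a syslog file. An empty result
does not prove the disk is healthy, so fsck and smartmontools are still
worth running.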

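To look for a pattern between a corrupt document body and a known good
copy, as Dave asks above, a byte-level comparison is one place to
start. This sketch assumes both versions have already been extracted as
raw bytes by some other means; the helper names are hypothetical.

```python
def diff_runs(good, bad):
    """Return (start, end) offsets of contiguous differing runs.

    Offsets are end-exclusive and only cover the common prefix length
    of the two byte strings.
    """
    runs = []
    start = None
    n = min(len(good), len(bad))
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i          # a new differing run begins here
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n))    # run extends to the end of the overlap
    return runs


def flipped_bits(good, bad):
    """Count differing bits over the overlapping part of both inputs."""
    return sum(bin(g ^ b).count("1") for g, b in zip(good, bad))
```

As a heuristic only: a handful of short runs with very few flipped bits
would fit Paul's corrupted-in-RAM guess better than long garbled runs
would, but neither outcome is conclusive on its own.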
