couchdb-user mailing list archives

From stephen bartell <snbart...@gmail.com>
Subject Re: random couch crash
Date Tue, 07 Aug 2012 20:29:41 GMT
Hi Octavian, I usually tail -f the log while debugging on couch.  It was actually my coworker
who did the compact, determined it failed, and rebuilt the db.  I didn't observe the logs
during that process. The server is on the road right now.  Once it gets back I can grep the
log for details on that compact he attempted.


On Aug 7, 2012, at 1:21 PM, Octavian Damiean wrote:

> Hello Stephen,
> 
> Just "less" the log and let it wait for changes. That way you can inspect
> what it does.
> 
> Cheers, Octavian
> 
> On Tue, Aug 7, 2012 at 10:18 PM, stephen bartell <snbartell@gmail.com> wrote:
> 
>> we don't even "think" it started.  After starting compact we looked at the
>> status in futon and nothing came up.  The reason I say "think" is because
>> compact can happen too quickly for us to click over to status and watch it
>> start/end.  But for this db of this size it should have taken ~ 5-10 sec.
>> So we assumed it failed and went on to destroying/rebuilding the db.
>> 
>> 
>> On Aug 7, 2012, at 1:11 PM, Robert Newson wrote:
>> 
>>> 
>>> did compaction complete, though? I wasn't thinking of reducing the file
>> size, but of being able to successfully read all live data and write it
>> back out again.
>>> 
>>> B.
>>> 
>>> On 7 Aug 2012, at 21:01, stephen bartell wrote:
>>> 
>>>> I'll consider delayed_commits.
>>>> 
>>>> The database was 85MB before compaction. We ran compact and it was
>> still 85MB.  So compact didn't work.  The same db on other servers will
>> compact to ~1/10 of its original size.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> I strongly suggest disabling delayed_commits on general principles
>> (what's written should stay written). Are you able to compact the
>> database(s) that give this error?
>>>>> 
>>>>> B.
>>>>> 
>>>>> On 7 Aug 2012, at 18:42, stephen bartell wrote:
>>>>> 
>>>>>> delayed_commits = true
>>>>>> 
>>>>>> Stephen Bartell
>>>>>> 
>>>>>> On Aug 7, 2012, at 10:39 AM, Robert Newson wrote:
>>>>>> 
>>>>>>> Are you running with delayed_commits=true or false?
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>> On 7 Aug 2012, at 18:27, stephen bartell wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>>> Hi Stephen,
>>>>>>>>> 
>>>>>>>>> Can you tell us any more about the context, or did you start
>>>>>>>>> seeing these in the logs?
>>>>>>>> 
>>>>>>>> Sure, here's some context.  This couch is part of a demo server.
>>>>>>>> It travels a lot and is cycled a lot.  There is one physical
>>>>>>>> server; it consists of nginx (serving web apps and reverse
>>>>>>>> proxying for couch), couchdb for persistence, and numerous
>>>>>>>> programs which read and write to couch.  Traffic on couch can get
>>>>>>>> very heavy.
>>>>>>>> 
>>>>>>>> I didn't first see this in the logs.  Some of the web apps would
>>>>>>>> grind to a halt, nginx would return 404, and then eventually couch
>>>>>>>> would restart.  This would happen every couple of minutes.
>>>>>>>> 
>>>>>>>>> By chance do you have a scenario that reproduces this? Was this db
>>>>>>>>> compacted or replicated from elsewhere?
>>>>>>>> 
>>>>>>>> I wish I had a reproducible scenario other than sending the server
>> through taxi cabs, airlines, and pulling the power cord several times a
>> day.  We haven't seen this on any of our production servers.
>>>>>>>> This server was not subject to any replication.  Most databases on
>>>>>>>> it are compacted often.
>>>>>>>> 
>>>>>>>> Last night we were able to drill down to one particular program
>> which was triggering the crash.  One by one, we backed up, deleted, and
>> rebuilt the databases that program touched.  There was one database which
>> seemed to be the culprit, let's call it History.  History is a dumping
>> ground for stale docs from another db. History is almost always written to,
>> and rarely read from.   We don't compact History since all docs in it are
>> one revision deep.  We never replicate to or from it.  The only reason we
>> deem History the culprit is because after rebuilding it, there hasn't been
>> a crash for over 12 hours.
>>>>>>>> 
>>>>>>>> I have an additional question.  Is it possible to turn couch
>> logging off entirely, or would redirecting to /dev/null suffice?  When couch
>> would crash, hundreds of MB of crap would get dumped to the log. (
>> {{badmatch,{ok,<<32,50,48,48,10 … 'hundreds of MB of crap' … ,0,3,232>>}}).
>> Right when this dump occurred, the cpu spiked and the server began its
>> downward descent.
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Bob
>>>>>>>>> On Aug 7, 2012, at 2:06 AM, stephen bartell <snbartell@gmail.com>
>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi all, could someone help shed some light on this crash I'm
>>>>>>>>>> having?  I'm on v1.2, ubuntu 11.04.
>>>>>>>>>> 
>>>>>>>>>> [Mon, 06 Aug 2012 18:29:16 GMT] [error] [<0.492.0>] ** Generic server <0.492.0> terminating
>>>>>>>>>> ** Last message in was {pread_iolist,88385709}
>>>>>>>>>> ** When Server state == {file,{file_descriptor,prim_file,{#Port<0.2899>,79}},
>>>>>>>>>>                      93302896}
>>>>>>>>>> ** Reason for termination ==
>>>>>>>>>> ** {{badmatch,{ok,<<32,50,48,48,10 … huge dump … ,0,3,232>>}},
>>>>>>>>>> [{couch_file,read_raw_iolist_int,3},
>>>>>>>>>> {couch_file,maybe_read_more_iolist,4},
>>>>>>>>>> {couch_file,handle_call,3},
>>>>>>>>>> {gen_server,handle_msg,5},
>>>>>>>>>> {proc_lib,init_p_do_apply,3}]}
>>>>>>>>>> 
>>>>>>>>>> I'm not too familiar with erlang, but what I gathered from the
>>>>>>>>>> src is that the `pread_iolist` function is used when reading
>>>>>>>>>> anything from the disk.  So I think this might be a corrupt db
>>>>>>>>>> problem.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Stephen Bartell
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
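
The thread leaves open whether the compaction ever ran, because the Futon status page can miss a short job. One way to check without watching the status page is to read the database info document before and after. What follows is a minimal Python sketch, not something posted in this thread. It assumes the CouchDB 1.x HTTP API, where POST /{db}/_compact starts compaction and GET /{db} returns an info document with compact_running and disk_size fields. The base URL, the database name and every helper name here are hypothetical, and a server with admin accounts would also need credentials.

```python
import json
import time
import urllib.request


def db_info(base_url, db):
    """GET /{db} and return the parsed database info document."""
    with urllib.request.urlopen(f"{base_url}/{db}") as resp:
        return json.load(resp)


def start_compaction(base_url, db):
    """POST /{db}/_compact; CouchDB expects a JSON content type here."""
    req = urllib.request.Request(
        f"{base_url}/{db}/_compact",
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_compaction(get_info, poll_seconds=0.5, timeout=60.0, sleep=time.sleep):
    """Poll get_info() until compact_running is false, then return that info.

    get_info is any zero-argument callable returning an info document, so the
    loop can be exercised without a live server.
    """
    waited = 0.0
    while True:
        info = get_info()
        if not info.get("compact_running", False):
            return info
        if waited >= timeout:
            raise TimeoutError("compaction still running after timeout")
        sleep(poll_seconds)
        waited += poll_seconds


def compaction_report(before, after):
    """Compare disk_size from two info documents."""
    return {
        "before": before["disk_size"],
        "after": after["disk_size"],
        "shrunk": after["disk_size"] < before["disk_size"],
    }


def check_compaction(base_url, db):
    """Record the size, compact, wait, and report the size change."""
    before = db_info(base_url, db)
    start_compaction(base_url, db)
    after = wait_for_compaction(lambda: db_info(base_url, db))
    return compaction_report(before, after)
```

Under those assumptions, check_compaction("http://127.0.0.1:5984", "history") would return the two sizes and whether the file shrank. An unchanged size is not proof of failure on its own: a database whose docs are all one revision deep may have little to reclaim, so the log should still be checked for the compaction entries.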

