couchdb-user mailing list archives

From stephen bartell <snbart...@gmail.com>
Subject Re: random couch crash
Date Tue, 07 Aug 2012 20:18:54 GMT
We don't even "think" it started.  After starting compaction we looked at the status page in Futon
and nothing came up.  The reason I say "think" is that compaction can happen too quickly for
us to click over to the status page and watch it start and end.  But for a db of this size it should
have taken ~5-10 sec.  So we assumed it failed and went on to destroying and rebuilding the db.
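
Since a fast compaction is easy to miss in Futon, one way to check it without watching the status page is to ask the HTTP API directly. The following is only a minimal sketch, assuming a CouchDB 1.x node where `GET /{db}` reports `compact_running` and `disk_size`, and where `POST /{db}/_compact` is permitted for the caller (it may need admin credentials). The base URL, the database name, and all function names are hypothetical.

```python
import json
import time
import urllib.request


def start_compaction(base_url, db):
    """POST /{db}/_compact and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/{db}/_compact",
        data=b"",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_db_info(base_url, db):
    """GET /{db} and return the database info document as a dict."""
    with urllib.request.urlopen(f"{base_url}/{db}") as resp:
        return json.load(resp)


def wait_for_compaction(get_info, poll_seconds=0.5, timeout=60.0, sleep=time.sleep):
    """Poll get_info() until compact_running is false, then return that info dict.

    get_info is any zero-argument callable returning the db info dict, so the
    polling loop can be exercised without a running server.
    """
    deadline = time.monotonic() + timeout
    while True:
        info = get_info()
        if not info.get("compact_running", False):
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError("compaction still running after timeout")
        sleep(poll_seconds)


def size_change(before, after):
    """Return (bytes_before, bytes_after) taken from the disk_size field."""
    return before.get("disk_size"), after.get("disk_size")
```

A run would record `fetch_db_info(...)` first, call `start_compaction(...)`, then pass `lambda: fetch_db_info(base_url, db)` to `wait_for_compaction` and compare the two sizes with `size_change`. An unchanged `disk_size` by itself does not prove that compaction failed, so the log is still worth checking for the same `badmatch` error.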


On Aug 7, 2012, at 1:11 PM, Robert Newson wrote:

> 
> Did compaction complete, though? I wasn't thinking of reducing the file size, but of being able to successfully read all live data and write it back out again.
> 
> B.
> 
> On 7 Aug 2012, at 21:01, stephen bartell wrote:
> 
>> I'll consider delayed_commits.
>> 
>> The database was 85MB before compaction. We ran compact and it was still 85MB, so compact didn't work.  The same db on other servers will compact to roughly a tenth of its original size.
>> 
>> 
>> 
>> 
>>> I strongly suggest disabling delayed_commits on general principles (what's written should stay written). Are you able to compact the database(s) that give this error?
>>> 
>>> B.
>>> 
>>> On 7 Aug 2012, at 18:42, stephen bartell wrote:
>>> 
>>>> delayed_commits = true
>>>> 
>>>> Stephen Bartell
>>>> 
>>>> On Aug 7, 2012, at 10:39 AM, Robert Newson wrote:
>>>> 
>>>>> Are you running with delayed_commits=true or false?
>>>>> 
>>>>> B.
>>>>> 
>>>>> On 7 Aug 2012, at 18:27, stephen bartell wrote:
>>>>> 
>>>>>> 
>>>>>>> Hi Stephen,
>>>>>>> 
>>>>>>> Can you tell us any more about the context, or did you start seeing these in the logs?
>>>>>> 
>>>>>> Sure, here's some context.  This couch is part of a demo server. It travels a lot and is power-cycled a lot.  There is one physical server; it consists of nginx (serving web apps and reverse proxying for couch), couchdb for persistence, and numerous programs which read and write to couch.  Traffic on couch can get very heavy.
>>>>>> 
>>>>>> I didn't first see this in the logs.  Some of the web apps would grind to a halt, nginx would return 404, and then eventually couch would restart.  This would happen every couple of minutes.
>>>>>> 
>>>>>>> By chance do you have a scenario that reproduces this? Was this db compacted or replicated from elsewhere?
>>>>>> 
>>>>>> I wish I had a pliable scenario other than sending the server through taxi cabs, airlines, and pulling the power cord several times a day.  We haven't seen this on any of our production servers.
>>>>>> This server was not subject to any replication.  Most databases on it are compacted often.
>>>>>> 
>>>>>> Last night we were able to drill down to one particular program which was triggering the crash.  One by one, we backed up, deleted, and rebuilt the databases that program touched.  There was one database which seemed to be the culprit; let's call it History. History is a dumping ground for stale docs from another db. History is almost always written to, and rarely read from.  We don't compact History since all docs in it are one revision deep.  We never replicate to or from it.  The only reason we deem History the culprit is that after rebuilding it, there hasn't been a crash for over 12 hours.
>>>>>> 
>>>>>> I have an additional question.  Is it possible to turn couch logging off entirely, or would redirecting to /dev/null suffice?  When couch would crash, hundreds of MB of crap would get dumped to the log. ( {{badmatch,{ok,<<32,50,48,48,10 … 'hundreds of MB of crap' … ,0,3,232>>}}).  Right when this dump occurred, the cpu spiked and the server began its descent.
>>>>>> 
>>>>>> Best
>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Bob
>>>>>>> On Aug 7, 2012, at 2:06 AM, stephen bartell <snbartell@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi all, could someone help shed some light on this crash I'm having?  I'm on v1.2, Ubuntu 11.04.
>>>>>>>> 
>>>>>>>> [Mon, 06 Aug 2012 18:29:16 GMT] [error] [<0.492.0>] ** Generic server <0.492.0> terminating
>>>>>>>> ** Last message in was {pread_iolist,88385709}
>>>>>>>> ** When Server state == {file,{file_descriptor,prim_file,{#Port<0.2899>,79}},
>>>>>>>>                       93302896}
>>>>>>>> ** Reason for termination == 
>>>>>>>> ** {{badmatch,{ok,<<32,50,48,48,10 … huge dump … ,0,3,232>>}},
>>>>>>>> [{couch_file,read_raw_iolist_int,3},
>>>>>>>> {couch_file,maybe_read_more_iolist,4},
>>>>>>>> {couch_file,handle_call,3},
>>>>>>>> {gen_server,handle_msg,5},
>>>>>>>> {proc_lib,init_p_do_apply,3}]}
>>>>>>>> 
>>>>>>>> I'm not too familiar with Erlang, but what I gathered from the src was that the `pread_iolist` function is used when reading anything from the disk.  So I think this might be a corrupt db problem.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Stephen Bartell
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
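
For reference on the delayed_commits advice quoted above: in CouchDB 1.x the setting lives in the `[couchdb]` section of the configuration (for example `delayed_commits = false` in local.ini), and it can also be changed at runtime through the `/_config` HTTP API. The sketch below only builds the request so it can be inspected before sending. The base URL and the function name are hypothetical, and an admin login may be required on a server that is not in admin-party mode.

```python
import json
import urllib.request


def build_delayed_commits_request(base_url, enabled):
    """Build (but do not send) PUT /_config/couchdb/delayed_commits.

    The 1.x config API expects the new value as a JSON string,
    so the body is "true" or "false" including the quotes.
    """
    body = json.dumps("true" if enabled else "false").encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_config/couchdb/delayed_commits",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned request to `urllib.request.urlopen` would apply the change on a live server. With delayed commits disabled, writes are flushed to disk before they are acknowledged, which trades some write throughput for durability on a machine that loses power often.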

