Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Apple Message framework v1278)
Subject: Re: random couch crash
From: Robert Newson <rnewson@apache.org>
In-Reply-To: <7796036E-36F8-4DE4-921D-182945B065EC@gmail.com>
Date: Tue, 7 Aug 2012 18:39:24 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <411049A3-8676-402E-9B9E-E132EE2ADEA6@apache.org>
References: <CC4795B5-E835-456A-A088-9089FC347BD0@gmail.com>
 <27371E5A-201D-4017-9E3B-4F96093748B0@dionne-associates.com>
 <7796036E-36F8-4DE4-921D-182945B065EC@gmail.com>
To: user@couchdb.apache.org

Are you running with delayed_commits=true or false?

B.
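
For anyone following along, one way to check that setting is to ask the running server through its configuration API. The sketch below is only illustrative: it assumes a CouchDB 1.x node listening on http://127.0.0.1:5984 with no admin password set, and the helper names (config_url, parse_config_value, delayed_commits_enabled) are made up for this example.

```python
import json
from urllib.request import urlopen


def config_url(base_url, section, key):
    """Build the CouchDB 1.x config endpoint URL for one section/key pair."""
    return "{}/_config/{}/{}".format(base_url.rstrip("/"), section, key)


def parse_config_value(body):
    """CouchDB 1.x returns config values as JSON strings, e.g. "true"."""
    return json.loads(body) == "true"


def delayed_commits_enabled(base_url="http://127.0.0.1:5984"):
    """Query the server (assumed reachable without credentials)."""
    url = config_url(base_url, "couchdb", "delayed_commits")
    with urlopen(url) as resp:
        return parse_config_value(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running CouchDB at the assumed address.
    print(delayed_commits_enabled())
```

The same value normally lives under the [couchdb] section of the server's ini files, so reading local.ini and default.ini is an alternative if the server is down.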

On 7 Aug 2012, at 18:27, stephen bartell wrote:

>
>> Hi Stephen,
>>
>> Can you tell us any more about the context, or did you start seeing
>> these in the logs?
>
> Sure, here's some context.  This couch is part of a demo server.  It
> travels a lot and is cycled a lot.  There is one physical server
> running nginx (serving web apps and reverse proxying for couch),
> couchdb for persistence, and numerous programs which read and write to
> couch.  Traffic on couch can get very heavy.
>
> I didn't first see this in the logs.  Some of the web apps would grind
> to a halt, nginx would return 404, and then eventually couch would
> restart.  This would happen every couple of minutes.
>
>> By chance do you have a scenario that reproduces this? Was this db
>> compacted or replicated from elsewhere?
>
> I wish I had a pliable scenario other than sending the server through
> taxi cabs, airlines, and pulling the power cord several times a day.  We
> haven't seen this on any of our production servers.
> This server was not subject to any replication.  Most databases on it
> are compacted often.
>
> Last night we were able to drill down to one particular program which
> was triggering the crash.  One by one, we backed up, deleted, and
> rebuilt the databases that program touched.  There was one database
> which seemed to be the culprit; let's call it History.  History is a
> dumping ground for stale docs from another db.  History is almost always
> written to, and rarely read from.  We don't compact History since all
> docs in it are one revision deep.  We never replicate to or from it.
> The only reason we deem History the culprit is that after rebuilding
> it, there hasn't been a crash for over 12 hours.
>
> I have an additional question.  Is it possible to turn couch logging
> off entirely, or would redirecting to /dev/null suffice?  When couch
> would crash, hundreds of MB of crap would get dumped to the log
> ({{badmatch,{ok,<<32,50,48,48,10 ... 'hundreds of MB of crap' ...
> ,0,3,232>>}}).  Right when this dump occurred, the cpu spiked and the
> server began its downward descent.
>
> Best
>
>>
>> Thanks,
>>
>> Bob
>> On Aug 7, 2012, at 2:06 AM, stephen bartell <snbartell@gmail.com> wrote:
>>
>>> Hi all, could someone help shed some light on this crash I'm
>>> having?  I'm on v1.2, ubuntu 11.04.
>>>
>>> [Mon, 06 Aug 2012 18:29:16 GMT] [error] [<0.492.0>] ** Generic server <0.492.0> terminating
>>> ** Last message in was {pread_iolist,88385709}
>>> ** When Server state == {file,{file_descriptor,prim_file,{#Port<0.2899>,79}},
>>>                            93302896}
>>> ** Reason for termination ==
>>> ** {{badmatch,{ok,<<32,50,48,48,10 ... huge dump ... ,0,3,232>>}},
>>>  [{couch_file,read_raw_iolist_int,3},
>>>   {couch_file,maybe_read_more_iolist,4},
>>>   {couch_file,handle_call,3},
>>>   {gen_server,handle_msg,5},
>>>   {proc_lib,init_p_do_apply,3}]}
>>>
>>> I'm not too familiar with erlang, but what I gathered from the source
>>> is that the `pread_iolist` function is used when reading anything
>>> from disk.  So I think this might be a corrupt db problem.
>>>
>>> Thanks,
>>> Stephen Bartell
>>
>
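
A small aside on the badmatch term quoted above: the byte values shown at either end of the binary can be decoded to see what kind of data came back from the read. The sketch below uses only the eight values that appear in the log (the middle is elided there and is left alone); decode_erlang_bytes is a hypothetical helper written for this note, not part of CouchDB.

```python
def decode_erlang_bytes(text):
    """Turn a comma-separated run of Erlang binary byte values into bytes."""
    return bytes(int(part) for part in text.split(",") if part.strip())


# Values copied from the log line; everything between them is unknown.
leading = decode_erlang_bytes("32,50,48,48,10")
trailing = decode_erlang_bytes("0,3,232")

print(repr(leading))   # b' 200\n'
print(repr(trailing))  # b'\x00\x03\xe8'
```

The leading five bytes are printable ASCII (a space, "200", and a newline), while the trailing three are not. That is only an observation about the visible ends of the dump and does not by itself confirm or rule out file corruption.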