couchdb-user mailing list archives

From James Marca <jma...@translab.its.uci.edu>
Subject Re: couchdb crashes silently
Date Mon, 16 Sep 2013 05:10:24 GMT
On Sun, Sep 15, 2013 at 08:04:27PM +0200, Dave Cottlehuber wrote:
> NIF scheduler issues could be a reasonable suspect;
> 
>  heart: Fri Sep 13 20:59:36 2013: heart-beat time-out, no activity for
> 15 seconds
> 
> 15 seconds is a *long* time however.
> 
> 1.4.0 needs 14B04 or higher I think due to one of our dependencies, so
> I'd suggest reverting back to that & seeing if you are having any
> other issues.
> 
> Also, probably unrelated, why is kernel polling disabled?

Honestly, on my gentoo boxes I just use the ebuild.  I have no idea
why kernel polling is false...it is whatever the default is in the
ebuild I guess.  I have no clue about whether kpoll should be enabled,
so I'm trusting the default.

Since my last email, I reverted back to Erlang R15B03 and it has been
crashing, same issues and symptoms.

I can successfully make it crash pretty much within 10
minutes by firing up the two replication jobs and running a data
processing job.  So that's something at least!

> And also likely unrelated, what sort of boxes are these running on,
> and are your baseline performance / throughput metrics holding up?

Well, the box that is failing is a dual chip, quad core Intel
Xeon E5420 (so 8 cores total), with a measly 8 Gig of RAM (it looked
good when I built the machine years ago...)  I forget the details of
the disks, but it is writing to a 3ware hardware RAID array. 

Otherwise, Linux version 3.6.8-gentoo, gcc version 4.5.4
I haven't done a global update probably in the last 3 months or so,
but the machine is reasonably up to date.

As to your last question about baseline performance metrics...I'm a
researcher, and I've set this up so as to spread out my work on
several machines.  So my baseline performance metric is binary: works
or doesn't work.  A long time ago I was crushing a single couch server
and clogging my network, so I moved to a model where each processing
box has its own couch and I let couch sync the results.  I don't really
measure throughput, as my bottleneck is the data processing step.
This is a good system when it works.

-- 

Tomorrow I will try loading up another server in the middle of an
a<->b<->c type replication, with the same databases, and see if maybe
it is something in my current "b"  machine's configuration, or whether
I can always get CouchDB to crash.
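
For anyone who wants to set up the same kind of a<->b<->c chain, here
is a minimal sketch of the replication requests involved.  It assumes
continuous pull replication started through CouchDB's _replicate
endpoint; the host URLs, the database name "results", and the helper
names are hypothetical placeholders, not my actual setup.

```python
import json
from urllib import request


def chain_replications(hosts, db, continuous=True):
    """Build (post_url, body) pairs for a bidirectional chain such as a<->b<->c.

    hosts: base URLs in chain order (hypothetical placeholders).
    db:    database name present on every host (hypothetical).
    """
    jobs = []
    # Only neighbouring hosts replicate with each other: a<->b and b<->c.
    for left, right in zip(hosts, hosts[1:]):
        for source, target in ((left, right), (right, left)):
            body = {
                "source": f"{source}/{db}",
                "target": f"{target}/{db}",
                "continuous": continuous,
            }
            # Posting to the target's _replicate makes this a pull replication.
            jobs.append((f"{target}/_replicate", body))
    return jobs


def start(jobs):
    """POST each replication request; needs reachable CouchDB servers."""
    for url, body in jobs:
        req = request.Request(
            url,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            print(url, resp.status)
```

With three hosts this produces four requests, two of which land on the
middle "b" machine, which matches the "two replication jobs" running
on the box that crashes.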

After that I will try downgrading to 14B04+, although there isn't an ebuild
for it in Gentoo's portage anymore.

Thanks for the replies.

Regards,
James

> 
> 
> On 15 September 2013 15:59, Robert Newson <rnewson@apache.org> wrote:
> > But, again, R15 is also new enough to have scheduler problems, if that
> > turns out to be your problem then this change should also fail the same
> > way. I trust R14B01 through extensive punishment, and recommend it.
> >
> > B.
> >
> > On 15 Sep 2013, at 04:14, James Marca <jmarca@translab.its.uci.edu> wrote:
> >
> >> eacce
> >


