incubator-couchdb-user mailing list archives

From Andreas Kemkes <a5s...@yahoo.com>
Subject Re: What max_dbs_open value do I need to avoid the checkpoint_commit_failure errors?
Date Thu, 19 Jul 2012 16:17:30 GMT


Does anybody have any advice or comments?

Thanks in advance.


________________________________
 From: Andreas Kemkes <a5sk4s@yahoo.com>
To: "user@couchdb.apache.org" <user@couchdb.apache.org> 
Sent: Monday, July 16, 2012 5:15 PM
Subject: What max_dbs_open value do I need to avoid the checkpoint_commit_failure errors?
 
The current max_dbs_open value is set at 600.

The server is running 112 continuous replications with the following topology:

                 +-->  F001
S(*)  --->  T  --|     ...
                 +-->  F111

(*) S is on a different host
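
For reference, each of these replications is started with a request roughly like the sketch below
(a simplified illustration only: the source host URL and the filter name are placeholders, not the
real values):

    # Simplified sketch of how the 112 continuous replications are set up.
    # The source host URL and the filter name "app/by_franchise" are placeholders.
    import json
    import requests

    TARGET_SERVER = "http://localhost:5984"  # server hosting T and F001..F111

    def start_replication(source, target, filter_name=None):
        """POST a continuous replication request to the _replicate endpoint."""
        body = {"source": source, "target": target, "continuous": True}
        if filter_name:
            body["filter"] = filter_name
        resp = requests.post(
            TARGET_SERVER + "/_replicate",
            data=json.dumps(body),
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()
        return resp.json()

    # S -> T (S lives on a different host)
    start_replication("http://source-host:5984/S", "T")

    # T -> F001 ... F111: one filtered continuous replication per target
    for n in range(1, 112):
        start_replication("T", "F%03d" % n, filter_name="app/by_franchise")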

On the first data change at the source database, the following issue was logged and the replication
between S and T died:

{checkpoint_commit_failure,<<"Target database out of sync. Try to increase max_dbs_open
at the target's server.">>}


One of the filtered replications between T and Fn also died with the same checkpoint_commit_failure
issue 2 seconds later.  I suspect it was the one that let the new document through its filter, but I
cannot verify that.


Upon restarting the replication between S and T, it ran to completion, but several of the filtered
replications died with the same issue as above.  I suspect that all filtered replications that
let the new documents through their filters were affected, but again I cannot verify that.


After starting the failed filtered replications once more, everything ran to completion.

Another change triggers the following issue, yet the replication keeps running and the filtered
replication shows no sign of a problem:

{checkpoint_commit_failure,<<"Error updating the source checkpoint document: conflict">>}


...

[Mon, 16 Jul 2012 23:34:10 GMT] [info] [<0.27578.249>] recording a checkpoint for `S`
-> `T` at source update_seq 169029

...
[Mon, 16 Jul 2012 23:34:17 GMT] [info] [<0.28279.247>] recording a checkpoint for `T`
-> `http://Fx` at source update_seq 52930
...

Subsequent changes at the source do not trigger any other errors in the log files.

Is this last issue related to the previous ones or just coincidental?
Is there a formula that allows me to project the value I need to choose for max_dbs_open?

What is the reason that the value of 600 appears to be too low?
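
For reference, here is a minimal sketch of how we could raise the value at runtime via the 1.x
_config API (the host, the admin credentials, and the new value 5000 are placeholders; the same
setting can be made persistent in local.ini under [couchdb]):

    # Minimal sketch: raising max_dbs_open at runtime through the _config API.
    # Host, admin credentials, and the new value are placeholders.
    import json
    import requests

    COUCH = "http://localhost:5984"

    def set_max_dbs_open(value):
        """PUT the new limit; CouchDB returns the previous value as a JSON string."""
        resp = requests.put(
            COUCH + "/_config/couchdb/max_dbs_open",
            data=json.dumps(str(value)),
            auth=("admin", "password"),
        )
        resp.raise_for_status()
        return resp.json()

    old = set_max_dbs_open(5000)
    print("max_dbs_open raised from %s to 5000" % old)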

I also see a lot of 'GET /llfs/ 200' entries in the logs, probably originating from the 112 replications
- they appear to poll every 5 seconds.

Is there a parameter to adjust this polling interval?  I've looked and couldn't find one, but might
have missed it.

One other thing I noticed: if you start 2 continuous replications, one with 'create_target':
true and the other without the parameter, the replications are treated as different and not recognized
as 'already running'.  In my opinion, since 'create_target' is a no-op when the target database
already exists, they should be recognized as 'already running'.  What happens in the case
of 2 identical replications running?
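
To make that concrete, these are the two kinds of request bodies I have in mind (the source host
URL below is a placeholder):

    # The two replication documents in question: identical except for 'create_target'.
    # The source host URL is a placeholder.
    with_create_target = {
        "source": "http://source-host:5984/S",
        "target": "T",
        "continuous": True,
        "create_target": True,
    }
    without_create_target = {
        "source": "http://source-host:5984/S",
        "target": "T",
        "continuous": True,
    }
    # POSTing both bodies to /_replicate results in two running replications
    # instead of the second one being reported as already running.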


-- Andreas