couchdb-dev mailing list archives

From "Paul Joseph Davis (JIRA)" <>
Subject [jira] Commented: (COUCHDB-968) Duplicated IDs in _all_docs
Date Tue, 30 Nov 2010 19:22:24 GMT


Paul Joseph Davis commented on COUCHDB-968:


Responding to #2 first:

Consider these two orderings of events:

1. Create db1/foo and edit it more than rev_limit times. Its history is now A-B-C.
2. foo is replicated db1 -> db2 History: A-B-C
3. foo is replicated db2 -> db1 History: A-B-C
4. wait 3 seconds then repeat.
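The no-op in steps 2-3 can be sketched in a few lines. This is a toy model in Python, not CouchDB's actual Erlang code; `replicate` is a made-up helper standing in for the replicator, and histories are plain lists of revision ids:

```python
# Toy model: a "database" maps doc id -> revision history (oldest to newest).
# This is NOT CouchDB's implementation, just an illustration of the no-op.
db1 = {"foo": ["A", "B", "C"]}
db2 = {}

def replicate(src, dst, doc_id):
    """Hypothetical helper: copy one doc's history from src to dst.
    If the stored history is identical to the incoming one, nothing is
    written -- this is the no-op described above."""
    incoming = src[doc_id]
    if dst.get(doc_id) == incoming:
        return "no-op"
    dst[doc_id] = list(incoming)
    return "written"

print(replicate(db1, db2, "foo"))   # "written": db2 didn't have foo yet
print(replicate(db2, db1, "foo"))   # "no-op": db1 already has A-B-C
```

Replicating back and forth with no intervening edits just keeps hitting the identical-history check, so nothing bad happens.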

Here, all is hunky dory. Writing foo with an identical revision history results in a no-op,
more or less. The issue comes from this progression:

1. Same as before, history is A-B-C
2. foo is replicated db1 -> db2 History: A-B-C
3. write to db1/foo, History: B-C-D
4. foo is replicated db2 -> db1 History: A-B-C

Here, step four is attempting to merge A-B-C and B-C-D, which results in a history of B-C'-D.
C' is actually the same revision, but with a new doc pointer and high_seq in the doc_info
record. Once this happens, it looks like a write (because NewRevTree == OldTree is false).
This confusion is where the second update_seq is added, and then things start going downhill
as described before.
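A toy model of that faulty comparison (Python, not CouchDB's couch_key_tree code; `Rev`, `merge`, and the seq handling are all invented for illustration): the incoming leaf revision C carries a doc body, so its stored copy picks up a fresh seq, and the merged tree compares unequal to the old one even though the revision ids are identical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rev:
    id: str      # revision id, e.g. "C"
    seq: int     # stands in for the doc pointer / high_seq in the doc_info

def merge(old, incoming, next_seq):
    """Hypothetical merge of an incoming linear history into the stored one.
    The modelled bug: the incoming leaf (C) carries the doc body, so its
    stored copy gets a new seq -- C becomes C' -- even though the rev id
    is unchanged."""
    leaf = incoming[-1]
    return [Rev(r.id, next_seq) if r.id == leaf.id else r for r in old]

old_tree = [Rev("B", 2), Rev("C", 3), Rev("D", 4)]   # db1 after the local edit
incoming = [Rev("A", 1), Rev("B", 1), Rev("C", 1)]   # replicated back from db2
new_tree = merge(old_tree, incoming, next_seq=5)

print([r.id for r in new_tree])   # ['B', 'C', 'D'] -- same revision ids
print(new_tree != old_tree)       # True -- NewRevTree == OldTree is false,
                                  # so it looks like a write and another
                                  # update_seq gets assigned
```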

Tonight I plan on writing a specific test for this behavior without requiring replication
(_bulk_docs interactive_edits=false) to demonstrate that I've got it figured out (or to show
that I've got no idea what's going on).

You'll notice the timing issue is in how the progression of edits is made with respect to
the replication coming back.

Now, as to number 1: what you should see, and what I was seeing, is that db2 has the correct
update_seq you'd expect (N writes means update_seq = N), but db1 has update_seq = N + some_random_number.
That randomness is just in how the actual writes are ordered, but it's greater because of
the history-merge-that-causes-spurious-writes (I'm pretty sure).

Your last point about reversing the order makes perfect sense, because what's happening in
that case is that CouchDB is more or less just doing a normal edit. I.e., a doc with history
A-B-C that gets an edit with history B-C-D is merged and stemmed correctly to B-C-D, and
all is hunky dory. It's of interest to note that your run-of-the-mill everyday PUT with the
previous revision is equivalent to doing A-B-C + C-D, which results in B-C-D.
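The well-behaved path sketched the same way (again a toy model; `apply_edit` and `stem` are hypothetical helpers standing in for stemming at rev_limit, not CouchDB functions):

```python
REV_LIMIT = 3   # stands in for the database's rev_limit setting

def stem(history):
    """Keep only the newest REV_LIMIT revisions of a linear history."""
    return history[-REV_LIMIT:]

def apply_edit(history, parent_rev, new_rev):
    """An ordinary PUT against the current leaf: extend, then stem."""
    assert history[-1] == parent_rev, "conflict: edit is not against the leaf"
    return stem(history + [new_rev])

# A-B-C + C-D merges and stems to B-C-D, exactly as described above.
print(apply_edit(["A", "B", "C"], "C", "D"))   # ['B', 'C', 'D']
```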

I've not yet decided who the real culprit is. I can't point at any of the various places
and say that it's exactly the bug; only that the bug is the interaction of these two bits under
these circumstances. Fixing it could go a number of directions, and I haven't managed to calibrate
my compass for the new timezone just yet.

> Duplicated IDs in _all_docs
> ---------------------------
>                 Key: COUCHDB-968
>                 URL:
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11.1, 0.11.2, 1.0, 1.0.1, 1.0.2
>         Environment: Ubuntu 10.04.
>            Reporter: Sebastian Cohnen
>            Priority: Blocker
> We have a database, which is causing serious trouble with compaction and replication
(huge memory and cpu usage, often causing couchdb to crash b/c all system memory is exhausted).
Yesterday we discovered that db/_all_docs is reporting duplicated IDs (see [1]). Until a few
minutes ago we thought that there are only few duplicates but today I took a closer look and
I found 10 IDs which sum up to a total of 922 duplicates. Some of them have only 1 duplicate,
others have hundreds.
> Some facts about the database in question:
> * ~13k documents, with 3-5k revs each
> * all duplicated documents are in conflict (with 1 up to 14 conflicts)
> * compaction is run on a daily basis
> * several thousands updates per hour
> * multi-master setup with pull replication from each other
> * delayed_commits=false on all nodes
> * used couchdb versions 1.0.0 and 1.0.x (*)
> Unfortunately the database's contents are confidential and I'm not allowed to publish them.
> [1]: Part of http://localhost:5984/DBNAME/_all_docs
> ...
> {"id":"9997","key":"9997","value":{"rev":"6096-603c68c1fa90ac3f56cf53771337ac9f"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> {"id":"9999","key":"9999","value":{"rev":"6097-3c873ccf6875ff3c4e2c6fa264c6a180"}},
> ...
> [*]
> There were two (old) servers (1.0.0) in production (already having the replication and
compaction issues). Then two servers (1.0.x) were added and replication was set up to bring
them in sync with the old production servers since the two new servers were meant to replace
the old ones (to update node.js application code among other things).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
