I was trying to avoid having to perform the merge outside of Couch. Ultimately the problem we're trying to solve is supporting time-based range queries and flow control over a contiguous range for OAI-PMH harvesting, which requires that range requests return consistent results.

Let me describe our application a bit. The attached image illustrates the system design, where the light blue circles are instances of the application services. Nodes are connected to each other in an ad-hoc manner; some pairs may communicate bi-directionally, some may not. Our aim is to use filtered replication via Couch to distribute documents, with CouchDB at each node storing the JSON documents. Since the system is constantly receiving new content, storing a timestamp inside the document is pointless: there's no way to guarantee the consistency of the global content set at any single point in time, and updating the JSON doc with a new timestamp at insert/update (replication) would just cause the document to replicate again, creating a cascade effect. Currently all documents inserted into Couch are considered immutable in our design.

Consider timestamped documents being replicated to a node from other nodes (using just the date for simplicity), as in the attached image, and look at the replication timeline on a single node, where content can arrive at any time.

If at 3:00 I query Node X for all the Orange documents between 4/1 and 4/4, I get one object back. The result changes if I query Node X again at 4:00 for all Orange documents between 4/1 and 4/4. The OAI-PMH protocol requires that the result remain consistent.
Ideally I need to be able to execute a query at, say, 3:00 and again at 6:00 asking for all the Orange documents received between 1:00 and 3:00. Regardless of when I ask, I should always get the same stream of documents received within a specific time slice, matching some specific criteria.

Effectively, what I need is the ability to create a view using the local sequence of the document combined with other document traits (keywords, publisher, etc.). If there's a way to do this, then my problem is solved. I haven't found this ability in Couch, so what we've done is essentially this: on replication/publish, a change listener inserts a 'sidecar document' containing a local timestamp (sequence), which we can then use via document linking or view collation with a reduce to assemble the right result set with a minimal number of transforms before returning it to the consumer.
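
In sketch form the listener is something like the following (a minimal sketch assuming Node.js 18+ with built-in fetch; the Couch URL, db name, and the 'resource_data' doc_type are placeholders, not our actual values, and conflict handling on re-runs is omitted):

// Poll the _changes feed and write a timestamp sidecar for each new resource doc.
const COUCH = 'http://localhost:5984';
const DB = 'resources';

async function processChanges(since) {
  const res = await fetch(
    COUCH + '/' + DB + '/_changes?include_docs=true&since=' + encodeURIComponent(since));
  const changes = await res.json();

  for (const change of changes.results) {
    const doc = change.doc;
    // React only to arriving resource docs, not to sidecars or deletions.
    if (!doc || doc._deleted || doc.doc_type !== 'resource_data') continue;

    // Record the node-local arrival time in a separate sidecar doc; the
    // resource doc itself is never modified, so replication isn't re-triggered.
    await fetch(COUCH + '/' + DB + '/' + doc._id + '-timestamp', {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        doc_type: 'resource_data_timestamp',
        resource_doc_id: doc._id,
        node_timestamp: new Date().toISOString()
      })
    });
  }
  return changes.last_seq; // checkpoint to pass as 'since' next time
}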

Using a document linking method like:

function(doc) {
  // Emit only the timestamp sidecar docs, keyed by the node-local timestamp;
  // the {"_id": ...} value lets ?include_docs=true pull in the linked resource doc.
  if (doc.doc_type == "resource_data_timestamp" && doc.node_timestamp) {
    emit(doc.node_timestamp, { "_id": doc.resource_doc_id });
  }
}

where the resource_data_timestamp doc looks something like:

{
   "_id": "d2018b3fe169426b95e44a5580692d5a-timestamp",
   "_rev": "1-719655db4a1df9b9efcc5edbd62289ed",
   "doc_type": "resource_data_timestamp",
   "resource_doc_id": "d2018b3fe169426b95e44a5580692d5a",
   "doc_version": "0.20.0",
   "node_timestamp": "2011-05-19T22:09:55.704004Z"
}

we can perform a range query against the view by timestamp only and get the original doc via include_docs=true, but we would then have to scan the results to filter out documents that aren't "orange", which in our use case could be millions of records. This method also only lets me join 2 documents together... We're also trying to work out a delete policy, potentially using a tombstone document, which is where the collated key with map and reduce comes into play, since there could be more than one type of "sidecar" document.
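
The collated-key view I have in mind would look roughly like this (a sketch; the 'resource_data' and 'resource_data_tombstone' doc_type names are made up here for illustration):

function(doc) {
  // Collate the resource doc and all of its sidecars under the resource id;
  // the second key element just fixes the sort order within each group.
  if (doc.doc_type == "resource_data") {
    emit([doc._id, 0], null);
  } else if (doc.doc_type == "resource_data_timestamp") {
    emit([doc.resource_doc_id, 1], doc.node_timestamp);
  } else if (doc.doc_type == "resource_data_tombstone") {
    emit([doc.resource_doc_id, 2], null);
  }
}

Querying with startkey=["abc123"]&endkey=["abc123",{}] (or reducing with group_level=1) then brings all the rows for one resource together, and a list function or client-side pass can fold each group into a single object.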

Is this making sense? Really we're looking for any solution that can ensure a consistent range result with some additional filtering, that's relatively practical and, most importantly, can scale.

Any advice would be great.


Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International




On May 20, 2011, at 6:59 AM, Stephen Prater wrote:

Why do you need to reduce those docs? In this particular example, you can do a range query on the key and get (basically) the same results.

Also, I think the output of a reduce function is only allowed to grow at about log(rows) - so reducing the view size by 50% is still going to run up against the limit.

On May 19, 2011, at 9:41 PM, Jim Klo wrote:

I'm a little dumbfounded by reduce functions.

What I'm trying to do is take a view that has heterogeneous values and union them into a single object; logically this seems like what the reduce function should be capable of doing, but I keep getting the reduce overflow error even though I'm effectively reducing the view by 50%.

Consider this simplistic scenario:

doc A: { _id : "abc123", type:"resource", keyword:"nasa" }
doc B: { _id : "abc123-timestamp", type: "timestamp", timestamp: "2011-05-19T12:00:00.0000Z", ref_doc: "abc123" }
doc N: ....

Doc A is the original doc... Doc B is the timestamp doc referencing Doc A via the ref_doc field... Doc N is just another doc also referencing Doc A via the ref_doc field.

I can create a view that essentially looks like:

Key          Value
------------ --------------------------
"abc123"     { .... doc A object .... }
"abc123"     { .... doc B object .... }
"abc123"     { .... doc N object .... }

I would expect I could build a reduced view that looks something like this:

Key          Value
------------ ------------------------
"abc123"     { .... merged doc .... }

Ultimately this goes back to an issue we have: we need the node-local timestamp of a document without generating an event that would update doc A and cause it to replicate again. We figure we can store local data like a timestamp and then join it back with the original doc via a view & list.
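
For the view & list idea, I'm picturing a _list function roughly like this (a sketch that assumes the view above, where each row's value is the doc itself):

function(head, req) {
  // Fold all rows that share a key (i.e. belong to one resource) into a
  // single object, emitting one merged object per resource.
  start({ headers: { "Content-Type": "application/json" } });
  var out = [], row, current = null, currentKey = null;
  while ((row = getRow())) {
    if (row.key !== currentKey) {
      if (current) out.push(current);
      current = {};
      currentKey = row.key;
    }
    for (var k in row.value) {
      current[k] = row.value[k];
    }
  }
  if (current) out.push(current);
  send(JSON.stringify(out));
}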

Is there something magical about the reduce that's not well documented? Or is there maybe a better way to do this? I know about using linked docs, where in the map function you can reference the _id of the linked document in the emitted value and get a 1-to-1 merge with include_docs=true, but I don't think I can do that with N docs; or can I?

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International