couchdb-user mailing list archives

From Jim Klo <jim....@sri.com>
Subject Re: Reduce function to perform a Union?
Date Fri, 20 May 2011 18:11:34 GMT
I was trying to avoid performing the merge outside of Couch. Ultimately, the problem we are
trying to solve is supporting time-based range queries and flow control over a contiguous
range for OAI-PMH harvest support, which requires that range requests return consistent
results.

Let me describe our application a bit. The attached image illustrates the system design, where
the light blue circles are instances of the application services. The nodes are connected to
one another in an ad-hoc manner.  Some nodes may communicate bi-directionally, some may not.
Our aim is to use filtered replication via Couch to distribute documents. At each node, CouchDB
is used to store a JSON document. Since the system is constantly receiving new content, storing
a timestamp with the document is pointless, given that there's no way to guarantee
the consistency of the global content set at any single point in time.  Updating the JSON
doc with a new timestamp at insert/update (replication) would just cause the document to replicate
again, causing a cascade effect. Currently all documents inserted into Couch are considered
immutable in our design.


Consider timestamped documents being replicated to a node from other nodes (using the date
for simplicity), as in the attached image. Look at the replication timeline on a single node,
where the content can be changed by replication at any time.

If at 3:00 I query Node X for all the Orange documents between 4/1 and 4/4, I get one object
back.  The result changes if I query Node X again at 4:00 for all Orange
documents between 4/1 and 4/4. The OAI-PMH protocol requires that the result remain consistent.

Ideally, what I need is to be able to execute a query, say at 3:00 and again at 6:00, where I
ask for all the Orange documents received between 1:00 and 3:00. Regardless of when I ask, I
can always get the stream of documents received during a specific time slice that match some
specific criteria.

Effectively, what I need is to be able to create a view using the local sequence
of the document together with other document traits (keywords, publisher, etc).  If there's a
way to do this, then my problem is solved.  I haven't found this ability in Couch; hence, what
we've done is essentially this: on replication/publish, a change listener inserts a
'sidecar document' containing a local timestamp (sequence), with which we can use document
linking or view collation w/ a reduce to figure out the right result set, with a minimal number
of transforms, to be eventually returned to the consumer.
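
For illustration, here is a minimal sketch of the view-collation variant, with the merge done
over the sorted view rows (for example in a list function or in the client) rather than in a
reduce. The function names and the "resource_data" doc_type for the original document are
hypothetical, and emit is passed in as a parameter only so the sketch can run outside Couch;
in a real view the map function would be anonymous and use the global emit.

```javascript
// Map: emit the resource document and each of its sidecar documents under the
// same leading key element, so view collation sorts them next to each other.
// Assumption: resource documents carry doc_type "resource_data" (hypothetical).
function collationMap(doc, emit) {
  if (doc.doc_type === "resource_data") {
    emit([doc._id, 0], doc);
  } else if (doc.doc_type === "resource_data_timestamp" && doc.resource_doc_id) {
    emit([doc.resource_doc_id, 1], { node_timestamp: doc.node_timestamp });
  }
}

// Merge consecutive rows that share the same resource id into one object.
// Expects rows already sorted by key, as a view returns them.
function mergeCollatedRows(rows) {
  var merged = [];
  var current = null;
  for (var i = 0; i < rows.length; i++) {
    var id = rows[i].key[0];
    if (current === null || current.id !== id) {
      current = { id: id, value: {} };
      merged.push(current);
    }
    var value = rows[i].value;
    for (var name in value) {
      if (Object.prototype.hasOwnProperty.call(value, name)) {
        current.value[name] = value[name];
      }
    }
  }
  return merged;
}
```

Because the merged object is built after the view has been read, it never passes through a
reduce, so its size is not subject to the reduce overflow check.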

Using a document linking method like:

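function(doc) {
	// Index only sidecar timestamp documents, keyed by the node-local timestamp.
	if (doc.doc_type == "resource_data_timestamp" && doc.node_timestamp) {
		// An "_id" in the value lets include_docs=true fetch the linked resource doc.
		emit(doc.node_timestamp, { "_id": doc.resource_doc_id } );
	}
}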

where the resource_data_timestamp doc looks something like:

{
   "_id": "d2018b3fe169426b95e44a5580692d5a-timestamp",
   "_rev": "1-719655db4a1df9b9efcc5edbd62289ed",
   "doc_type": "resource_data_timestamp",
   "resource_doc_id": "d2018b3fe169426b95e44a5580692d5a",
   "doc_version": "0.20.0",
   "node_timestamp": "2011-05-19T22:09:55.704004Z"
}

we can perform a range query against the view by timestamp only and get the original doc via
include_docs=true, but would then have to table scan to filter out documents that aren't "orange",
which could be millions of records in our use case. This method, though, only lets me join 2
documents together. We're also trying to determine a method of handling a delete policy,
potentially using a tombstone document as well, which is where the collated key with map and
reduce comes into play, since there could be more than one type of "sidecar" document.
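
One way the table scan might be avoided, sketched under a loud assumption: the change listener
copies a single filterable trait into the sidecar document when it is created (the "keyword"
field below is hypothetical, as are the function names). The view key then becomes
[trait, timestamp], and a single range query covers one trait over one time slice. As above,
emit is passed in only so the sketch runs outside Couch.

```javascript
// Assumption: the sidecar document carries a copy of one trait in a
// hypothetical "keyword" field, written once by the change listener.
function traitTimestampMap(doc, emit) {
  if (doc.doc_type === "resource_data_timestamp" && doc.node_timestamp && doc.keyword) {
    // Composite key: trait first, then the node-local timestamp.
    emit([doc.keyword, doc.node_timestamp], { "_id": doc.resource_doc_id });
  }
}

// Build the view query string for one trait over one time slice.
// startkey, endkey and include_docs are standard view query parameters.
function rangeQuery(keyword, from, to) {
  return "startkey=" + encodeURIComponent(JSON.stringify([keyword, from])) +
         "&endkey=" + encodeURIComponent(JSON.stringify([keyword, to])) +
         "&include_docs=true";
}
```

For example, rangeQuery("orange", "2011-05-19T01:00:00Z", "2011-05-19T03:00:00Z") asks for the
"orange" documents received in that window. This relies on the timestamps being strings in one
fixed ISO 8601 format, so that they sort in time order. It only handles one trait per view;
filtering on several traits at once would need more keys or more views.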

Is this making sense?  Really, we're looking for any solution that lets us ensure a consistent
range result with some additional filtering, that's relatively practical, and that, most
importantly, can scale.

Any advice would be great.


Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International




On May 20, 2011, at 6:59 AM, Stephen Prater wrote:

> Why do you need to reduce those docs?  In this particular example, you can do a range
> query on the key and get the same (basically) results.
> 
> Also, I think the growth rate for reduce functions is log(rows) - so reducing the view
> size by 50% is still going to run up against the limit.
> 
> On May 19, 2011, at 9:41 PM, Jim Klo wrote:
> 
>> I'm a little dumbfounded by reduce functions.
>> 
>> What I'm trying to do is take a view that has heterogeneous values and union into
>> a single object; logically this seems like what the reduce function should be capable of
>> doing, but it seems I keep getting the reduce overflow error. Effectively I'm reducing the
>> view by 50%.
>> 
>> Consider the simplistic scenario:
>> 
>> doc A: { _id : "abc123", type:"resource", keyword:"nasa" }
>> doc B: { _id : "abc123-timestamp", type: "timestamp", timestamp: "2011/05/19T12:00:00.0000Z", ref_doc: "abc123" }
>> doc N: ....
>> 
>> Doc A is the original doc... Doc B is the timestamp doc referencing Doc A via ref_doc
>> field... Doc N is just another doc also referencing Doc A via ref_doc field.
>> 
>> I can create a view that essentially looks like:
>> 
>> Key				Value
>> ------------		------------------
>> "abc123"		{ .... doc A object .... }
>> "abc123"		{ .... doc B object .... }
>> "abc123"		{ .... doc N object .... }
>> 
>> I would expect I could build a reduced view that looks something like this:
>> 
>> Key				Value
>> ------------		------------------
>> "abc123"		{ .... merged doc .... }
>> 
>> Ultimately this goes back to an issue we have where we need the node local timestamp
>> of a document, without generating an event that would cause an update to doc A, causing it
>> to get replicated. We figure we can store local data like a timestamp then join it back with
>> the original doc via a view & list.
>> 
>> Is there something magical about the reduce that's not well documented? Or maybe
>> is there a better way to do this?  I know about using linked docs, where in the map function
>> you reference the _id of the linked document in the value and get back a 1-1 merge
>> with include_docs=true, but I don't think I can do that with N docs; or can I?
>> 
>> Jim Klo
>> Senior Software Engineer
>> Center for Software Engineering
>> SRI International
>> 
>> 
>> 
>> 
> 

